Standard RAG pipelines often ignore images inside PDFs. RAGLight’s Multimodal Pipeline uses Vision-Language Models (like GPT-4o or Mistral’s Pixtral) to “see” diagrams, charts, and photos inside your documents and index their descriptions.
You need a VLM-capable model (e.g., llava via Ollama or gpt-4o via OpenAI) for this to work effectively.
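If you run locally with Ollama, you can fetch a vision-capable model ahead of time. Here is a minimal sketch using the ollama Python client (the command-line equivalent is `ollama pull llava`); the client is a separate package, not part of RAGLight:

import ollama  # pip install ollama

# Download a vision-capable model if it is not already present locally.
ollama.pull("llava")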
Implementation
from raglight.rag.simple_rag_api import RAGPipeline
from raglight.config.settings import Settings
from raglight.config.rag_config import RAGConfig
# from raglight.processors import VlmPDFProcessor  # only needed if you override processors (see below)
# 1. Enable Multimodal Processing
# This processor extracts images, converts them to Base64,
# and uses the VLM to generate searchable captions.
config = RAGConfig(
    provider=Settings.OLLAMA,
    llm="llava",  # ensure you have run 'ollama pull llava'
    knowledge_base="./technical_manuals",
    # Override the default processors if necessary (depending on your RAGLight version):
    # processor_overrides={"pdf": VlmPDFProcessor},
)
pipeline = RAGPipeline(config)
pipeline.build()
# Now you can ask questions about charts or diagrams!
response = pipeline.generate("Describe the architecture diagram on page 3.")
print(response)
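To make the captioning step concrete, here is a minimal sketch of what a VLM-backed PDF processor does under the hood, assuming PyMuPDF for image extraction and the ollama Python client for captioning. The function name and prompt below are illustrative, not part of RAGLight’s API:

import base64
import fitz  # PyMuPDF: pip install pymupdf
import ollama  # pip install ollama

def caption_pdf_images(pdf_path: str, model: str = "llava") -> list[str]:
    """Illustrative sketch: extract each embedded image and caption it with a VLM."""
    captions = []
    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc, start=1):
        for image_info in page.get_images(full=True):
            xref = image_info[0]  # cross-reference number of the embedded image
            image_bytes = doc.extract_image(xref)["image"]
            # The ollama client accepts base64-encoded image data.
            response = ollama.chat(
                model=model,
                messages=[{
                    "role": "user",
                    "content": "Describe this image for a search index.",
                    "images": [base64.b64encode(image_bytes).decode()],
                }],
            )
            captions.append(f"[page {page_number}] {response['message']['content']}")
    return captions

Captions like these are indexed alongside the ordinary text chunks, which is what lets a retrieval query about a diagram match its description as well as the surrounding prose.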