Standard RAG pipelines often ignore images inside PDFs. RAGLight’s Multimodal Pipeline uses Vision-Language Models (like GPT-4o or Mistral Vision) to “see” diagrams, charts, and photos inside your documents, then indexes the generated descriptions alongside the text.
You need a VLM-capable model (e.g., llava via Ollama or gpt-4o via OpenAI) for this to work effectively.
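Before building the pipeline, it can help to confirm that your VLM can actually describe images. The snippet below is a minimal smoke test, assuming the official ollama Python client and a locally pulled llava model; the image path is just a placeholder.

vlm_smoke_test.py
import ollama  # assumes the official Ollama Python client (pip install ollama)

# Placeholder image path; point this at any diagram or chart on disk.
response = ollama.chat(
    model="llava",
    messages=[
        {
            "role": "user",
            "content": "Describe this image in two sentences for search indexing.",
            "images": ["./sample_diagram.png"],
        }
    ],
)
print(response["message"]["content"])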

Implementation

multimodal_rag.py
from raglight.rag.simple_rag_api import RAGPipeline
from raglight.config.settings import Settings
from raglight.config.rag_config import RAGConfig
# Import only if your installed RAGLight version exposes this processor;
# it is needed solely for the optional override shown below.
# from raglight.processors import VlmPDFProcessor

# 1. Enable Multimodal Processing
# This processor extracts images, converts them to Base64,
# and uses the LLM to generate searchable captions.
config = RAGConfig(
    provider=Settings.OLLAMA,
    llm="llava",  # Ensure you have 'ollama pull llava'
    knowledge_base="./technical_manuals",
    # Optionally override the default PDF processor (check your installed
    # RAGLight version for the exact parameter name):
    # processor_overrides={"pdf": VlmPDFProcessor}
)

pipeline = RAGPipeline(config)
pipeline.build()

# Now you can ask questions about charts or diagrams!
response = pipeline.generate("Describe the architecture diagram on page 3.")
print(response)
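
To make the comments in the config more concrete, here is a rough re-creation of what a VLM-based PDF processor does under the hood: extract each embedded image, Base64-encode it, and ask the model for a searchable caption. This is only an illustrative sketch using PyMuPDF and the Ollama HTTP API; the file path is a placeholder and the actual RAGLight processor may work differently.

caption_images_manually.py
import base64

import fitz  # PyMuPDF, used here only to demonstrate image extraction
import requests

# Placeholder PDF; any document with embedded diagrams will do.
doc = fitz.open("./technical_manuals/manual.pdf")
captions = []

for page in doc:
    for image_info in page.get_images(full=True):
        xref = image_info[0]
        image_bytes = doc.extract_image(xref)["image"]
        b64 = base64.b64encode(image_bytes).decode("utf-8")

        # Ask the local VLM (llava via Ollama) for a searchable caption.
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llava",
                "prompt": "Write a short, searchable caption for this image.",
                "images": [b64],
                "stream": False,
            },
        )
        captions.append((page.number, resp.json()["response"]))

for page_number, caption in captions:
    print(f"page {page_number}: {caption}")

Captions like these are what end up in the vector store, which is why a query about “the architecture diagram on page 3” can be answered even though the diagram itself contains no extractable text.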