Readers & Document Processors

Overview

Readers (called Document Processors in the codebase) define how raw files are read and transformed into documents that can be embedded and stored. In RAGLight, readers are responsible for:
  • understanding file formats (PDF, code, text, …)
  • extracting meaningful content
  • chunking content into embedding-ready pieces
  • optionally extracting structured elements (e.g. code classes)
They operate after loaders and before embeddings:
Loaders → Readers / Processors → Chunks → Embeddings → Vector Store

Design philosophy

RAGLight intentionally separates:
  • where data comes from (loaders)
  • how data is interpreted (readers)
This separation allows you to:
  • reuse the same data source with different parsing strategies
  • override readers without touching the ingestion logic
  • experiment with advanced parsing (VLMs, custom code analysis, etc.)

What is a Document Processor?

A Document Processor takes a file path and returns structured documents. Conceptually, a processor:
  1. reads a file
  2. extracts relevant content
  3. splits it into chunks
  4. returns:
    • document chunks
    • optional structured documents (e.g. classes)
All processors inherit from a common base and expose a unified interface.
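
The four steps above can be sketched as a small class hierarchy. This is a minimal illustration, not RAGLight's actual code: `Document`, `BaseProcessor`, `ParagraphProcessor`, the `process` method name, and the `{"chunks": ..., "classes": ...}` return shape are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class Document:
    """Hypothetical stand-in for the document objects a processor returns."""
    content: str
    metadata: dict = field(default_factory=dict)


class BaseProcessor(ABC):
    """Hypothetical common base: a file path goes in, documents come out."""

    @abstractmethod
    def process(self, file_path: str) -> dict:
        """Return {"chunks": [...], "classes": [...]} for one file."""


class ParagraphProcessor(BaseProcessor):
    """Toy processor that treats each blank-line-separated paragraph as a chunk."""

    def process(self, file_path: str) -> dict:
        # 1. read the file
        text = Path(file_path).read_text(encoding="utf-8")
        # 2-3. extract non-empty paragraphs and use them as chunks
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks = [
            Document(content=p, metadata={"source": file_path, "chunk_index": i})
            for i, p in enumerate(paragraphs)
        ]
        # 4. return chunks plus (here empty) structured documents
        return {"chunks": chunks, "classes": []}


# Usage: process a temporary file with two paragraphs.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("First paragraph.\n\nSecond paragraph.", encoding="utf-8")
    result = ParagraphProcessor().process(str(sample))

print(len(result["chunks"]), "chunks,", len(result["classes"]), "class documents")
```

Because every processor exposes the same entry point in this sketch, the code that calls it does not need to know which file format it is dealing with.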

Processor selection

RAGLight uses a DocumentProcessorFactory to automatically select the right processor based on file extension. This means:
  • you do not need to assign processors manually
  • the ingestion pipeline chooses the correct reader for each file
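
Extension-based selection can be pictured as a dictionary lookup. The sketch below is an assumption about the general technique, not the real `DocumentProcessorFactory`: the `DEFAULT_PROCESSORS` registry and `select_processor` function are hypothetical, the string labels stand in for processor instances, and returning `None` for unknown extensions is an illustrative choice.

```python
from pathlib import Path
from typing import Optional

# Hypothetical registry: extension (no leading dot) -> processor label.
# The extension groups follow the built-in readers described on this page.
DEFAULT_PROCESSORS = {
    "txt": "TextProcessor",
    "md": "TextProcessor",
    "html": "TextProcessor",
    "pdf": "PDFProcessor",
    "py": "CodeProcessor",
    "js": "CodeProcessor",
    "ts": "CodeProcessor",
}


def select_processor(file_path: str, registry: dict = DEFAULT_PROCESSORS) -> Optional[str]:
    """Pick a processor from the file extension, case-insensitively."""
    extension = Path(file_path).suffix.lower().lstrip(".")
    # Unknown or missing extensions yield None in this sketch.
    return registry.get(extension)


for path in ["docs/Report.PDF", "src/app.ts", "README"]:
    print(path, "->", select_processor(path))
```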

Built-in readers

RAGLight ships with built-in processors for common file types.

TextProcessor

Used for plain text formats:
  • .txt
  • .md
  • .html
It extracts raw text and chunks it into overlapping segments suitable for embeddings.
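
To make "overlapping segments" concrete, here is a minimal character-based sliding window. It is a sketch of the general idea only: the `chunk_text` function and its `chunk_size` and `overlap` parameters are hypothetical, and the splitter RAGLight actually uses may work differently (for example by splitting on separators rather than raw character offsets).

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.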

PDFProcessor

Used for .pdf files. The default PDF processor:
  • extracts textual content
  • ignores images and diagrams
  • chunks the extracted text
This is sufficient for text-heavy PDFs such as written documentation. Content that exists only in images or diagrams is not captured.

CodeProcessor

Used for source code files:
  • .py, .js, .ts, .java, .cpp, .cs, …
The code processor:
  • extracts code blocks
  • optionally extracts class or function signatures
  • produces:
    • standard chunks (for semantic search)
    • class documents (stored separately)
This enables code-aware retrieval and class-level search.
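
As an illustration of class-level extraction, the sketch below pulls top-level classes and functions out of Python source with the standard `ast` module. It is Python-only and purely illustrative: the `extract_signatures` function and the shape of its output are assumptions, and the built-in CodeProcessor, which covers several languages, may use a different approach.

```python
import ast

SAMPLE = '''
class Greeter:
    def hello(self):
        return "hi"

def main():
    Greeter().hello()
'''


def extract_signatures(source: str) -> list[dict]:
    """List top-level classes (with their method names) and functions."""
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    signatures = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body if isinstance(n, function_types)]
            signatures.append({"kind": "class", "name": node.name, "methods": methods})
        elif isinstance(node, function_types):
            signatures.append({"kind": "function", "name": node.name, "methods": []})
    return signatures


for signature in extract_signatures(SAMPLE):
    print(signature)
```

Each class entry could then be stored as its own document, separate from the regular chunks, so a query can match a class by name and methods rather than by a fragment of its body.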

Metadata handling

Each chunk includes metadata such as:
  • source path
  • file type
  • line numbers or identifiers (when available)
RAGLight keeps all metadata JSON-safe by flattening complex values before storage: lists and dicts are converted to strings.
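
A flattening step of this kind might look like the following. This is a sketch under assumptions: the `flatten_metadata` helper is hypothetical, and serializing with `json.dumps` is one possible choice; RAGLight may convert values to strings in another way.

```python
import json


def flatten_metadata(metadata: dict) -> dict:
    """Keep scalar values as-is and turn everything else into a string."""
    flat = {}
    for key, value in metadata.items():
        if value is None or isinstance(value, (str, int, float, bool)):
            flat[key] = value
        elif isinstance(value, (list, tuple, dict)):
            # default=str covers nested values json cannot encode natively
            flat[key] = json.dumps(value, default=str)
        else:
            flat[key] = str(value)
    return flat


raw = {"source": "src/app.py", "lines": [1, 20], "tags": {"lang": "python"}}
print(flatten_metadata(raw))
```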

Overriding default readers

Readers can be overridden per file extension through the custom_processors argument, passed when building a vector store. The argument is a dictionary that maps an extension to a processor instance; extensions without an entry keep their default reader.

Example: Custom PDF reader with a VLM

A common advanced use case is processing PDFs with diagrams or images using a Vision-Language Model (VLM).

from raglight.document_processing.vlm_pdf_processor import VlmPDFProcessor
from raglight.llm.ollama_model import OllamaModel
from raglight.rag.builder import Builder
from raglight.config.settings import Settings

# Vision-language model used by the PDF reader
vlm = OllamaModel(
    model_name="ministral-3:3b",
    system_prompt="You are a technical documentation visual assistant.",
)

# Override the reader for the "pdf" extension only
custom_processors = {
    "pdf": VlmPDFProcessor(vlm)
}

# Build the vector store and pass the overrides alongside its settings
vector_store = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=Settings.DEFAULT_EMBEDDINGS_MODEL)
    .with_vector_store(
        Settings.CHROMA,
        persist_directory="./defaultDb",
        collection_name="default",
        custom_processors=custom_processors,
    )
    .build_vector_store()
)

# Ingest the files under ./data
vector_store.ingest(data_path="./data")

With this setup:
  • PDFs are processed using the VLM-based reader
  • other file types keep their default processors
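
The "override one extension, keep the rest" behavior can be pictured as a dictionary merge. The sketch below is an assumption about how such a merge could work, not RAGLight's implementation: `resolve_processors` is a hypothetical helper and the string labels stand in for processor instances.

```python
from typing import Optional


def resolve_processors(defaults: dict, custom_processors: Optional[dict] = None) -> dict:
    """Return the defaults with any user-supplied overrides applied on top."""
    resolved = dict(defaults)  # copy so the defaults are never mutated
    for extension, processor in (custom_processors or {}).items():
        # Normalize keys so "pdf", ".pdf" and "PDF" address the same entry.
        resolved[extension.lower().lstrip(".")] = processor
    return resolved


defaults = {"pdf": "PDFProcessor", "txt": "TextProcessor", "py": "CodeProcessor"}
print(resolve_processors(defaults, {"pdf": "VlmPDFProcessor"}))
```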

Readers and vector stores

Readers are vector-store agnostic. They:
  • produce Document objects
  • do not know where embeddings are stored
  • do not perform retrieval
This allows the same readers to work with any current or future vector store backend.

When to customize readers

You may want to override readers when:
  • PDFs contain diagrams or screenshots
  • code requires custom parsing rules
  • domain-specific formats need special handling
Readers are the right extension point whenever the change concerns how content is interpreted, rather than where the data comes from or where embeddings are stored.

Summary

  • Readers define how files are parsed and chunked
  • They run after loaders and before embeddings
  • RAGLight includes built-in readers for text, PDFs, and code
  • Readers can be overridden per extension
  • Advanced readers (e.g. VLM-based) unlock multimodal RAG