Readers & Document Processors

Overview

Readers (called Document Processors in the codebase) define how raw files are read and transformed into documents that can be embedded and stored. In RAGLight, readers are responsible for:

understanding file formats (PDF, code, text, …)
extracting meaningful content
chunking content into embedding-ready pieces
optionally extracting structured elements (e.g. code classes)

They operate after loaders and before embeddings:

Loaders → Readers / Processors → Chunks → Embeddings → Vector Store

Design philosophy

RAGLight intentionally separates:

where data comes from (loaders)
how data is interpreted (readers)

This separation allows you to:

reuse the same data source with different parsing strategies
override readers without touching the ingestion logic
experiment with advanced parsing (VLMs, custom code analysis, etc.)

What is a Document Processor?

A Document Processor takes a file path and returns structured documents. Conceptually, a processor:

reads a file
extracts relevant content
splits it into chunks
returns:
- document chunks
- optional structured documents (e.g. classes)

All processors inherit from a common base and expose a unified interface.
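The contract described above can be sketched as follows. This is a minimal illustration, not RAGLight's actual base class: the names `Chunk`, `BaseProcessor`, `WholeFileProcessor`, the `process` method, and the `"chunks"` / `"classes"` keys are all assumptions made for the example.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Chunk:
    """Hypothetical stand-in for a document chunk: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


class BaseProcessor(ABC):
    """Hypothetical common base; every processor exposes the same method."""

    @abstractmethod
    def process(self, file_path: str) -> dict:
        """Return {"chunks": [...], "classes": [...]} for one file."""


class WholeFileProcessor(BaseProcessor):
    """Toy processor: reads a text file and returns it as a single chunk."""

    def process(self, file_path: str) -> dict:
        text = Path(file_path).read_text(encoding="utf-8")
        chunk = Chunk(text=text, metadata={"source": file_path})
        # No structured elements are extracted, so "classes" stays empty.
        return {"chunks": [chunk], "classes": []}


# Usage: write a small file, then process it.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "note.txt"
    path.write_text("hello readers", encoding="utf-8")
    result = WholeFileProcessor().process(str(path))
```

Because every processor returns the same shape, the ingestion pipeline can treat them interchangeably.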

Processor selection

RAGLight uses a DocumentProcessorFactory to automatically select the right processor based on file extension. This means:

you do not manually assign processors
the ingestion pipeline chooses the correct reader for each file
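To make extension-based selection concrete, here is a hedged sketch of how such a factory could work. It does not reproduce RAGLight's DocumentProcessorFactory: the class `ProcessorFactory`, its `get_processor` method, and the placeholder processor classes are hypothetical stand-ins.

```python
from pathlib import Path


class TextProcessor:
    """Placeholder standing in for a plain-text reader."""


class PDFProcessor:
    """Placeholder standing in for a PDF reader."""


class CodeProcessor:
    """Placeholder standing in for a source-code reader."""


class ProcessorFactory:
    """Illustrative lookup table from file extension to processor."""

    def __init__(self):
        text, pdf, code = TextProcessor(), PDFProcessor(), CodeProcessor()
        self._by_extension = {
            "txt": text, "md": text, "html": text,
            "pdf": pdf,
            "py": code, "js": code, "ts": code,
        }

    def get_processor(self, file_path: str):
        # Normalise ".MD" -> "md" so the lookup is case-insensitive.
        extension = Path(file_path).suffix.lower().lstrip(".")
        # Returns None for unsupported extensions.
        return self._by_extension.get(extension)
```

With this shape, the pipeline only has to call `get_processor(path)` for each file and skip the ones that return `None`.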

Built-in readers

RAGLight ships with built-in processors for common file types.

TextProcessor

Used for plain text formats:

.txt
.md
.html

It extracts raw text and chunks it into overlapping segments suitable for embeddings.
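As an illustration of overlapping chunking, the sketch below splits text into fixed-size character windows. The function name `chunk_text` and its parameters are assumptions for this example; RAGLight's actual splitter and its defaults may differ.

```python
def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list:
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence cut at a chunk boundary still appears intact in the neighbouring chunk, which tends to help retrieval.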

PDFProcessor

Used for .pdf files. The default PDF processor:

extracts textual content
ignores images and diagrams
chunks the extracted text

This is sufficient for text-heavy PDFs such as documentation. Information carried only by images or diagrams is not captured.

CodeProcessor

Used for source code files:

.py, .js, .ts, .java, .cpp, .cs, …

The code processor:

extracts code blocks
optionally extracts class or function signatures
produces:
- standard chunks (for semantic search)
- class documents (stored separately)

This enables code-aware retrieval and class-level search.
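For Python sources, class and function signatures can be located with the standard library's `ast` module. The sketch below is one possible approach under that assumption, not RAGLight's CodeProcessor; the function `extract_signatures` and its output format are hypothetical.

```python
import ast


def extract_signatures(source: str) -> list:
    """Return the classes and functions defined in a Python source string."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            kind = "class"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        else:
            continue
        found.append({"kind": kind, "name": node.name, "line": node.lineno})
    # ast.walk does not guarantee source order, so sort by line number.
    return sorted(found, key=lambda item: item["line"])


sample = (
    "class Store:\n"
    "    def add(self, item):\n"
    "        pass\n"
    "\n"
    "def main():\n"
    "    pass\n"
)
signatures = extract_signatures(sample)
```

Entries like these could be stored as separate class documents alongside the regular chunks, with the line number kept as metadata.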

Metadata handling

Each chunk includes metadata such as:

source path
file type
line numbers or identifiers (when available)

RAGLight keeps all metadata JSON-safe by converting complex values (lists, dicts) to strings before storage.
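A minimal sketch of this kind of flattening is shown below. The helper `flatten_metadata` is hypothetical, and serialising lists and dicts with `json.dumps` is an assumption; RAGLight may use a different string representation.

```python
import json


def flatten_metadata(metadata: dict) -> dict:
    """Convert non-scalar metadata values to strings."""
    flat = {}
    for key, value in metadata.items():
        if value is None or isinstance(value, (str, int, float, bool)):
            flat[key] = value  # already a scalar, keep as-is
        elif isinstance(value, (list, dict)):
            # Assumption: complex values are stored as JSON strings.
            flat[key] = json.dumps(value, default=str)
        else:
            flat[key] = str(value)  # fall back for any other object
    return flat


flat = flatten_metadata({
    "source": "src/store.py",
    "tags": ["code", "python"],
    "span": {"start": 1, "end": 12},
})
```

Storing JSON strings has the advantage that the original structure can be recovered later with `json.loads`.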

Overriding default readers

You can override the reader used for any file extension. To do so, pass a custom_processors mapping (file extension → processor instance) when building a vector store.

Example: Custom PDF reader with a VLM

A common advanced use case is processing PDFs with diagrams or images using a Vision-Language Model (VLM).

from raglight.document_processing.vlm_pdf_processor import VlmPDFProcessor
from raglight.llm.ollama_model import OllamaModel
from raglight.rag.builder import Builder
from raglight.config.settings import Settings

# Vision-language model used by the PDF reader for images and diagrams.
vlm = OllamaModel(
    model_name="ministral-3:3b",
    system_prompt="You are a technical documentation visual assistant.",
)

# Map file extensions to the processors that replace the defaults.
custom_processors = {
    "pdf": VlmPDFProcessor(vlm)
}

# Build a Chroma-backed vector store that uses the custom PDF processor.
vector_store = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=Settings.DEFAULT_EMBEDDINGS_MODEL)
    .with_vector_store(
        Settings.CHROMA,
        persist_directory="./defaultDb",
        collection_name="default",
        custom_processors=custom_processors,
    )
    .build_vector_store()
)

# Ingest the files found under ./data.
vector_store.ingest(data_path="./data")

With this setup:

PDFs are processed using the VLM-based reader
other file types keep their default processors
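The override behaviour can be pictured as a merge of two mappings, where custom entries win and everything else falls through to the defaults. The sketch below is illustrative only: `resolve_processors` is a hypothetical helper, the strings stand in for processor instances, and the key normalisation (case and leading dot) is an assumption rather than documented RAGLight behaviour.

```python
def normalize_extension(extension: str) -> str:
    """Assumed normalisation: '.PDF' and 'pdf' refer to the same extension."""
    return extension.lower().lstrip(".")


def resolve_processors(defaults: dict, custom: dict = None) -> dict:
    """Merge custom processors over the defaults, keyed by extension."""
    resolved = {normalize_extension(ext): proc for ext, proc in defaults.items()}
    for ext, proc in (custom or {}).items():
        resolved[normalize_extension(ext)] = proc  # custom entry wins
    return resolved


defaults = {"pdf": "default-pdf-reader", "txt": "default-text-reader"}
resolved = resolve_processors(defaults, {".PDF": "vlm-pdf-reader"})
# resolved["pdf"] is now the VLM reader; resolved["txt"] is unchanged.
```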

Readers and vector stores

Readers are vector-store agnostic. They:

produce Document objects
do not know where embeddings are stored
do not perform retrieval

This allows the same readers to work with any current or future vector store backend.

When to customize readers

You may want to override readers when:

PDFs contain diagrams or screenshots
code requires custom parsing rules
domain-specific formats need special handling

Readers are the right extension point whenever the change concerns how content is interpreted rather than where it comes from.

Summary

Readers define how files are parsed and chunked
They run after loaders and before embeddings
RAGLight includes built-in readers for text, PDFs, and code
Readers can be overridden per extension
Advanced readers (e.g. VLM-based) unlock multimodal RAG
