Readers & Document Processors
Overview
Readers (called Document Processors in the codebase) define how raw files are read and transformed into documents that can be embedded and stored. In RAGLight, readers are responsible for:
- understanding file formats (PDF, code, text, …)
- extracting meaningful content
- chunking content into embedding-ready pieces
- optionally extracting structured elements (e.g. code classes)
Design philosophy
RAGLight intentionally separates:
- where data comes from (loaders)
- how data is interpreted (readers)
This separation lets you:
- reuse the same data source with different parsing strategies
- override readers without touching the ingestion logic
- experiment with advanced parsing (VLMs, custom code analysis, etc.)
What is a Document Processor?
A Document Processor takes a file path and returns structured documents. Conceptually, a processor:
- reads a file
- extracts relevant content
- splits it into chunks
- returns:
  - document chunks
  - optional structured documents (e.g. classes)
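The contract described above can be sketched as a small interface. This is a minimal illustration, not RAGLight's actual class definitions: the `Document` dataclass, the `process` method name, the `chunk_size` parameter, and the shape of the returned dictionary are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Protocol


@dataclass
class Document:
    """Hypothetical stand-in for a document chunk with metadata."""
    content: str
    metadata: dict = field(default_factory=dict)


class Processor(Protocol):
    """Hypothetical processor contract: file path in, documents out."""

    def process(self, file_path: str, chunk_size: int) -> Dict[str, List[Document]]:
        ...


class PlainTextSketchProcessor:
    """Reads a text file and splits it into fixed-size character chunks."""

    def process(self, file_path: str, chunk_size: int = 500) -> Dict[str, List[Document]]:
        text = Path(file_path).read_text(encoding="utf-8")
        chunks = [
            Document(content=text[i:i + chunk_size], metadata={"source": file_path})
            for i in range(0, len(text), chunk_size)
        ]
        # "classes" stays empty: plain text has no structured elements to extract.
        return {"chunks": chunks, "classes": []}
```

A real reader would typically split on semantic boundaries rather than fixed character offsets; the fixed-size split only keeps the sketch short.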
Processor selection
RAGLight uses a DocumentProcessorFactory to automatically select the right processor based on file extension.
This means:
- you do not manually assign processors
- the ingestion pipeline chooses the correct reader for each file
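Extension-based selection can be pictured as a lookup table keyed by file suffix. The sketch below is illustrative only: `SketchProcessorFactory`, `get_processor`, and the extension mapping are hypothetical, and the real DocumentProcessorFactory may be organized differently.

```python
from pathlib import Path
from typing import Optional


class SketchProcessorFactory:
    """Hypothetical factory: maps a file extension to a processor name."""

    def __init__(self) -> None:
        # Assumed mapping, for illustration only.
        self._by_extension = {
            "txt": "TextProcessor",
            "md": "TextProcessor",
            "html": "TextProcessor",
            "pdf": "PDFProcessor",
            "py": "CodeProcessor",
            "js": "CodeProcessor",
        }

    def get_processor(self, file_path: str) -> Optional[str]:
        # Normalize ".PDF" to "pdf" so selection is case-insensitive.
        extension = Path(file_path).suffix.lower().lstrip(".")
        # Returns None when no processor is registered for the extension.
        return self._by_extension.get(extension)


factory = SketchProcessorFactory()
for name in ["report.PDF", "notes.md", "main.py", "archive.zip"]:
    print(name, "->", factory.get_processor(name))
```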
Built-in readers
RAGLight ships with built-in processors for common file types.
TextProcessor
Used for plain text formats: .txt, .md, .html
PDFProcessor
Used for .pdf files.
The default PDF processor:
- extracts textual content
- ignores images and diagrams
- chunks the extracted text
CodeProcessor
Used for source code files: .py, .js, .ts, .java, .cpp, .cs, …
The code processor:
- extracts code blocks
- optionally extracts class or function signatures
- produces:
  - standard chunks (for semantic search)
  - class documents (stored separately)
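To make the idea of signature extraction concrete, here is a rough sketch for Python sources using the standard-library `ast` module. It is not RAGLight's implementation: the function name `extract_class_signatures` and the output format are assumptions chosen for illustration, and only top-level classes and positional parameters are handled.

```python
import ast
from typing import List


def extract_class_signatures(source: str) -> List[str]:
    """Return one signature string per top-level class, listing its methods."""
    signatures = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            methods = [
                # Only positional parameters are listed, to keep the sketch short.
                f"{item.name}({', '.join(arg.arg for arg in item.args.args)})"
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
            signatures.append(f"class {node.name}: " + "; ".join(methods))
    return signatures


sample = '''
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self, punctuation):
        return "Hello " + self.name + punctuation
'''

print(extract_class_signatures(sample))
# ['class Greeter: __init__(self, name); greet(self, punctuation)']
```

Storing compact signatures like these separately from the raw chunks gives retrieval a structural view of the code in addition to the semantic one.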
Metadata handling
Each chunk includes metadata such as:
- source path
- file type
- line numbers or identifiers (when available)
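As an illustration of how such metadata might be attached, the following sketch splits text into line-based chunks and records where each chunk came from. The function name and the metadata keys (`source`, `file_type`, `start_line`, `end_line`) are assumptions for the example, not RAGLight's actual schema.

```python
from pathlib import Path
from typing import List


def chunk_with_line_metadata(text: str, source: str, lines_per_chunk: int = 50) -> List[dict]:
    """Split text into line-based chunks, each tagged with its origin."""
    lines = text.splitlines()
    file_type = Path(source).suffix.lower().lstrip(".")
    chunks = []
    for start in range(0, len(lines), lines_per_chunk):
        block = lines[start:start + lines_per_chunk]
        chunks.append({
            "content": "\n".join(block),
            "metadata": {
                "source": source,
                "file_type": file_type,
                "start_line": start + 1,         # 1-based, inclusive
                "end_line": start + len(block),  # 1-based, inclusive
            },
        })
    return chunks
```

Keeping line ranges alongside each chunk makes it possible to point a user back to the exact location in the original file when a chunk is retrieved.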
Overriding default readers
One of RAGLight’s strengths is the ability to override readers per file extension. This is done via the custom_processors argument when building a vector store.
Example: Custom PDF reader with a VLM
A common advanced use case is processing PDFs with diagrams or images using a Vision-Language Model (VLM). With such an override in place:
- PDFs are processed using the VLM-based reader
- other file types keep their default processors
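The override behavior can be sketched as a dictionary merge in which custom entries win over defaults for the same extension. This is a self-contained illustration of the mechanism, not RAGLight's API: the reader classes, the `resolve_processors` helper, and the use of bare extensions such as "pdf" as keys are all hypothetical. In RAGLight itself the mapping is supplied through the custom_processors argument.

```python
from typing import Optional


class DefaultPdfReader:
    def describe(self) -> str:
        return "text-only PDF extraction"


class DefaultTextReader:
    def describe(self) -> str:
        return "plain text extraction"


class VlmPdfReader:
    """Hypothetical reader that would also describe images using a VLM."""

    def describe(self) -> str:
        return "PDF extraction with VLM image descriptions"


def resolve_processors(defaults: dict, custom_processors: Optional[dict] = None) -> dict:
    """Custom entries replace defaults for the same extension; others are kept."""
    return {**defaults, **(custom_processors or {})}


defaults = {"pdf": DefaultPdfReader(), "txt": DefaultTextReader()}
active = resolve_processors(defaults, custom_processors={"pdf": VlmPdfReader()})

for extension, reader in active.items():
    print(extension, "->", reader.describe())
```

Because the merge builds a new dictionary, the default mapping is left untouched and can be reused for other vector stores that do not need the override.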
Readers and vector stores
Readers are vector-store agnostic. They:
- produce Document objects
- do not know where embeddings are stored
- do not perform retrieval
When to customize readers
You may want to override readers when:
- PDFs contain diagrams or screenshots
- code requires custom parsing rules
- domain-specific formats need special handling
Summary
- Readers define how files are parsed and chunked
- They run after loaders and before embeddings
- RAGLight includes built-in readers for text, PDFs, and code
- Readers can be overridden per extension
- Advanced readers (e.g. VLM-based) unlock multimodal RAG