# Readers & Document Processors

> How RAGLight reads, parses, and chunks files before embedding.

## Overview

**Readers** (called *Document Processors* in the codebase) define **how raw files are read and transformed into documents** that can be embedded and stored.

In RAGLight, readers are responsible for:

* understanding file formats (PDF, code, text, …)
* extracting meaningful content
* chunking content into embedding-ready pieces
* optionally extracting structured elements (e.g. code classes)

They operate **after loaders** and **before embeddings**:

```
Loaders → Readers / Processors → Chunks → Embeddings → Vector Store
```

***

## Design philosophy

RAGLight intentionally separates:

* **where data comes from** (loaders)
* **how data is interpreted** (readers)

This separation allows you to:

* reuse the same data source with different parsing strategies
* override readers without touching the ingestion logic
* experiment with advanced parsing (VLMs, custom code analysis, etc.)

***

## What is a Document Processor?

A **Document Processor** takes a file path and returns structured documents.

Conceptually, a processor:

1. reads a file
2. extracts relevant content
3. splits it into chunks
4. returns:

   * document chunks
   * optional structured documents (e.g. classes)

All processors inherit from a common base and expose a unified interface.
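
The four steps above can be sketched as a small class hierarchy. This is a minimal illustration, not RAGLight's actual code: `Chunk`, `BaseProcessor`, `ParagraphProcessor`, the `process` signature, and the `{"chunks": ..., "classes": ...}` return shape are all hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Chunk:
    """Hypothetical stand-in for the document objects a processor returns."""
    text: str
    metadata: dict = field(default_factory=dict)


class BaseProcessor(ABC):
    """Hypothetical common base; the real RAGLight interface may differ."""

    @abstractmethod
    def process(self, file_path: str) -> dict:
        """Return {"chunks": [...], "classes": [...]} for one file."""


class ParagraphProcessor(BaseProcessor):
    """Toy processor: one chunk per blank-line-separated paragraph."""

    def process(self, file_path: str) -> dict:
        # 1. read the file
        text = Path(file_path).read_text(encoding="utf-8")
        # 2. extract relevant content (here: non-empty paragraphs)
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        # 3. split into chunks, attaching basic metadata
        chunks = [
            Chunk(text=p, metadata={"source": file_path, "index": i})
            for i, p in enumerate(paragraphs)
        ]
        # 4. return chunks plus (here empty) structured documents
        return {"chunks": chunks, "classes": []}
```

Because every processor exposes the same method, the rest of the pipeline can call `process` without knowing which file format it is dealing with.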

***

## Processor selection

RAGLight uses a `DocumentProcessorFactory` to automatically select the right processor based on file extension.

This means:

* you do not manually assign processors
* the ingestion pipeline chooses the correct reader for each file
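
Extension-based selection can be pictured as a simple lookup. The sketch below is an assumption about the general technique, not the implementation of `DocumentProcessorFactory`: the `READERS_BY_EXTENSION` table and the `select_reader` function are hypothetical, and reader names are plain strings to keep the example self-contained.

```python
from pathlib import Path

# Hypothetical extension-to-reader table. The reader names mirror the
# built-in readers described on this page; the mapping is illustrative.
READERS_BY_EXTENSION = {
    "txt": "TextProcessor",
    "md": "TextProcessor",
    "html": "TextProcessor",
    "pdf": "PDFProcessor",
    "py": "CodeProcessor",
    "js": "CodeProcessor",
}


def select_reader(file_path: str):
    """Return the reader name for a file, or None if the extension is unknown."""
    # Normalize ".MD" -> "md" so matching is case-insensitive.
    extension = Path(file_path).suffix.lower().lstrip(".")
    return READERS_BY_EXTENSION.get(extension)
```

Returning `None` for unknown extensions lets the caller decide whether to skip the file or raise an error.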

***

## Built-in readers

RAGLight ships with built-in processors for common file types.

### TextProcessor

Used for plain text formats:

* `.txt`
* `.md`
* `.html`

It extracts the raw text and splits it into overlapping chunks suitable for embedding.
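
To make "overlapping" concrete, here is a minimal fixed-width character splitter. It is a sketch under stated assumptions: the `chunk_text` function and its `chunk_size` / `overlap` parameters are hypothetical, and RAGLight's actual splitting strategy may differ (for example, by splitting on separators rather than raw character offsets).

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=4` and `overlap=2`, the string `"abcdefghij"` yields `["abcd", "cdef", "efgh", "ghij"]`: each chunk repeats the last two characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.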

***

### PDFProcessor

Used for `.pdf` files.

The default PDF processor:

* extracts textual content
* ignores images and diagrams
* chunks the extracted text

This is sufficient for many documentation-style PDFs.

***

### CodeProcessor

Used for source code files:

* `.py`, `.js`, `.ts`, `.java`, `.cpp`, `.cs`, …

The code processor:

* extracts code blocks
* optionally extracts class or function signatures
* produces:

  * standard chunks (for semantic search)
  * class documents (stored separately)

This enables code-aware retrieval and class-level search.
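
As an illustration of what a "class document" might contain, the sketch below pulls class headers and method names out of Python source with the standard `ast` module. It is an assumption for illustration only: the `extract_class_signatures` function is hypothetical, it handles Python alone, and RAGLight's own extraction across languages may work differently.

```python
import ast


def extract_class_signatures(source: str) -> list[str]:
    """Return one summary string per class: its header plus its method names."""
    tree = ast.parse(source)
    signatures = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        # Rebuild the class header, including base classes if any.
        bases = ", ".join(ast.unparse(base) for base in node.bases)
        header = f"class {node.name}({bases})" if bases else f"class {node.name}"
        # Collect only methods defined directly in the class body.
        methods = [
            item.name
            for item in node.body
            if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
        ]
        signatures.append(f"{header}: {', '.join(methods)}")
    return signatures
```

Storing such summaries separately from the regular chunks lets a query about a class match a compact description rather than an arbitrary slice of the file.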

***

## Metadata handling

Each chunk includes metadata such as:

* source path
* file type
* line numbers or identifiers (when available)

RAGLight keeps all metadata JSON-safe by converting complex values
(lists, dicts) to strings before storage.
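
A minimal sketch of that conversion, assuming JSON serialization is used for the string form. The `flatten_metadata` function is hypothetical, and RAGLight may produce a different string representation.

```python
import json


def flatten_metadata(metadata: dict) -> dict:
    """Return a copy of metadata whose values are all scalars or strings."""
    flat = {}
    for key, value in metadata.items():
        if value is None or isinstance(value, (str, int, float, bool)):
            # Scalars are already safe to store as-is.
            flat[key] = value
        elif isinstance(value, (list, tuple, dict)):
            # Containers become JSON strings; default=str covers nested
            # values that json cannot serialize natively.
            flat[key] = json.dumps(value, default=str)
        else:
            # Anything else (paths, datetimes, ...) falls back to str().
            flat[key] = str(value)
    return flat
```

Using JSON for the string form means the original list or dict can be recovered later with `json.loads` when needed.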

***

## Overriding default readers

One of RAGLight’s strengths is the ability to **override readers per file extension**.

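This is done via the `custom_processors` argument when building a vector store: a dictionary that maps a file extension (without the leading dot, e.g. `"pdf"`) to a processor instance.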

***

## Example: Custom PDF reader with a VLM

A common advanced use case is processing PDFs with diagrams or images using a
**Vision-Language Model (VLM)**.

```python
from raglight.document_processing.vlm_pdf_processor import VlmPDFProcessor
from raglight.llm.ollama_model import OllamaModel
from raglight.rag.builder import Builder
from raglight.config.settings import Settings

vlm = OllamaModel(
    model_name="ministral-3:3b",
    system_prompt="You are a technical documentation visual assistant.",
)

custom_processors = {
    "pdf": VlmPDFProcessor(vlm)
}

vector_store = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=Settings.DEFAULT_EMBEDDINGS_MODEL)
    .with_vector_store(
        Settings.CHROMA,
        persist_directory="./defaultDb",
        collection_name="default",
        custom_processors=custom_processors,
    )
    .build_vector_store()
)

vector_store.ingest(data_path="./data")
```

With this setup:

* PDFs are processed using the VLM-based reader
* other file types keep their default processors

***

## Readers and vector stores

Readers are **vector-store agnostic**.

They:

* produce `Document` objects
* do not know where embeddings are stored
* do not perform retrieval

This allows the same readers to work with any current or future vector store backend.

***

## When to customize readers

You may want to override readers when:

* PDFs contain diagrams or screenshots
* code requires custom parsing rules
* domain-specific formats need special handling

Readers are the correct extension point for **content understanding**.

***

## Summary

* Readers define how files are parsed and chunked
* They run after loaders and before embeddings
* RAGLight includes built-in readers for text, PDFs, and code
* Readers can be overridden per extension
* Advanced readers (e.g. VLM-based) unlock multimodal RAG
