> ## Documentation Index
> Fetch the complete documentation index at: https://docs.raglight.com/llms.txt
> Use this file to discover all available pages before exploring further.

# RAG Pipelines

> Build and run Retrieval-Augmented Generation pipelines in RAGLight.

# RAG Pipelines

## Overview

A **RAG (Retrieval-Augmented Generation) pipeline** combines two steps:

1. **Retrieve** relevant documents from a vector store
2. **Generate** an answer using a Large Language Model (LLM) conditioned on those documents

RAGLight provides **two ways** to build a standard RAG pipeline:

* a **high-level API** with `RAGPipeline` (simple and explicit)
* a **low-level Builder API** (fully composable and customizable)

Both approaches rely on the same core components:

* loaders (knowledge sources)
* readers (document processors)
* embeddings
* vector stores
* LLMs

***

## How RAG works in RAGLight

At runtime, a RAG pipeline follows this flow:

```
User Question
   ↓
[Reformulation]          ← rewrites the question using conversation history (enabled by default)
   ↓
Vector Store (similarity search)
   ↓
Retrieved Documents
   ↓
[Cross-encoder reranking] ← optional
   ↓
Prompt Construction
   ↓
LLM Generation
   ↓
Final Answer
```

Optional steps can be toggled independently via `RAGConfig` or the Builder API.

***

## Option 1: RAGPipeline (simple API)

`RAGPipeline` is the recommended entry point if you want:

* a clear, batteries-included RAG setup
* minimal boilerplate
* fast prototyping

***

### Basic example

```python theme={null}
from raglight.rag.simple_rag_api import RAGPipeline
from raglight.models.data_source_model import FolderSource, GitHubSource
from raglight.config.settings import Settings
from raglight.config.rag_config import RAGConfig
from raglight.config.vector_store_config import VectorStoreConfig

Settings.setup_logging()

knowledge_base = [
    FolderSource(path="./data/knowledge_base"),
    GitHubSource(url="https://github.com/Bessouat40/RAGLight"),
]

vector_store_config = VectorStoreConfig(
    embedding_model=Settings.DEFAULT_EMBEDDINGS_MODEL,
    provider=Settings.HUGGINGFACE,
    database=Settings.CHROMA,
    persist_directory="./defaultDb",
    collection_name=Settings.DEFAULT_COLLECTION_NAME,
)

config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
    knowledge_base=knowledge_base,
    k=5,
)

pipeline = RAGPipeline(config, vector_store_config)

pipeline.build()

response = pipeline.generate(
    "How can I create an easy RAGPipeline using RAGLight?"
)
print(response)
```

***

### What happens during `build()`

Calling `pipeline.build()` triggers:

1. resolution of knowledge sources
2. document ingestion
3. chunking and embedding
4. storage in the vector store

Once built, the pipeline is ready for querying.

***

### Querying the pipeline

```python theme={null}
response = pipeline.generate("Explain how RAG works")
```

Behind the scenes:

* the query is embedded
* the vector store retrieves top-k chunks
* chunks are injected into a prompt
* the LLM generates an answer

***

## Option 2: Builder API (advanced)

The **Builder API** exposes all RAG components explicitly.

Use it when you want:

* fine-grained control over each component
* custom ingestion workflows
* advanced experimentation

***

### Building a RAG pipeline step by step

```python theme={null}
from raglight.rag.builder import Builder
from raglight.config.settings import Settings

builder = Builder()

rag = (
    builder
    .with_embeddings(
        Settings.HUGGINGFACE,
        model_name=Settings.DEFAULT_EMBEDDINGS_MODEL,
    )
    .with_vector_store(
        Settings.CHROMA,
        persist_directory="./defaultDb",
        collection_name=Settings.DEFAULT_COLLECTION_NAME,
    )
    .with_llm(
        Settings.OLLAMA,
        model_name=Settings.DEFAULT_LLM,
        system_prompt=Settings.DEFAULT_SYSTEM_PROMPT,
    )
    .build_rag(k=5)
)
```

***

### Ingesting documents manually

With the Builder API, ingestion is explicit:

```python theme={null}
rag.vector_store.ingest(data_path="./data")
```

This makes it easy to:

* control when ingestion happens
* reuse the same vector store across pipelines
* debug indexing issues

***

### Querying the RAG pipeline

```python theme={null}
response = rag.generate("How does RAGLight structure a RAG pipeline?")
print(response)
```

The retrieval and generation logic is identical to `RAGPipeline`.

***

## Choosing between RAGPipeline and Builder

| Use case                 | Recommended approach |
| ------------------------ | -------------------- |
| Quick prototype          | `RAGPipeline`        |
| Minimal code             | `RAGPipeline`        |
| Fine-grained control     | Builder API          |
| Custom ingestion         | Builder API          |
| Advanced experimentation | Builder API          |

Both APIs produce the same internal RAG graph.

***

## Common parameters

Regardless of the API, the following parameters matter:

| Parameter             | Default                  | Description                                                                  |
| :-------------------- | :----------------------- | :--------------------------------------------------------------------------- |
| `k`                   | `2`                      | Number of retrieved chunks per query                                         |
| `provider`            | `Ollama`                 | LLM provider                                                                 |
| `llm`                 | *(see Settings)*         | LLM model name                                                               |
| `api_base`            | `http://localhost:11434` | LLM API base URL                                                             |
| `system_prompt`       | *(default prompt)*       | Prompt injected before context                                               |
| `cross_encoder_model` | `None`                   | Optional cross-encoder for reranking retrieved chunks                        |
| `reformulation`       | `True`                   | Rewrite follow-up questions as standalone queries before retrieval           |
| `max_history`         | `20`                     | Maximum number of messages kept in conversation history (`None` = unlimited) |

These parameters directly affect answer quality and latency.

<Info>
  The default `k=2` in `RAGConfig` is intentionally conservative. Set `k=5` or higher for broader retrieval coverage.
</Info>

***

## Streaming

All LLM providers support **token-by-token streaming** via `generate_streaming()`, available on both `RAGPipeline` and the Builder's `RAG` object.

The streaming path runs the full pipeline (reformulation → retrieval → reranking) and then yields answer chunks as they are produced by the LLM, instead of waiting for the complete response.

### With RAGPipeline

```python theme={null}
pipeline = RAGPipeline(config, vector_store_config)
pipeline.build()

for chunk in pipeline.generate_streaming("How does RAGLight work?"):
    print(chunk, end="", flush=True)
print()  # newline after the stream ends
```

### With the Builder API

```python theme={null}
rag = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name="all-MiniLM-L6-v2")
    .with_vector_store(Settings.CHROMA, persist_directory="./db", collection_name="col")
    .with_llm(Settings.OLLAMA, model_name="llama3.1:8b")
    .build_rag(k=5)
)

rag.vector_store.ingest(data_path="./docs")

for chunk in rag.generate_streaming("Explain the retrieval pipeline"):
    print(chunk, end="", flush=True)
print()
```

<Info>
  Streaming is supported by all providers: **Ollama**, **OpenAI**, **vLLM**, **LMStudio**, **Mistral**, **Google Gemini**, and **AWS Bedrock**. Conversation history is updated automatically at the end of the stream, just like with `generate()`.
</Info>

<Tip>
  Use `generate()` when you need the full answer as a string. Use `generate_streaming()` when building interactive UIs or CLI tools where you want to display tokens as they arrive.
</Tip>

***

## Conversation history

RAGLight automatically maintains conversation history across `generate()` calls. Each turn appends a `user` and an `assistant` message that are passed to the LLM on the next call — enabling genuine multi-turn conversations.

**History is supported by all providers**: Ollama, OpenAI, Mistral, LMStudio, Google Gemini, and AWS Bedrock.

### Limit history size with `max_history`

By default, history is capped at **20 messages** (\~10 turns) to avoid hitting the model's context window. Set `max_history` to adjust this limit, or pass `None` for unlimited history:

```python theme={null}
# Via RAGConfig
config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
    max_history=20,  # keep last 20 messages (~10 turns)
)

# Via Builder
rag = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name="all-MiniLM-L6-v2")
    .with_vector_store(Settings.CHROMA, persist_directory="./db", collection_name="col")
    .with_llm(Settings.OLLAMA, model_name="llama3.1:8b")
    .build_rag(k=5, max_history=20)
)
```

<Tip>
  A good rule of thumb: set `max_history` to roughly 2× the number of conversation turns you want to retain. Each turn produces 2 messages (user + assistant).
</Tip>

***

## Summary

* RAG pipelines retrieve documents before generating answers
* RAGLight offers a simple (`RAGPipeline`) and an advanced (Builder) API
* Both approaches share the same core logic
* Use `generate()` for a complete string answer, `generate_streaming()` to yield tokens progressively
* Streaming is supported by all providers (Ollama, OpenAI, vLLM, LMStudio, Mistral, Gemini, Bedrock)
* Conversation history is maintained automatically and works across all providers and both generate methods
* Use `max_history` to cap history size and avoid context overflow
* Query reformulation is enabled by default and improves retrieval in multi-turn conversations
* Choose simplicity or control depending on your use case

<Card title="Query Reformulation" icon="pencil" href="/documentation/reformulation">
  Learn how RAGLight rewrites follow-up questions to improve retrieval accuracy.
</Card>