# RAG Pipelines
## Overview
A RAG (Retrieval-Augmented Generation) pipeline combines two steps:
- Retrieve relevant documents from a vector store
- Generate an answer using a Large Language Model (LLM) conditioned on those documents
RAGLight exposes two APIs:
- a high-level API with `RAGPipeline` (simple and explicit)
- a low-level Builder API (fully composable and customizable)

Both APIs are assembled from the same core components:
- loaders (knowledge sources)
- readers (document processors)
- embeddings
- vector stores
- LLMs
## How RAG works in RAGLight
At runtime, a RAG pipeline first ingests and indexes your knowledge sources, then answers queries by retrieving relevant chunks and passing them to the LLM. You configure this flow through RAGConfig or the Builder API.
## Option 1: RAGPipeline (simple API)
RAGPipeline is the recommended entry point if you want:
- a clear, batteries-included RAG setup
- minimal boilerplate
- fast prototyping
### Basic example
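A minimal setup might look like the following pseudocode sketch. The import paths, `FolderSource`, and the constructor arguments are assumptions, not the verified API — check the raglight documentation for the exact names:

```python
# Pseudocode sketch — import paths and argument names are assumptions,
# not verified against the current raglight release.
from raglight.rag.simple_rag_api import RAGPipeline
from raglight.models.data_source_model import FolderSource

pipeline = RAGPipeline(
    knowledge_base=[FolderSource(path="./docs")],  # documents to index
    k=5,                                           # chunks retrieved per query
)
pipeline.build()                                   # ingest, chunk, embed, store
answer = pipeline.generate("How do I configure the vector store?")
print(answer)
```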
### What happens during build()
Calling pipeline.build() triggers:
- resolution of knowledge sources
- document ingestion
- chunking and embedding
- storage in the vector store
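These four steps can be sketched in plain Python. This is not raglight code: a toy hash-based embedding and an in-memory list stand in for the real embedding model and vector store, purely to show the ingestion flow:

```python
import hashlib

def embed(text, dim=8):
    """Toy deterministic embedding: hash words into a fixed-size vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def chunk(document, size=40):
    """Naive fixed-width character chunking."""
    return [document[i:i + size] for i in range(0, len(document), size)]

# build(): ingest documents -> chunk -> embed -> store
vector_store = []  # list of (embedding, chunk_text) pairs
documents = ["RAGLight retrieves documents before generating answers."]
for doc in documents:
    for piece in chunk(doc):
        vector_store.append((embed(piece), piece))

print(len(vector_store), "chunks indexed")
```

A real pipeline swaps in a proper reader, a sentence-aware splitter, a learned embedding model, and a persistent vector store, but the control flow is the same.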
### Querying the pipeline

When you query the pipeline, four steps run in order:
- the query is embedded
- the vector store retrieves top-k chunks
- chunks are injected into a prompt
- the LLM generates an answer
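A toy end-to-end version of these four steps (plain Python, not raglight; the final LLM call is stubbed out by returning the assembled prompt):

```python
import math

def embed(text):
    # Toy bag-of-letters embedding over a-z; stands in for a real model.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

store = [(embed(c), c) for c in [
    "Vector stores index document chunks by embedding.",
    "Ollama is the default LLM provider.",
    "The default k is 2.",
]]

def answer(query, k=2):
    q = embed(query)                                   # 1. embed the query
    top = sorted(store, key=lambda e: cosine(q, e[0]), reverse=True)[:k]
    context = "\n".join(c for _, c in top)             # 2. retrieve top-k chunks
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # 3. chunks are injected into the prompt; 4. a real LLM would complete it.
    return prompt

print(answer("Which provider is the default?"))
```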
## Option 2: Builder API (advanced)
The Builder API exposes all RAG components explicitly. Use it when you want:
- fine-grained control over each component
- custom ingestion workflows
- advanced experimentation
### Building a RAG pipeline step by step
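As a pseudocode sketch, a step-by-step build might look like this. The `with_*` method names and `Settings` constants are assumptions modeled on a fluent-builder pattern, not the verified raglight API:

```python
# Pseudocode sketch — method and constant names are assumptions.
from raglight.rag.builder import Builder
from raglight.config.settings import Settings

rag = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name="all-MiniLM-L6-v2")
    .with_vector_store(Settings.CHROMA, persist_directory="./db",
                       collection_name="docs")
    .with_llm(Settings.OLLAMA, model_name="llama3")
    .build_rag(k=5)
)
```

Each `with_*` call configures one of the core components (embeddings, vector store, LLM) before the final build step assembles them.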
### Ingesting documents manually
With the Builder API, ingestion is explicit, which lets you:
- control when ingestion happens
- reuse the same vector store across pipelines
- debug indexing issues
### Querying the RAG pipeline

The RAG object returned by the Builder exposes the same `generate()` and `generate_streaming()` methods as RAGPipeline.
## Choosing between RAGPipeline and Builder
| Use case | Recommended approach |
|---|---|
| Quick prototype | RAGPipeline |
| Minimal code | RAGPipeline |
| Fine-grained control | Builder API |
| Custom ingestion | Builder API |
| Advanced experimentation | Builder API |
## Common parameters

Regardless of the API, the following parameters matter:

| Parameter | Default | Description |
|---|---|---|
| `k` | 2 | Number of retrieved chunks per query |
| `provider` | Ollama | LLM provider |
| `llm` | (see Settings) | LLM model name |
| `api_base` | `http://localhost:11434` | LLM API base URL |
| `system_prompt` | (default prompt) | Prompt injected before context |
| `cross_encoder_model` | None | Optional cross-encoder for reranking retrieved chunks |
| `reformulation` | True | Rewrite follow-up questions as standalone queries before retrieval |
| `max_history` | 20 | Maximum number of messages kept in conversation history (None = unlimited) |
The default `k=2` in RAGConfig is intentionally conservative. Set `k=5` or higher for broader retrieval coverage.

## Streaming
All LLM providers support token-by-token streaming via `generate_streaming()`, available on both RAGPipeline and the Builder’s RAG object.
The streaming path runs the full pipeline (reformulation → retrieval → reranking) and then yields answer chunks as they are produced by the LLM, instead of waiting for the complete response.
### With RAGPipeline
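A pseudocode sketch, assuming a `pipeline` object built as in the basic example (only `generate_streaming()` itself is named by this guide):

```python
# Pseudocode sketch — `pipeline` is assumed to be a built RAGPipeline.
for token in pipeline.generate_streaming("Summarize the architecture"):
    print(token, end="", flush=True)
```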
### With the Builder API
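The same pattern applies to the Builder’s RAG object, here assumed to be bound to a variable `rag`:

```python
# Pseudocode sketch — `rag` is assumed to come from the Builder's build step.
for token in rag.generate_streaming("Summarize the architecture"):
    print(token, end="", flush=True)
```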
Streaming is supported by all providers: Ollama, OpenAI, vLLM, LMStudio, Mistral, Google Gemini, and AWS Bedrock. Conversation history is updated automatically at the end of the stream, just like with `generate()`.

## Conversation history
RAGLight automatically maintains conversation history across `generate()` calls. Each turn appends a user and an assistant message that are passed to the LLM on the next call, enabling genuine multi-turn conversations.
History is supported by all providers: Ollama, OpenAI, Mistral, LMStudio, Google Gemini, and AWS Bedrock.
### Limit history size with max_history
By default, history is capped at 20 messages (~10 turns) to avoid hitting the model’s context window. Set max_history to adjust this limit, or pass None for unlimited history:
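As an illustration of the mechanism only (plain Python, not raglight code), a hypothetical `trim_history` helper shows how a 20-message cap behaves across turns:

```python
def trim_history(history, max_history=20):
    """Keep only the most recent max_history messages (None = unlimited)."""
    if max_history is None:
        return history
    return history[-max_history:]

history = []
for turn in range(15):                        # 15 turns = 30 messages total
    history.append({"role": "user", "content": f"question {turn}"})
    history.append({"role": "assistant", "content": f"answer {turn}"})
    history = trim_history(history, max_history=20)

print(len(history))           # capped at 20 messages (~10 turns)
print(history[0]["content"])  # oldest surviving message
```

With the cap in place, the oldest turns fall away first, so recent context always fits within the model’s window.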
## Summary
- RAG pipelines retrieve documents before generating answers
- RAGLight offers a simple (`RAGPipeline`) and an advanced (Builder) API
- Both approaches share the same core logic
- Use `generate()` for a complete string answer, `generate_streaming()` to yield tokens progressively
- Streaming is supported by all providers (Ollama, OpenAI, vLLM, LMStudio, Mistral, Gemini, Bedrock)
- Conversation history is maintained automatically and works across all providers and both generate methods
- Use `max_history` to cap history size and avoid context overflow
- Query reformulation is enabled by default and improves retrieval in multi-turn conversations
- Choose simplicity or control depending on your use case