RAG Pipelines

Overview

A RAG (Retrieval-Augmented Generation) pipeline combines two steps:
  1. Retrieve relevant documents from a vector store
  2. Generate an answer using a Large Language Model (LLM) conditioned on those documents
RAGLight provides two ways to build a standard RAG pipeline:
  • a high-level API with RAGPipeline (simple and explicit)
  • a low-level Builder API (fully composable and customizable)
Both approaches rely on the same core components:
  • loaders (knowledge sources)
  • readers (document processors)
  • embeddings
  • vector stores
  • LLMs

How RAG works in RAGLight

At runtime, a RAG pipeline follows this flow:
User Question
        ↓
[Reformulation]           ← rewrites the question using conversation history (enabled by default)
        ↓
Vector Store (similarity search)
        ↓
Retrieved Documents
        ↓
[Cross-encoder reranking] ← optional
        ↓
Prompt Construction
        ↓
LLM Generation
        ↓
Final Answer
Optional steps can be toggled independently via RAGConfig or the Builder API.

Option 1: RAGPipeline (simple API)

RAGPipeline is the recommended entry point if you want:
  • a clear, batteries-included RAG setup
  • minimal boilerplate
  • fast prototyping

Basic example

from raglight.rag.simple_rag_api import RAGPipeline
from raglight.models.data_source_model import FolderSource, GitHubSource
from raglight.config.settings import Settings
from raglight.config.rag_config import RAGConfig
from raglight.config.vector_store_config import VectorStoreConfig

Settings.setup_logging()

knowledge_base = [
    FolderSource(path="./data/knowledge_base"),
    GitHubSource(url="https://github.com/Bessouat40/RAGLight"),
]

vector_store_config = VectorStoreConfig(
    embedding_model=Settings.DEFAULT_EMBEDDINGS_MODEL,
    provider=Settings.HUGGINGFACE,
    database=Settings.CHROMA,
    persist_directory="./defaultDb",
    collection_name=Settings.DEFAULT_COLLECTION_NAME,
)

config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
    knowledge_base=knowledge_base,
    k=5,
)

pipeline = RAGPipeline(config, vector_store_config)

pipeline.build()

response = pipeline.generate(
    "How can I create an easy RAGPipeline using RAGLight?"
)
print(response)

What happens during build()

Calling pipeline.build() triggers:
  1. resolution of knowledge sources
  2. document ingestion
  3. chunking and embedding
  4. storage in the vector store
Once built, the pipeline is ready for querying.
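Step 3 (chunking) can be illustrated with a simple fixed-size splitter with overlap. This is a toy stand-in; RAGLight's actual chunker and its parameters may differ:

```python
def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Fixed-size character chunking with overlap between consecutive chunks."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "".join(str(i % 10) for i in range(100))   # 100-character toy document
chunks = split_into_chunks(doc)
print(len(chunks))                               # 3 chunks of at most 40 characters
print(chunks[0][-10:] == chunks[1][:10])         # consecutive chunks share a 10-char overlap
```

Overlap ensures that a sentence cut at a chunk boundary is still fully contained in at least one chunk, which helps retrieval quality.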

Querying the pipeline

response = pipeline.generate("Explain how RAG works")
Behind the scenes:
  • the query is embedded
  • the vector store retrieves top-k chunks
  • chunks are injected into a prompt
  • the LLM generates an answer
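The retrieval step can be made concrete with hand-made embedding vectors and cosine similarity (toy 3-dimensional vectors here; a real pipeline uses the configured embedding model):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    # chunks: (text, embedding) pairs, as a vector store would hold them.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

chunks = [
    ("RAG combines retrieval and generation", [0.9, 0.1, 0.0]),
    ("Vector stores index embeddings",        [0.2, 0.9, 0.1]),
    ("LLMs produce the final answer",         [0.1, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], chunks, k=2))
```

The k parameter here plays the same role as k in RAGConfig: how many of the best-matching chunks are injected into the prompt.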

Option 2: Builder API (advanced)

The Builder API exposes all RAG components explicitly. Use it when you want:
  • fine-grained control over each component
  • custom ingestion workflows
  • advanced experimentation

Building a RAG pipeline step by step

from raglight.rag.builder import Builder
from raglight.config.settings import Settings

builder = Builder()

rag = (
    builder
    .with_embeddings(
        Settings.HUGGINGFACE,
        model_name=Settings.DEFAULT_EMBEDDINGS_MODEL,
    )
    .with_vector_store(
        Settings.CHROMA,
        persist_directory="./defaultDb",
        collection_name=Settings.DEFAULT_COLLECTION_NAME,
    )
    .with_llm(
        Settings.OLLAMA,
        model_name=Settings.DEFAULT_LLM,
        system_prompt=Settings.DEFAULT_SYSTEM_PROMPT,
    )
    .build_rag(k=5)
)

Ingesting documents manually

With the Builder API, ingestion is explicit:
rag.vector_store.ingest(data_path="./data")
This makes it easy to:
  • control when ingestion happens
  • reuse the same vector store across pipelines
  • debug indexing issues

Querying the RAG pipeline

response = rag.generate("How does RAGLight structure a RAG pipeline?")
print(response)
The retrieval and generation logic is identical to RAGPipeline.

Choosing between RAGPipeline and Builder

Use case                     Recommended approach
Quick prototype              RAGPipeline
Minimal code                 RAGPipeline
Fine-grained control         Builder API
Custom ingestion             Builder API
Advanced experimentation     Builder API
Both APIs produce the same internal RAG graph.

Common parameters

Regardless of the API, the following parameters matter:
Parameter             Default                  Description
k                     2                        Number of retrieved chunks per query
provider              Ollama                   LLM provider
llm                   (see Settings)           LLM model name
api_base              http://localhost:11434   LLM API base URL
system_prompt         (default prompt)         Prompt injected before context
cross_encoder_model   None                     Optional cross-encoder for reranking retrieved chunks
reformulation         True                     Rewrite follow-up questions as standalone queries before retrieval
max_history           20                       Maximum number of messages kept in conversation history (None = unlimited)
These parameters directly affect answer quality and latency.
The default k=2 in RAGConfig is intentionally conservative. Set k=5 or higher for broader retrieval coverage.
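Putting the table together, a configuration that enables reranking, keeps reformulation on, and widens retrieval might look like this. The cross-encoder model name is illustrative, not a RAGLight default; substitute one available in your environment:

```python
from raglight.config.rag_config import RAGConfig
from raglight.config.settings import Settings

config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
    api_base="http://localhost:11434",   # the documented default
    k=5,                                 # widen retrieval beyond the default k=2
    cross_encoder_model="cross-encoder/ms-marco-MiniLM-L-6-v2",  # illustrative model name
    reformulation=True,                  # default: rewrite follow-ups before retrieval
    max_history=20,                      # cap history at ~10 turns
)
```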

Streaming

All LLM providers support token-by-token streaming via generate_streaming(), available on both RAGPipeline and the Builder’s RAG object. The streaming path runs the full pipeline (reformulation → retrieval → reranking) and then yields answer chunks as they are produced by the LLM, instead of waiting for the complete response.

With RAGPipeline

pipeline = RAGPipeline(config, vector_store_config)
pipeline.build()

for chunk in pipeline.generate_streaming("How does RAGLight work?"):
    print(chunk, end="", flush=True)
print()  # newline after the stream ends

With the Builder API

rag = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name="all-MiniLM-L6-v2")
    .with_vector_store(Settings.CHROMA, persist_directory="./db", collection_name="col")
    .with_llm(Settings.OLLAMA, model_name="llama3.1:8b")
    .build_rag(k=5)
)

rag.vector_store.ingest(data_path="./docs")

for chunk in rag.generate_streaming("Explain the retrieval pipeline"):
    print(chunk, end="", flush=True)
print()
Streaming is supported by all providers: Ollama, OpenAI, vLLM, LMStudio, Mistral, Google Gemini, and AWS Bedrock. Conversation history is updated automatically at the end of the stream, just like with generate().
Use generate() when you need the full answer as a string. Use generate_streaming() when building interactive UIs or CLI tools where you want to display tokens as they arrive.
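If you want both the live token display and the final answer as a string, you can accumulate chunks as they arrive. This helper is plain Python and works with any iterable of string chunks, such as the generator returned by generate_streaming():

```python
def stream_and_collect(chunks) -> str:
    """Print chunks as they arrive, then return the assembled answer."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream ends
    return "".join(parts)

# With a real pipeline: answer = stream_and_collect(pipeline.generate_streaming("..."))
answer = stream_and_collect(iter(["RAG", "Light ", "streams ", "tokens"]))
```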

Conversation history

RAGLight automatically maintains conversation history across generate() calls. Each turn appends a user message and an assistant message, which are passed to the LLM on the next call, enabling genuine multi-turn conversations. History is supported by all providers: Ollama, OpenAI, Mistral, LMStudio, Google Gemini, and AWS Bedrock.

Limit history size with max_history

By default, history is capped at 20 messages (~10 turns) to avoid hitting the model’s context window. Set max_history to adjust this limit, or pass None for unlimited history:
# Via RAGConfig
config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
    max_history=20,  # keep last 20 messages (~10 turns)
)

# Via Builder
rag = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name="all-MiniLM-L6-v2")
    .with_vector_store(Settings.CHROMA, persist_directory="./db", collection_name="col")
    .with_llm(Settings.OLLAMA, model_name="llama3.1:8b")
    .build_rag(k=5, max_history=20)
)
A good rule of thumb: set max_history to roughly 2× the number of conversation turns you want to retain. Each turn produces 2 messages (user + assistant).
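The capping behaviour can be sketched with collections.deque, which evicts the oldest messages automatically once maxlen is reached (illustrative only; RAGLight manages history internally):

```python
from collections import deque

max_history = 4                       # keep the last 4 messages (~2 turns)
history: deque = deque(maxlen=max_history)

for turn in range(3):                 # 3 turns produce 6 messages; only 4 survive
    history.append({"role": "user", "content": f"question {turn}"})
    history.append({"role": "assistant", "content": f"answer {turn}"})

print([m["content"] for m in history])
# → ['question 1', 'answer 1', 'question 2', 'answer 2']
```

The oldest turn (question 0 / answer 0) is dropped, exactly the behaviour you want to avoid overflowing the model's context window.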

Summary

  • RAG pipelines retrieve documents before generating answers
  • RAGLight offers a simple (RAGPipeline) and an advanced (Builder) API
  • Both approaches share the same core logic
  • Use generate() for a complete string answer, generate_streaming() to yield tokens progressively
  • Streaming is supported by all providers (Ollama, OpenAI, vLLM, LMStudio, Mistral, Gemini, Bedrock)
  • Conversation history is maintained automatically and works across all providers and both generate methods
  • Use max_history to cap history size and avoid context overflow
  • Query reformulation is enabled by default and improves retrieval in multi-turn conversations
  • Choose simplicity or control depending on your use case

Query Reformulation

Learn how RAGLight rewrites follow-up questions to improve retrieval accuracy.