Streaming

Overview

RAGLight supports token-by-token streaming on all LLM providers via generate_streaming(). The method returns a Python generator — your application receives each chunk as soon as the model produces it, without waiting for the full response. Streaming and non-streaming are fully interchangeable. The same pipeline, the same config, the same providers — just a different method call.

Usage

With RAGPipeline

from raglight.rag.simple_rag_api import RAGPipeline
from raglight.config.rag_config import RAGConfig
from raglight.config.vector_store_config import VectorStoreConfig
from raglight.config.settings import Settings

config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
)

vector_store_config = VectorStoreConfig(
    embedding_model=Settings.DEFAULT_EMBEDDINGS_MODEL,
    provider=Settings.HUGGINGFACE,
    database=Settings.CHROMA,
    persist_directory="./myDb",
    collection_name="my_collection",
)

pipeline = RAGPipeline(config, vector_store_config)
pipeline.build()

for chunk in pipeline.generate_streaming("What is RAGLight?"):
    print(chunk, end="", flush=True)
print()
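Because generate_streaming() returns a plain Python generator, you can display chunks as they arrive while also keeping the assembled answer. A minimal sketch — the collect_stream helper below is our own illustration, not part of RAGLight:

```python
def collect_stream(chunks):
    """Print each chunk as it arrives and return the full assembled text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)  # live display, no buffering
        parts.append(chunk)
    print()
    return "".join(parts)

# Usage with a built pipeline (assumes `pipeline` from the example above):
# answer = collect_stream(pipeline.generate_streaming("What is RAGLight?"))
```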

With the Builder API

from raglight.rag.builder import Builder
from raglight.config.settings import Settings

rag = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=Settings.DEFAULT_EMBEDDINGS_MODEL)
    .with_vector_store(Settings.CHROMA, persist_directory="./myDb", collection_name="my_collection")
    .with_llm(Settings.OLLAMA, model_name=Settings.DEFAULT_LLM)
    .build_rag(k=5)
)

for chunk in rag.generate_streaming({"question": "Explain the retrieval pipeline"}):
    print(chunk, end="", flush=True)
print()

Supported providers

Streaming is available on all LLM providers:
| Provider | Constant |
| --- | --- |
| Ollama | Settings.OLLAMA |
| OpenAI | Settings.OPENAI |
| Mistral | Settings.MISTRAL |
| Google Gemini | Settings.GOOGLE_GEMINI |
| LMStudio | Settings.LMSTUDIO |
| AWS Bedrock | Settings.AWS_BEDROCK |
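Because streaming is the same method call everywhere, switching providers only means changing the RAGConfig. A hedged sketch, mirroring the config from the usage example above with a different provider constant — the model name is a placeholder you would replace with a real one:

```python
from raglight.config.rag_config import RAGConfig
from raglight.config.settings import Settings

# Only the provider constant and model name change; the
# generate_streaming() loop stays exactly the same.
config = RAGConfig(
    llm="your-mistral-model",  # placeholder: substitute a real model name
    provider=Settings.MISTRAL,
)
```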

Streaming with Langfuse

Langfuse tracing works identically for streaming. The trace is emitted when the stream ends — no extra configuration needed.
from raglight.config.langfuse_config import LangfuseConfig

langfuse_config = LangfuseConfig(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000",
)

config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
    langfuse_config=langfuse_config,
)

# ...

for chunk in pipeline.generate_streaming("What is RAGLight?"):
    print(chunk, end="", flush=True)
# → trace appears in Langfuse once the stream completes

REST API streaming

The raglight serve REST API exposes streaming via a Server-Sent Events endpoint:
curl -X POST http://localhost:8000/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "What is RAGLight?"}' \
  --no-buffer
The response is a stream of data: {...} events, terminated by data: [DONE].
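A minimal Python client for this endpoint can parse the event stream line by line. The iter_sse_data helper below is our own sketch, not shipped with RAGLight, and it assumes each event payload is a JSON object as shown above:

```python
import json

def iter_sse_data(lines):
    """Yield the JSON payload of each `data: ...` event, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # terminator event: end of stream
        yield json.loads(payload)

# Usage against the REST API (requires the `requests` package):
# import requests
# resp = requests.post(
#     "http://localhost:8000/generate/stream",
#     json={"question": "What is RAGLight?"},
#     stream=True,
# )
# for event in iter_sse_data(resp.iter_lines(decode_unicode=True)):
#     print(event)
```

The payload field names depend on the server's response schema, so inspect one event before relying on a specific key.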

Summary

  • Use generate_streaming() instead of generate() — no other changes needed
  • Returns a generator — iterate it to receive chunks
  • All providers supported
  • Langfuse tracing works transparently on streaming calls