Streaming
Overview
RAGLight supports token-by-token streaming on all LLM providers via generate_streaming(). The method returns a Python generator — your application receives each chunk as soon as the model produces it, without waiting for the full response.
Streaming and non-streaming are fully interchangeable. The same pipeline, the same config, the same providers — just a different method call.
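For instance, with a pipeline that has already been built (see Usage below), switching between the two is a one-line change. A minimal sketch, assuming generate() accepts the same question string as generate_streaming():
# Blocking call: waits for the complete answer
answer = pipeline.generate("What is RAGLight?")
# Streaming call: yields chunks as they arrive; joining them should give the same final text
answer = "".join(pipeline.generate_streaming("What is RAGLight?"))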
Usage
With RAGPipeline
from raglight.rag.simple_rag_api import RAGPipeline
from raglight.config.rag_config import RAGConfig
from raglight.config.vector_store_config import VectorStoreConfig
from raglight.config.settings import Settings
config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
)
vector_store_config = VectorStoreConfig(
    embedding_model=Settings.DEFAULT_EMBEDDINGS_MODEL,
    provider=Settings.HUGGINGFACE,
    database=Settings.CHROMA,
    persist_directory="./myDb",
    collection_name="my_collection",
)
pipeline = RAGPipeline(config, vector_store_config)
pipeline.build()
for chunk in pipeline.generate_streaming("What is RAGLight?"):
    print(chunk, end="", flush=True)
print()
With the Builder API
from raglight.rag.builder import Builder
from raglight.config.settings import Settings
rag = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=Settings.DEFAULT_EMBEDDINGS_MODEL)
    .with_vector_store(Settings.CHROMA, persist_directory="./myDb", collection_name="my_collection")
    .with_llm(Settings.OLLAMA, model_name=Settings.DEFAULT_LLM)
    .build_rag(k=5)
)
for chunk in rag.generate_streaming({"question": "Explain the retrieval pipeline"}):
    print(chunk, end="", flush=True)
print()
Supported providers
Streaming is available on all LLM providers:
| Provider | Constant |
| --- | --- |
| Ollama | Settings.OLLAMA |
| OpenAI | Settings.OPENAI |
| Mistral | Settings.MISTRAL |
| Google Gemini | Settings.GOOGLE_GEMINI |
| LMStudio | Settings.LMSTUDIO |
| AWS Bedrock | Settings.AWS_BEDROCK |
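Switching providers does not change the streaming code itself. As a sketch (the model name below is illustrative, and the OpenAI provider is assumed to read its API key from the environment):
config = RAGConfig(
    llm="gpt-4o-mini",        # illustrative model name, not a RAGLight default
    provider=Settings.OPENAI, # any constant from the table works the same way
)
# Build the pipeline exactly as before; generate_streaming() behaves identically.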
Streaming with Langfuse
Langfuse tracing works identically for streaming. The trace is emitted when the stream ends — no extra configuration needed.
from raglight.config.langfuse_config import LangfuseConfig
from raglight.config.rag_config import RAGConfig
from raglight.config.settings import Settings
langfuse_config = LangfuseConfig(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000",
)
config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
    langfuse_config=langfuse_config,
)
# ... build the pipeline as in the RAGPipeline example above ...
for chunk in pipeline.generate_streaming("What is RAGLight?"):
    print(chunk, end="", flush=True)
# → trace appears in Langfuse once the stream completes
REST API streaming
The REST API started by raglight serve exposes streaming through a Server-Sent Events (SSE) endpoint:
curl -X POST http://localhost:8000/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "What is RAGLight?"}' \
  --no-buffer
The response is a stream of `data: {...}` events, terminated by a final `data: [DONE]` event.
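To consume the same endpoint from Python, here is a minimal sketch using the requests library (the JSON shape of each event is not documented here, so the example simply prints each decoded event):
import json
import requests

with requests.post(
    "http://localhost:8000/generate/stream",
    json={"question": "What is RAGLight?"},
    stream=True,
) as response:
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip blank SSE separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end of stream
        print(json.loads(payload))  # event structure depends on the server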
Summary
- Use generate_streaming() instead of generate(); no other changes are needed
- Returns a generator — iterate it to receive chunks
- All providers supported
- Langfuse tracing works transparently on streaming calls