Streaming

Overview

RAGLight supports token-by-token streaming on all LLM providers via generate_streaming(). The method returns a Python generator — your application receives each chunk as soon as the model produces it, without waiting for the full response. Streaming and non-streaming are fully interchangeable. The same pipeline, the same config, the same providers — just a different method call.

Usage

With RAGPipeline

from raglight.rag.simple_rag_api import RAGPipeline
from raglight.config.rag_config import RAGConfig
from raglight.config.vector_store_config import VectorStoreConfig
from raglight.config.settings import Settings

config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
)

vector_store_config = VectorStoreConfig(
    embedding_model=Settings.DEFAULT_EMBEDDINGS_MODEL,
    provider=Settings.HUGGINGFACE,
    database=Settings.CHROMA,
    persist_directory="./myDb",
    collection_name="my_collection",
)

pipeline = RAGPipeline(config, vector_store_config)
pipeline.build()

for chunk in pipeline.generate_streaming("What is RAGLight?"):
    print(chunk, end="", flush=True)
print()
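Because generate_streaming() returns a plain Python generator, you can display chunks as they arrive while also keeping the assembled answer. A minimal sketch — the collect_stream helper below is our own illustration, not part of RAGLight:

```python
def collect_stream(chunks):
    """Print each chunk as it arrives and return the full assembled text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)  # live display, no buffering
        parts.append(chunk)
    print()
    return "".join(parts)

# Usage with a built pipeline (assumes `pipeline` from the example above):
# answer = collect_stream(pipeline.generate_streaming("What is RAGLight?"))
```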

With the Builder API

from raglight.rag.builder import Builder
from raglight.config.settings import Settings

rag = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=Settings.DEFAULT_EMBEDDINGS_MODEL)
    .with_vector_store(Settings.CHROMA, persist_directory="./myDb", collection_name="my_collection")
    .with_llm(Settings.OLLAMA, model_name=Settings.DEFAULT_LLM)
    .build_rag(k=5)
)

for chunk in rag.generate_streaming({"question": "Explain the retrieval pipeline"}):
    print(chunk, end="", flush=True)
print()

Supported providers

Streaming is available on all LLM providers:
| Provider | Constant |
| --- | --- |
| Ollama | Settings.OLLAMA |
| OpenAI | Settings.OPENAI |
| Mistral | Settings.MISTRAL |
| Google Gemini | Settings.GOOGLE_GEMINI |
| LMStudio | Settings.LMSTUDIO |
| AWS Bedrock | Settings.AWS_BEDROCK |
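Because streaming is the same method call everywhere, switching providers only means changing the RAGConfig. A hedged sketch, mirroring the config from the usage example above with a different provider constant — the model name is a placeholder you would replace with a real one:

```python
from raglight.config.rag_config import RAGConfig
from raglight.config.settings import Settings

# Only the provider constant and model name change; the
# generate_streaming() loop stays exactly the same.
config = RAGConfig(
    llm="your-mistral-model",  # placeholder: substitute a real model name
    provider=Settings.MISTRAL,
)
```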

Streaming with Langfuse

Langfuse tracing works identically for streaming. The trace is emitted when the stream ends — no extra configuration needed.
from raglight.config.langfuse_config import LangfuseConfig

langfuse_config = LangfuseConfig(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000",
)

config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
    langfuse_config=langfuse_config,
)

# ...

for chunk in pipeline.generate_streaming("What is RAGLight?"):
    print(chunk, end="", flush=True)
# → trace appears in Langfuse once the stream completes

REST API streaming

The raglight serve REST API exposes streaming via a Server-Sent Events endpoint:
curl -X POST http://localhost:8000/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "What is RAGLight?"}' \
  --no-buffer
The response is a stream of data: {...} events, terminated by data: [DONE].
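A minimal Python client for this endpoint can parse the event stream line by line. The iter_sse_data helper below is our own sketch, not shipped with RAGLight, and it assumes each event payload is a JSON object as shown above:

```python
import json

def iter_sse_data(lines):
    """Yield the JSON payload of each `data: ...` event, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # terminator event: end of stream
        yield json.loads(payload)

# Usage against the REST API (requires the `requests` package):
# import requests
# resp = requests.post(
#     "http://localhost:8000/generate/stream",
#     json={"question": "What is RAGLight?"},
#     stream=True,
# )
# for event in iter_sse_data(resp.iter_lines(decode_unicode=True)):
#     print(event)
```

The payload field names depend on the server's response schema, so inspect one event before relying on a specific key.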

Summary

  • Use generate_streaming() instead of generate() — no other changes needed
  • Returns a generator — iterate it to receive chunks
  • All providers supported
  • Langfuse tracing works transparently on streaming calls