# Streaming

> Token-by-token output with `generate_streaming()`, a drop-in alternative to `generate()` on all providers.

## Overview

RAGLight supports **token-by-token streaming** on all LLM providers via `generate_streaming()`. The method returns a Python generator — your application receives each chunk as soon as the model produces it, without waiting for the full response.

Streaming and non-streaming calls share the same pipeline, the same config, and the same providers. Only the method call differs: `generate_streaming()` hands back a generator of chunks, where `generate()` waits for the full response.

***

## Usage

### With RAGPipeline

```python
from raglight.rag.simple_rag_api import RAGPipeline
from raglight.config.rag_config import RAGConfig
from raglight.config.vector_store_config import VectorStoreConfig
from raglight.config.settings import Settings

config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
)

vector_store_config = VectorStoreConfig(
    embedding_model=Settings.DEFAULT_EMBEDDINGS_MODEL,
    provider=Settings.HUGGINGFACE,
    database=Settings.CHROMA,
    persist_directory="./myDb",
    collection_name="my_collection",
)

pipeline = RAGPipeline(config, vector_store_config)
pipeline.build()

for chunk in pipeline.generate_streaming("What is RAGLight?"):
    print(chunk, end="", flush=True)
print()
```

### With the Builder API

```python
from raglight.rag.builder import Builder
from raglight.config.settings import Settings

rag = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=Settings.DEFAULT_EMBEDDINGS_MODEL)
    .with_vector_store(Settings.CHROMA, persist_directory="./myDb", collection_name="my_collection")
    .with_llm(Settings.OLLAMA, model_name=Settings.DEFAULT_LLM)
    .build_rag(k=5)
)

for chunk in rag.generate_streaming({"question": "Explain the retrieval pipeline"}):
    print(chunk, end="", flush=True)
print()
```
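
Because `generate_streaming()` returns a generator, each chunk is available only once. If you also need the complete answer afterwards (for logging or caching, say), one option is to collect the chunks while printing them. The sketch below is a minimal illustration that assumes every chunk is a string; `fake_stream` and `stream_and_collect` are hypothetical names, and `fake_stream` merely stands in for `pipeline.generate_streaming(...)` so the snippet runs without a model.

```python
from typing import Iterable, Iterator


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for pipeline.generate_streaming(...)."""
    yield from ["RAG", "Light ", "streams ", "tokens."]


def stream_and_collect(chunks: Iterable[str]) -> str:
    """Print each chunk as it arrives and return the full text.

    Assumes every chunk is a str (an assumption, not a documented guarantee).
    """
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)  # show the chunk immediately
        parts.append(chunk)               # keep it for later use
    print()
    return "".join(parts)


answer = stream_and_collect(fake_stream())
```

In a real application you would pass `pipeline.generate_streaming("What is RAGLight?")` in place of `fake_stream()`.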

***

## Supported providers

Streaming is available on all LLM providers:

| Provider      | Constant                 |
| ------------- | ------------------------ |
| Ollama        | `Settings.OLLAMA`        |
| OpenAI        | `Settings.OPENAI`        |
| Mistral       | `Settings.MISTRAL`       |
| Google Gemini | `Settings.GOOGLE_GEMINI` |
| LMStudio      | `Settings.LMSTUDIO`      |
| AWS Bedrock   | `Settings.AWS_BEDROCK`   |

***

## Streaming with Langfuse

Langfuse tracing works identically for streaming. The trace is emitted when the stream ends — no extra configuration needed.

```python
from raglight.config.langfuse_config import LangfuseConfig
from raglight.config.rag_config import RAGConfig
from raglight.config.settings import Settings

langfuse_config = LangfuseConfig(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000",
)

config = RAGConfig(
    llm=Settings.DEFAULT_LLM,
    provider=Settings.OLLAMA,
    langfuse_config=langfuse_config,
)

# ...

for chunk in pipeline.generate_streaming("What is RAGLight?"):
    print(chunk, end="", flush=True)
# → trace appears in Langfuse once the stream completes
```

***

## REST API streaming

The `raglight serve` REST API exposes streaming via a **Server-Sent Events** endpoint:

```bash
curl -X POST http://localhost:8000/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "What is RAGLight?"}' \
  --no-buffer
```

The response is a stream of `data: {...}` events, terminated by `data: [DONE]`.
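
On the client side, consuming this stream comes down to reading the response line by line, keeping the `data:` lines, and stopping at the `[DONE]` sentinel. The sketch below shows that loop using only the standard library. It works on any iterable of text lines, so it does not depend on a particular HTTP client. `iter_sse_data` is a hypothetical helper, and since the JSON body of each event is not specified here, payloads are returned as raw strings and the sample data is a placeholder.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line until `data: [DONE]`.

    `lines` is any iterable of decoded text lines, for example the lines
    of a streaming HTTP response body.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]  # SSE drops one optional space after the colon
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        yield payload


# Placeholder events: the real JSON schema is not documented on this page.
sample = [
    'data: {"placeholder": 1}\n',
    "\n",
    'data: {"placeholder": 2}\n',
    "\n",
    "data: [DONE]\n",
    "data: never reached\n",
]
payloads = list(iter_sse_data(sample))
```

With a streaming HTTP client, pass the decoded response lines to the helper and parse each payload with `json.loads` once you have checked the actual event schema.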

***

## Summary

* Use `generate_streaming()` instead of `generate()` — no other changes needed
* Returns a generator — iterate it to receive chunks
* All providers supported
* Langfuse tracing works transparently on streaming calls
