# Knowledge Sources

> How RAGLight discovers, loads, and prepares data for ingestion.

## Overview

Knowledge sources define **where your data comes from**.

In RAGLight, knowledge sources (referred to as loaders on this page) are responsible for:

* declaring *what* data should be ingested
* describing *where* this data lives (local filesystem, GitHub, …)
* feeding the ingestion pipeline used by vector stores

They sit **upstream** of embeddings and vector stores:

```
Knowledge Sources → Document Processing → Embeddings → Vector Store → Retrieval
```

Loaders are intentionally simple and declarative. They do not perform ingestion
by themselves — they only describe sources.

***

## Design philosophy

RAGLight separates **data declaration** from **data ingestion**:

* loaders describe *what to ingest*
* vector stores decide *how to ingest it*

This makes pipelines:

* predictable
* composable
* easy to reason about

***

## What is a Knowledge Source?

A **knowledge source** is a structured object describing a data origin.

In RAGLight, knowledge sources are typed models that can be:

* local folders
* remote repositories

All sources are normalized into files that are later processed by document processors.

***

## Available loaders

RAGLight currently provides two built-in loaders:

* `FolderSource`
* `GitHubSource`

Both implement the same conceptual contract: *expose files to the ingestion pipeline*.

***

## FolderSource

### Concept

`FolderSource` represents a directory on your local filesystem.

All supported files inside this directory (recursively) will be ingested by the vector store.

Typical use cases:

* documentation folders
* PDFs and reports
* codebases
* mixed data (docs + code)

***

### Definition

```python theme={null}
from pydantic import BaseModel
from typing import Literal

class FolderSource(BaseModel):
    type: Literal["folder"] = "folder"
    path: str
```

The `path` must point to an existing directory.
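
This page does not say whether `FolderSource` checks the directory itself at construction time, so a defensive check before building the source can help catch typos early. A minimal sketch using only the standard library; `require_directory` is a hypothetical helper, not part of RAGLight:

```python
from pathlib import Path


def require_directory(path: str) -> str:
    """Return the resolved path if it is an existing directory, else raise."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_dir():
        raise NotADirectoryError(f"Not an existing directory: {resolved}")
    return str(resolved)
```

The returned string can then be passed as the `path` argument of `FolderSource`.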

***

### Example usage

```python theme={null}
from raglight.models.data_source_model import FolderSource

source = FolderSource(path="./data/knowledge_base")
```

When used in a pipeline, all supported files inside this folder (including subfolders) are passed to the ingestion pipeline.

***

## GitHubSource

### Concept

`GitHubSource` represents a **remote GitHub repository**.

RAGLight will:

1. fetch the repository
2. check out the specified branch
3. expose its files to the ingestion pipeline

This makes it easy to build RAG systems over:

* open-source repositories
* internal tools
* documentation hosted on GitHub
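
The fetch and checkout steps listed above can be approximated with a plain `git clone`. The sketch below is an illustration only: how RAGLight actually fetches repositories is not described here, the shallow single-branch clone is an assumption, and `build_clone_command` and `fetch_repository` are hypothetical names.

```python
import subprocess
from pathlib import Path


def build_clone_command(url: str, branch: str, dest: str) -> list[str]:
    """Build a shallow, single-branch `git clone` command."""
    return [
        "git", "clone",
        "--depth", "1",          # assumption: history is not needed for ingestion
        "--branch", branch,      # check out the requested branch directly
        "--single-branch",
        url, dest,
    ]


def fetch_repository(url: str, branch: str, dest: str) -> Path:
    """Clone `url` at `branch` into `dest` and return the checkout directory."""
    subprocess.run(build_clone_command(url, branch, dest), check=True)
    return Path(dest)
```

Once cloned, the directory returned by `fetch_repository` can be walked like any other local folder, which matches the third step.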

***

### Definition

```python theme={null}
from pydantic import BaseModel, Field
from typing import Literal

class GitHubSource(BaseModel):
    type: Literal["github"] = "github"
    url: str
    branch: str = Field(default="main")
```

***

### Example usage

```python theme={null}
from raglight.models.data_source_model import GitHubSource

source = GitHubSource(
    url="https://github.com/Bessouat40/RAGLight",
    branch="main",
)
```

The repository content is then treated exactly like a local folder.

***

## Using multiple sources

You can combine multiple knowledge sources in a single pipeline.

```python theme={null}
from raglight.models.data_source_model import FolderSource, GitHubSource

knowledge_base = [
    FolderSource(path="./docs"),
    GitHubSource(url="https://github.com/Bessouat40/RAGLight"),
]
```

All sources are ingested together into the same vector store and its collections.
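
Because each model carries a literal `type` field, code that receives a mixed list can branch on that tag. The sketch below uses plain dataclass stand-ins rather than the real RAGLight models; `FolderStub`, `GitHubStub`, and `describe` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class FolderStub:
    """Stand-in mirroring the fields of FolderSource."""
    path: str
    type: str = "folder"


@dataclass
class GitHubStub:
    """Stand-in mirroring the fields of GitHubSource."""
    url: str
    branch: str = "main"
    type: str = "github"


def describe(source: Union[FolderStub, GitHubStub]) -> str:
    """Return a human-readable location for a source, based on its type tag."""
    if source.type == "folder":
        return f"local folder {source.path}"
    if source.type == "github":
        return f"GitHub repository {source.url} (branch {source.branch})"
    raise ValueError(f"Unknown source type: {source.type}")


knowledge_base = [
    FolderStub(path="./docs"),
    GitHubStub(url="https://github.com/Bessouat40/RAGLight"),
]
locations = [describe(source) for source in knowledge_base]
```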

***

## How loaders integrate with pipelines

Loaders are consumed by pipelines during the `build()` phase.

Example with `RAGPipeline`:

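```python theme={null}
from raglight.config.rag_config import RAGConfig
from raglight.models.data_source_model import FolderSource, GitHubSource
from raglight.rag.simple_rag_api import RAGPipeline

# Declare the knowledge sources; nothing is ingested at this point.
config = RAGConfig(
    knowledge_base=[
        FolderSource(path="./data"),
        GitHubSource(url="https://github.com/Bessouat40/RAGLight"),
    ]
)

# `vector_store_config` is assumed to be defined elsewhere; its construction
# is not covered on this page.
pipeline = RAGPipeline(config, vector_store_config)

# Resolve the sources, ingest their files, and compute embeddings.
pipeline.build()
```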

During `build()`:

* loaders are resolved
* files are passed to the vector store ingestion pipeline
* embeddings are computed and stored

***

## Ignore folders

When loading folder-based sources, some directories should not be indexed.

RAGLight supports ignore rules via:

* default ignore list: `Settings.DEFAULT_IGNORE_FOLDERS`
* per-pipeline overrides

Ignored folders are filtered **before** document processing.
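
To illustrate what filtering before document processing can look like, here is a minimal sketch built on `os.walk`. It is not RAGLight's implementation: `EXAMPLE_IGNORE_FOLDERS`, `keep_dirs`, and `iter_files` are hypothetical, and the example set does not claim to match the contents of `Settings.DEFAULT_IGNORE_FOLDERS`.

```python
import os
from collections.abc import Iterator

# Hypothetical ignore list for illustration only; not the actual contents
# of Settings.DEFAULT_IGNORE_FOLDERS.
EXAMPLE_IGNORE_FOLDERS = {".git", "node_modules", "__pycache__"}


def keep_dirs(dir_names: list[str], ignore_folders: set[str]) -> list[str]:
    """Return the directory names that are not in the ignore set."""
    return [name for name in dir_names if name not in ignore_folders]


def iter_files(root: str, ignore_folders: set[str]) -> Iterator[str]:
    """Yield file paths under `root`, skipping ignored directory names."""
    for current_dir, dir_names, file_names in os.walk(root):
        # Assigning in place stops os.walk from descending into ignored folders.
        dir_names[:] = keep_dirs(dir_names, ignore_folders)
        for name in sorted(file_names):
            yield os.path.join(current_dir, name)
```

Pruning at the directory level means files under an ignored folder are never opened, so no processor sees them.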

***

## Loaders vs Document Processors

It is important to distinguish the two:

* **Loaders** decide *where data comes from*
* **Document processors** decide *how files are interpreted*

For example:

* `FolderSource` exposes a `.pdf`
* `PDFProcessor` decides how to parse and chunk it

This separation keeps responsibilities clear and extensible.
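
One way to picture the hand-off is a lookup from file suffix to processor. The sketch below is an assumption about how such a dispatch could work, not RAGLight's actual registry: only `PDFProcessor` is named on this page, and `EXAMPLE_PROCESSORS`, `pick_processor`, and the two `Example...` processor names are placeholders.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file suffix to a processor name. Only
# `PDFProcessor` appears on this page; the other entries are placeholders.
EXAMPLE_PROCESSORS = {
    ".pdf": "PDFProcessor",
    ".py": "ExampleCodeProcessor",
    ".md": "ExampleTextProcessor",
}


def pick_processor(file_path: str) -> Optional[str]:
    """Return the processor name registered for the file's suffix, if any."""
    return EXAMPLE_PROCESSORS.get(Path(file_path).suffix.lower())
```

The source never needs to know which processor handles a file; it only exposes the path.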

***

## Summary

* Loaders define data origins (folders, repositories)
* They are declarative and lightweight
* They feed the ingestion pipeline used by vector stores
* RAGLight currently supports `FolderSource` and `GitHubSource`
* Multiple sources can be combined in a single pipeline
