Knowledge Sources

Overview

Knowledge sources define where your data comes from. In RAGLight, they are declared through loaders, which are responsible for:

declaring what data should be ingested
describing where this data lives (local filesystem, GitHub, …)
feeding the ingestion pipeline used by vector stores

They sit upstream of embeddings and vector stores:

Knowledge Sources → Document Processing → Embeddings → Vector Store → Retrieval

Loaders are intentionally simple and declarative. They do not perform ingestion by themselves — they only describe sources.

Design philosophy

RAGLight separates data declaration from data ingestion:

loaders describe what to ingest
vector stores decide how to ingest it

This makes pipelines:

predictable
composable
easy to reason about

What is a Knowledge Source?

A knowledge source is a structured object describing a data origin. In RAGLight, knowledge sources are typed models, each representing one of:

local folders
remote repositories

All sources are normalized into files that are later processed by document processors.

Available loaders

RAGLight currently provides two built-in loaders:

FolderSource
GitHubSource

Both implement the same conceptual contract: expose files to the ingestion pipeline.

FolderSource

Concept

FolderSource represents a directory on your local filesystem. All supported files inside this directory (recursively) will be ingested by the vector store. Typical use cases:

documentation folders
PDFs and reports
codebases
mixed data (docs + code)

Definition

from pydantic import BaseModel
from typing import Literal

class FolderSource(BaseModel):
    type: Literal["folder"] = "folder"
    path: str

The path must point to an existing directory.
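The model above declares path as a plain string, so a check made before constructing the source can catch a bad path early. The following is a minimal sketch using only the standard library; require_directory is a hypothetical helper, not part of RAGLight, and whether RAGLight performs its own validation is not stated here.

```python
from pathlib import Path


def require_directory(path: str) -> Path:
    """Return the resolved path, or raise if it is not an existing directory."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_dir():
        raise NotADirectoryError(f"not an existing directory: {resolved}")
    return resolved
```

The returned value can then be passed on, for example as FolderSource(path=str(require_directory("./data/knowledge_base"))).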

Example usage

from raglight.models.data_source_model import FolderSource

source = FolderSource(path="./data/knowledge_base")

When used in a pipeline, all supported files inside this folder, including those in subfolders, are passed to the ingestion pipeline.

GitHubSource

Concept

GitHubSource represents a remote GitHub repository. RAGLight will:

fetch the repository
check out the specified branch
expose its files to the ingestion pipeline

This makes it easy to build RAG systems over:

open-source repositories
internal tools
documentation hosted on GitHub

Definition

from pydantic import BaseModel, Field
from typing import Literal

class GitHubSource(BaseModel):
    type: Literal["github"] = "github"
    url: str
    branch: str = Field(default="main")

Example usage

from raglight.models.data_source_model import GitHubSource

source = GitHubSource(
    url="https://github.com/Bessouat40/RAGLight",
    branch="main",
)

The repository content is then treated exactly like a local folder.
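To make the fetch-and-check-out step concrete, here is a rough sketch of the git command such a step could run. It illustrates the idea only: the mechanism RAGLight actually uses to fetch repositories is not described here, and build_clone_command and its dest parameter are hypothetical names.

```python
from typing import List


def build_clone_command(url: str, branch: str = "main", dest: str = "repo") -> List[str]:
    """Build a shallow, single-branch `git clone` command (not executed here)."""
    return [
        "git", "clone",
        "--branch", branch,   # check out this branch after cloning
        "--single-branch",    # fetch only that branch's history
        "--depth", "1",       # shallow clone: latest commit only
        url,
        dest,
    ]


cmd = build_clone_command("https://github.com/Bessouat40/RAGLight", branch="main")
print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would perform the clone; the resulting directory can then be walked like any local folder.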

Using multiple sources

You can combine multiple knowledge sources in a single pipeline.

from raglight.models.data_source_model import FolderSource, GitHubSource

knowledge_base = [
    FolderSource(path="./docs"),
    GitHubSource(url="https://github.com/Bessouat40/RAGLight"),
]

All sources are ingested together into the same vector store and the same collections.
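Because each model carries a literal type field, a mixed list of sources can also be described as plain dictionaries (for instance in a JSON config) and rebuilt by dispatching on that field. The sketch below uses stand-in dataclasses that mirror the fields shown earlier, so it runs without RAGLight or pydantic installed; SOURCE_TYPES and parse_source are hypothetical names, not RAGLight APIs.

```python
from dataclasses import dataclass
from typing import Literal, Union


# Stand-in dataclasses mirroring the fields of the models shown earlier.
@dataclass
class FolderSource:
    path: str
    type: Literal["folder"] = "folder"


@dataclass
class GitHubSource:
    url: str
    branch: str = "main"
    type: Literal["github"] = "github"


Source = Union[FolderSource, GitHubSource]

# Hypothetical registry: maps the `type` discriminator to a model class.
SOURCE_TYPES = {"folder": FolderSource, "github": GitHubSource}


def parse_source(raw: dict) -> Source:
    """Build the matching source object from a plain dictionary."""
    kind = raw.get("type")
    if kind not in SOURCE_TYPES:
        raise ValueError(f"unknown source type: {kind!r}")
    return SOURCE_TYPES[kind](**raw)


raw_knowledge_base = [
    {"type": "folder", "path": "./docs"},
    {"type": "github", "url": "https://github.com/Bessouat40/RAGLight"},
]
knowledge_base = [parse_source(item) for item in raw_knowledge_base]
```

With the real pydantic models, the same dispatch is usually expressed as a discriminated union on the type field rather than a hand-written registry.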

How loaders integrate with pipelines

Loaders are consumed by pipelines during the build() phase. Example with RAGPipeline:

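from raglight.rag.simple_rag_api import RAGPipeline
from raglight.config.rag_config import RAGConfig
from raglight.models.data_source_model import FolderSource, GitHubSource

# Declare the knowledge sources; nothing is ingested at this point.
config = RAGConfig(
    knowledge_base=[
        FolderSource(path="./data"),
        GitHubSource(url="https://github.com/Bessouat40/RAGLight"),
    ]
)

# vector_store_config is assumed to be defined elsewhere (it is not shown in this section).
pipeline = RAGPipeline(config, vector_store_config)
pipeline.build()  # ingestion happens here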

During build():

loaders are resolved
files are passed to the vector store ingestion pipeline
embeddings are computed and stored
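The three steps above can be pictured with a toy build function. This is a simplified illustration, not RAGLight's implementation: it handles only .txt files in local folders, skips chunking, and uses a placeholder embedding. build_index and toy_embed are hypothetical names.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def build_index(
    folders: Iterable[str],
    embed: Callable[[str], List[float]],
) -> Dict[str, List[float]]:
    """Toy build step: walk each folder, read text files, store one vector per file."""
    index: Dict[str, List[float]] = {}
    for folder in folders:  # 1. each source is resolved to a directory
        for file in sorted(Path(folder).rglob("*.txt")):  # 2. files are handed to ingestion
            text = file.read_text(encoding="utf-8")
            index[str(file)] = embed(text)  # 3. an embedding is computed and stored
    return index


def toy_embed(text: str) -> List[float]:
    """Placeholder embedding: character count and word count."""
    return [float(len(text)), float(len(text.split()))]
```

A real pipeline would split each file into chunks and call an embedding model instead of toy_embed, but the order of operations is the same.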

Ignore folders

When loading folder-based sources, some directories should not be indexed. RAGLight supports ignore rules via:

default ignore list: Settings.DEFAULT_IGNORE_FOLDERS
per-pipeline overrides

Ignored folders are filtered before document processing.
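Filtering before document processing can be pictured as pruning directories during the folder walk. The sketch below shows that idea with the standard library only; EXAMPLE_IGNORE_FOLDERS is an illustrative list (the actual contents of Settings.DEFAULT_IGNORE_FOLDERS are not reproduced here) and iter_files is a hypothetical helper.

```python
import os
from typing import Iterable, Iterator

# Hypothetical ignore list for illustration only.
EXAMPLE_IGNORE_FOLDERS = {".git", "node_modules", "__pycache__"}


def iter_files(root: str, ignore: Iterable[str] = EXAMPLE_IGNORE_FOLDERS) -> Iterator[str]:
    """Yield file paths under `root`, skipping any directory named in `ignore`."""
    ignored = set(ignore)
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in ignored]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Passing a different ignore set per call plays the role of a per-pipeline override in this sketch.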

Loaders vs Document Processors

It is important to distinguish the two:

Loaders decide where data comes from
Document processors decide how files are interpreted

For example:

FolderSource exposes a .pdf
PDFProcessor decides how to parse and chunk it

This separation keeps responsibilities clear and extensible.
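One way to picture this split is a registry that maps file extensions to processor functions: the loader only yields paths, and the registry decides how each file is read. The sketch below is illustrative and uses text and code files rather than PDFs so that it stays within the standard library; PROCESSORS, process_text, process_code, and process_file are hypothetical names, not RAGLight APIs.

```python
from pathlib import Path
from typing import Callable, Dict, List


def process_text(path: Path) -> List[str]:
    """Split a text file into paragraph chunks."""
    text = path.read_text(encoding="utf-8")
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def process_code(path: Path) -> List[str]:
    """Keep a source file as a single chunk."""
    return [path.read_text(encoding="utf-8")]


# Hypothetical registry: file extension -> processor function.
PROCESSORS: Dict[str, Callable[[Path], List[str]]] = {
    ".txt": process_text,
    ".md": process_text,
    ".py": process_code,
}


def process_file(path: Path) -> List[str]:
    """Pick a processor from the file extension; unsupported files yield nothing."""
    processor = PROCESSORS.get(path.suffix.lower())
    return processor(path) if processor else []
```

Supporting a new file type then means registering one more processor, with no change to the sources themselves.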

Summary

Loaders define data origins (folders, repositories)
They are declarative and lightweight
They feed the ingestion pipeline used by vector stores
RAGLight currently supports FolderSource and GitHubSource
Multiple sources can be combined in a single pipeline
