Knowledge Sources

Overview

Knowledge sources define where your data comes from. In RAGLight, loaders are responsible for:
  • declaring what data should be ingested
  • describing where this data lives (local filesystem, GitHub, …)
  • feeding the ingestion pipeline used by vector stores
They sit upstream of embeddings and vector stores:
Knowledge Sources → Document Processing → Embeddings → Vector Store → Retrieval
Loaders are intentionally simple and declarative. They do not perform ingestion themselves; they only describe sources.

Design philosophy

RAGLight separates data declaration from data ingestion:
  • loaders describe what to ingest
  • vector stores decide how to ingest it
This makes pipelines:
  • predictable
  • composable
  • easy to reason about

What is a Knowledge Source?

A knowledge source is a structured object describing a data origin. In RAGLight, knowledge sources are typed models that describe either:
  • local folders
  • remote repositories
All sources are normalized into files that are later processed by document processors.

Available loaders

RAGLight currently provides two built-in loaders:
  • FolderSource
  • GitHubSource
Both implement the same conceptual contract: expose files to the ingestion pipeline.

FolderSource

Concept

FolderSource represents a directory on your local filesystem. All supported files inside this directory (recursively) will be ingested by the vector store. Typical use cases:
  • documentation folders
  • PDFs and reports
  • codebases
  • mixed data (docs + code)
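To make "recursively" concrete, here is a minimal sketch of how a folder could be expanded into a flat list of files. It is not RAGLight's implementation: list_supported_files and SUPPORTED_EXTENSIONS are hypothetical names, and the extension set is an assumption chosen for illustration.

```python
from pathlib import Path

# Hypothetical extension set; the real one depends on the available document processors.
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".txt", ".py"}

def list_supported_files(root: str) -> list[Path]:
    """Return every file under `root` (at any depth) whose suffix is supported."""
    base = Path(root)
    return sorted(
        p for p in base.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

The suffix is lower-cased before the lookup, so REPORT.PDF and report.pdf are treated alike.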

Definition

from pydantic import BaseModel
from typing import Literal

class FolderSource(BaseModel):
    type: Literal["folder"] = "folder"
    path: str
The path must point to an existing directory.
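The definition above only declares path as a string, and this page does not say whether the existence check happens at construction time or later during ingestion. If you want to fail early, one option is to check the path yourself before building the source. The helper below, ensure_directory, is a hypothetical name and not part of RAGLight.

```python
from pathlib import Path

def ensure_directory(path: str) -> Path:
    """Return `path` as a resolved Path, or raise if it is not an existing directory."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_dir():
        raise NotADirectoryError(f"Not an existing directory: {resolved}")
    return resolved
```

The returned value can be converted with str() and passed as the path argument of FolderSource.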

Example usage

from raglight.models.data_source_model import FolderSource

source = FolderSource(path="./data/knowledge_base")
When used in a pipeline, all supported files inside this folder, including its subfolders, are passed to the ingestion pipeline.

GitHubSource

Concept

GitHubSource represents a remote GitHub repository. RAGLight will:
  1. fetch the repository
  2. checkout the specified branch
  3. expose its files to the ingestion pipeline
This makes it easy to build RAG systems over:
  • open-source repositories
  • internal tools
  • documentation hosted on GitHub

Definition

from pydantic import BaseModel, Field
from typing import Literal

class GitHubSource(BaseModel):
    type: Literal["github"] = "github"
    url: str
    branch: str = Field(default="main")

Example usage

from raglight.models.data_source_model import GitHubSource

source = GitHubSource(
    url="https://github.com/Bessouat40/RAGLight",
    branch="main",
)
The repository content is then treated exactly like a local folder.

Using multiple sources

You can combine multiple knowledge sources in a single pipeline.
from raglight.models.data_source_model import FolderSource, GitHubSource

knowledge_base = [
    FolderSource(path="./docs"),
    GitHubSource(url="https://github.com/Bessouat40/RAGLight"),
]
All sources are ingested together into the same vector store and its collections.
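Because every source carries a literal type field, code that consumes a mixed list can branch on it. The sketch below uses plain dataclasses as stand-ins for the pydantic models so that it runs without extra dependencies. The describe function is hypothetical and only illustrates the dispatch.

```python
from dataclasses import dataclass
from typing import Union

# Stand-ins for the pydantic models shown earlier, kept dependency-free for illustration.
@dataclass
class FolderSource:
    path: str
    type: str = "folder"

@dataclass
class GitHubSource:
    url: str
    branch: str = "main"
    type: str = "github"

Source = Union[FolderSource, GitHubSource]

def describe(source: Source) -> str:
    """Return a one-line description, branching on the `type` discriminator."""
    if source.type == "folder":
        return f"local folder at {source.path}"
    if source.type == "github":
        return f"GitHub repo {source.url} (branch {source.branch})"
    raise ValueError(f"Unknown source type: {source.type!r}")

knowledge_base = [
    FolderSource(path="./docs"),
    GitHubSource(url="https://github.com/Bessouat40/RAGLight"),
]
descriptions = [describe(s) for s in knowledge_base]
```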

How loaders integrate with pipelines

Loaders are consumed by pipelines during the build() phase. Example with RAGPipeline:
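from raglight.rag.simple_rag_api import RAGPipeline
from raglight.config.rag_config import RAGConfig
from raglight.models.data_source_model import FolderSource, GitHubSource

# Declare the sources; nothing is ingested at this point.
config = RAGConfig(
    knowledge_base=[
        FolderSource(path="./data"),
        GitHubSource(url="https://github.com/Bessouat40/RAGLight"),
    ]
)

# vector_store_config is assumed to be defined elsewhere; it is not covered on this page.
pipeline = RAGPipeline(config, vector_store_config)
pipeline.build()  # ingestion happens here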
During build():
  • loaders are resolved
  • files are passed to the vector store ingestion pipeline
  • embeddings are computed and stored

Ignore folders

When loading folder-based sources, some directories should not be indexed. RAGLight supports ignore rules via:
  • default ignore list: Settings.DEFAULT_IGNORE_FOLDERS
  • per-pipeline overrides
Ignored folders are filtered before document processing.
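As an illustration of filtering before processing, the sketch below prunes ignored directories while walking a tree, so their contents are never visited. The DEFAULT_IGNORE values and the walk_files function are assumptions made for this example. The actual defaults live in Settings.DEFAULT_IGNORE_FOLDERS and may differ.

```python
import os

# Hypothetical defaults; the real values live in Settings.DEFAULT_IGNORE_FOLDERS.
DEFAULT_IGNORE = {".git", "node_modules", "__pycache__", ".venv"}

def walk_files(root: str, ignore_folders=None) -> list[str]:
    """List files under `root`, skipping any directory whose name is ignored."""
    ignored = set(DEFAULT_IGNORE if ignore_folders is None else ignore_folders)
    found = []
    for current, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in ignored]
        found.extend(os.path.join(current, name) for name in filenames)
    return sorted(found)
```

Passing an explicit ignore_folders list replaces the defaults, which mirrors the idea of a per-pipeline override.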

Loaders vs Document Processors

It is important to distinguish the two:
  • Loaders decide where data comes from
  • Document processors decide how files are interpreted
For example:
  • FolderSource exposes a .pdf
  • PDFProcessor decides how to parse and chunk it
This separation keeps responsibilities clear and extensible.
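The division of labour can be sketched as a lookup from file extension to processor: the loader only supplies paths, and the processor decides what to do with each one. The registry and function names below are hypothetical, and the toy text processor stands in for real processors such as PDFProcessor.

```python
from pathlib import Path
from typing import Callable, Dict, List

def process_text(path: str) -> List[str]:
    """Toy processor: split a text file into paragraph chunks."""
    text = Path(path).read_text(encoding="utf-8")
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]

# Hypothetical registry mapping file extensions to processors.
PROCESSORS: Dict[str, Callable[[str], List[str]]] = {
    ".txt": process_text,
    ".md": process_text,
}

def process_file(path: str) -> List[str]:
    """Pick a processor from the file extension; unsupported files yield no chunks."""
    processor = PROCESSORS.get(Path(path).suffix.lower())
    return processor(path) if processor else []
```

Supporting a new file type then means registering one more processor, with no change to the loaders.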

Summary

  • Loaders define data origins (folders, repositories)
  • They are declarative and lightweight
  • They feed the ingestion pipeline used by vector stores
  • RAGLight currently supports FolderSource and GitHubSource
  • Multiple sources can be combined in a single pipeline