Knowledge Sources
Overview
Knowledge sources define where your data comes from. In RAGLight, loaders are responsible for:
- declaring what data should be ingested
- describing where this data lives (local filesystem, GitHub, …)
- feeding the ingestion pipeline used by vector stores
Design philosophy
RAGLight separates data declaration from data ingestion:
- loaders describe what to ingest
- vector stores decide how to ingest it
This separation keeps pipelines:
- predictable
- composable
- easy to reason about
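The separation can be illustrated with a minimal sketch. `SketchSource` and `SketchVectorStore` below are hypothetical stand-ins written for this example, not RAGLight's own classes: the source only records what to ingest, and the store alone decides how.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SketchSource:
    """Hypothetical declarative source: it only records where data lives."""
    kind: str       # e.g. "folder" or "github"
    location: str   # a path or a URL


@dataclass
class SketchVectorStore:
    """Hypothetical store: it alone decides how a declared source is ingested."""
    ingested: list = field(default_factory=list)

    def ingest(self, source: SketchSource) -> None:
        # A real store would read, chunk and embed files here;
        # this sketch only records which source it was handed.
        self.ingested.append((source.kind, source.location))


docs = SketchSource(kind="folder", location="./docs")  # declaring does no I/O
store = SketchVectorStore()
store.ingest(docs)  # ingestion happens only when the store is asked
```

Because the source is a plain immutable value, it can be created, compared, and passed around without touching the filesystem or the network.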
What is a Knowledge Source?
A knowledge source is a structured object describing a data origin. In RAGLight, knowledge sources are typed models that can be:
- local folders
- remote repositories
Available loaders
RAGLight currently provides two built-in loaders:
- FolderSource
- GitHubSource
FolderSource
Concept
FolderSource represents a directory on your local filesystem.
All supported files inside this directory (recursively) will be ingested by the vector store.
Typical use cases:
- documentation folders
- PDFs and reports
- codebases
- mixed data (docs + code)
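Recursive discovery of supported files can be sketched with the standard library alone. The `SUPPORTED_SUFFIXES` set and the `list_supported_files` helper are assumptions made for this example; the file types RAGLight actually supports are not listed in this section.

```python
from pathlib import Path

# Assumed set of supported extensions, for illustration only.
SUPPORTED_SUFFIXES = {".md", ".txt", ".pdf", ".py"}


def list_supported_files(root: str) -> list[Path]:
    """Return every file under root (recursively) whose suffix is supported."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Calling `list_supported_files("./docs")` would return documentation and code files from any nesting depth, while files with unlisted extensions are left out.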
Definition
path must point to an existing directory.
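The constraint on path can be sketched with a small typed model. `FolderSourceSketch` is a hypothetical stand-in built on a standard-library dataclass; it is not RAGLight's actual FolderSource definition, and the real class may validate differently.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FolderSourceSketch:
    """Hypothetical stand-in for a folder source with a validated path."""
    path: str

    def __post_init__(self) -> None:
        # Reject paths that are missing or that point to a regular file.
        if not Path(self.path).is_dir():
            raise ValueError(
                f"path must point to an existing directory: {self.path!r}"
            )
```

Validating at construction time means a misconfigured source fails immediately instead of partway through ingestion.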
Example usage
GitHubSource
Concept
GitHubSource represents a remote GitHub repository.
RAGLight will:
- fetch the repository
- checkout the specified branch
- expose its files to the ingestion pipeline
Typical use cases:
- open-source repositories
- internal tools
- documentation hosted on GitHub
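The fetch and checkout steps can be sketched as a single git invocation. `GitHubSourceSketch` and `clone_command` are hypothetical names for this example, the `main` default and the repository URL are placeholders, and RAGLight's own fetching logic may differ; only the `git clone` flags used here are standard git.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GitHubSourceSketch:
    """Hypothetical stand-in describing a remote repository."""
    url: str
    branch: str = "main"  # assumed default, for illustration


def clone_command(source: GitHubSourceSketch, dest: str) -> list[str]:
    """Build a git command that fetches the repository on the requested branch."""
    # --depth 1 skips history, which ingestion does not need.
    return [
        "git", "clone", "--depth", "1",
        "--branch", source.branch, source.url, dest,
    ]


# Placeholder URL; substitute a real repository.
src = GitHubSourceSketch(url="https://github.com/example/repo.git", branch="dev")
cmd = clone_command(src, "/tmp/repo")
# Passing cmd to subprocess.run(cmd, check=True) would perform the actual clone.
```

Once the clone exists on disk, its files can be handed to the ingestion pipeline in the same way as a local folder.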
Definition
Example usage
Using multiple sources
You can combine multiple knowledge sources in a single pipeline.
How loaders integrate with pipelines
Loaders are consumed by pipelines during the build() phase.
Example with RAGPipeline:
During build():
- loaders are resolved
- files are passed to the vector store ingestion pipeline
- embeddings are computed and stored
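These three steps can be sketched end to end for a list of folder sources. Everything in this block is a hypothetical stand-in (`resolve`, `embed`, and `build` are not RAGLight functions), and the "embedding" is a toy two-number vector, chosen only so the example runs without a model.

```python
from pathlib import Path


def resolve(sources: list[str]) -> list[Path]:
    """Step 1: resolve each declared folder into the files it contains."""
    return sorted(p for s in sources for p in Path(s).rglob("*") if p.is_file())


def embed(text: str) -> list[float]:
    """Toy embedding (character and word counts); a real pipeline uses a model."""
    return [float(len(text)), float(len(text.split()))]


def build(sources: list[str]) -> dict[str, list[float]]:
    """Steps 2 and 3: hand files to ingestion, then compute and store embeddings."""
    index = {}
    for file in resolve(sources):
        # Keyed by file name for brevity; a real index would use unique IDs.
        index[file.name] = embed(file.read_text(encoding="utf-8"))
    return index
```

Passing several folders to `build` mirrors combining multiple knowledge sources: each one is resolved independently and all resulting files land in the same index.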
Ignore folders
When loading folder-based sources, some directories should not be indexed. RAGLight supports ignore rules via:
- default ignore list: Settings.DEFAULT_IGNORE_FOLDERS
- per-pipeline overrides
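Ignore rules of this kind are commonly implemented by pruning directories during traversal. In the sketch below, `DEFAULT_IGNORE` is an assumed example list (the actual contents of Settings.DEFAULT_IGNORE_FOLDERS are not shown in this section), and `walk_files` with its `ignore` parameter is a hypothetical helper standing in for a per-pipeline override.

```python
import os
from typing import Optional

# Assumed example values; not the real contents of Settings.DEFAULT_IGNORE_FOLDERS.
DEFAULT_IGNORE = [".git", "node_modules", "__pycache__"]


def walk_files(root: str, ignore: Optional[list[str]] = None) -> list[str]:
    """List files under root, skipping any directory whose name is ignored."""
    skipped = set(DEFAULT_IGNORE if ignore is None else ignore)
    found = []
    for current, dirs, files in os.walk(root):
        # Editing dirs in place stops os.walk from descending into them.
        dirs[:] = [d for d in dirs if d not in skipped]
        found.extend(os.path.join(current, name) for name in files)
    return sorted(found)
```

Pruning at the directory level avoids reading ignored trees at all, which matters for large folders such as dependency caches.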
Loaders vs Document Processors
It is important to distinguish the two:
- Loaders decide where data comes from
- Document processors decide how files are interpreted
For example:
- FolderSource exposes a .pdf
- PDFProcessor decides how to chunk and parse it
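This split can be sketched as a lookup from file extension to processor. The `PROCESSORS` registry and both functions are hypothetical, and the PDF branch is a placeholder rather than a real parser; the sketch only shows that a path exposed by a loader is interpreted by a separate component.

```python
from pathlib import Path
from typing import Callable


def process_text(path: Path) -> list[str]:
    """Hypothetical text processor: one chunk per non-empty paragraph."""
    text = path.read_text(encoding="utf-8")
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def process_pdf(path: Path) -> list[str]:
    """Placeholder: a real PDF processor would extract and chunk page text."""
    raise NotImplementedError("PDF parsing is out of scope for this sketch")


# Hypothetical registry: the loader supplies paths, the suffix picks the processor.
PROCESSORS: dict[str, Callable[[Path], list[str]]] = {
    ".txt": process_text,
    ".md": process_text,
    ".pdf": process_pdf,
}


def interpret(path: Path) -> list[str]:
    """Dispatch a file exposed by a loader to the processor for its type."""
    return PROCESSORS[path.suffix.lower()](path)
```

Adding support for a new file type then means registering one more processor, with no change to how sources are declared.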
Summary
- Loaders define data origins (folders, repositories)
- They are declarative and lightweight
- They feed the ingestion pipeline used by vector stores
- RAGLight currently supports FolderSource and GitHubSource
- Multiple sources can be combined in a single pipeline