https://github.com/jonathanfavorite/ragamuffin
A lightweight, cross-platform .NET library for building RAG (Retrieval-Augmented Generation) pipelines with local embedding models and SQLite vector storage. Perfect for developers who need privacy-focused, offline-capable document search and AI-powered question answering without external API dependencies.
ai chunking document-processing dotnet embedding-models fluent-api local-ai metadata ml nlp offline-ai onnx pdf-processing privacy-focused rag retrieval-augmented-generation semantic-search sqlite vector-database vector-search
- Host: GitHub
- URL: https://github.com/jonathanfavorite/ragamuffin
- Owner: jonathanfavorite
- License: apache-2.0
- Created: 2025-06-27T01:22:35.000Z (12 months ago)
- Default Branch: master
- Last Pushed: 2025-07-08T01:25:12.000Z (11 months ago)
- Last Synced: 2025-07-08T03:06:55.418Z (11 months ago)
- Topics: ai, chunking, document-processing, dotnet, embedding-models, fluent-api, local-ai, metadata, ml, nlp, offline-ai, onnx, pdf-processing, privacy-focused, rag, retrieval-augmented-generation, semantic-search, sqlite, vector-database, vector-search
- Language: C#
- Homepage: https://www.nuget.org/packages/RAGamuffin
- Size: 6.75 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README

A lightweight, cross-platform .NET library for building RAG (Retrieval-Augmented Generation) pipelines with local embedding models and SQLite vector storage.
## 🚀 Features
- **Local Embedding Models**: Use ONNX models for offline, privacy-focused embeddings
- **SQLite Vector Storage**: Lightweight, file-based vector database with no external dependencies
- **Multi-Format Support**: Process PDFs and text files with intelligent chunking
- **Flexible Training Strategies**: Retrain from scratch, incremental updates, or add-only modes
- **Real-time Ingestion**: Stream text content directly into your vector store
- **Metadata Preservation**: Maintain document context and metadata throughout the pipeline
- **Cross-Platform**: Works on Windows, macOS, and Linux with .NET 8.0+
## 🎯 Quick Start
### Installation
```bash
dotnet add package RAGamuffin
```
### Basic Usage
```csharp
using RAGamuffin.Builders;
using RAGamuffin.Core;
using RAGamuffin.Embedding;
using RAGamuffin.Enums;

// 1. Set up your embedding model (download from HuggingFace)
var embedder = new OnnxEmbedder("path/to/model.onnx", "path/to/tokenizer.json");

// 2. Configure your vector database
var vectorDb = new SqliteDatabaseModel("documents.db", "my_collection");

// 3. Build and train your pipeline
var pipeline = new IngestionTrainingBuilder()
    .WithEmbeddingModel(embedder)
    .WithVectorDatabase(vectorDb)
    .WithTrainingStrategy(TrainingStrategy.RetrainFromScratch)
    .WithTrainingFiles(new[] { "document.pdf" })
    .Build();

var ingestedItems = await pipeline.Train();

// 4. Search your documents (return the top 5 matching texts)
string[] results = await pipeline.SearchAndReturnTexts("What is the company policy?", 5);
```
### Real-time Text Ingestion
```csharp
// Stream text content directly into your vector store
var textItems = new[]
{
    new TextItem("Meeting notes from Q1", "Q1 was successful with 15% growth..."),
    new TextItem("Product roadmap", "Next quarter we'll launch feature X...")
};
var (ingestedItems, model) = await pipeline.TrainWithText(textItems);
```
### Search Existing Vector Store
```csharp
// Search without retraining
var vectorStore = new SqliteVectorStoreProvider("documents.db", "my_collection");
var searchResults = await vectorStore.SearchAsync("your query", embedder, 5);
// Get metadata
var metadata = await vectorStore.GetAllDocumentsMetadataAsync();
```
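For intuition about what a search call does, vector search generally embeds the query and ranks the stored chunks by how close their vectors are to it. The sketch below is not RAGamuffin's implementation. It assumes cosine similarity as the metric, uses made-up 3-dimensional vectors, and `CosineSimilarity` and `TopK` are hypothetical helper names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Cosine similarity between two equal-length vectors.
double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same dimension.");
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) return 0; // treat similarity with a zero vector as 0
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Return the k stored texts most similar to the query vector, best first.
List<(string Text, double Score)> TopK(
    float[] query, IEnumerable<(string Text, float[] Vector)> store, int k)
{
    return store
        .Select(item => (item.Text, Score: CosineSimilarity(query, item.Vector)))
        .OrderByDescending(r => r.Score)
        .Take(k)
        .ToList();
}

// Toy "embeddings" (hypothetical; real models emit hundreds of dimensions).
var corpus = new List<(string Text, float[] Vector)>
{
    ("vacation policy", new float[] { 1f, 0f, 0f }),
    ("expense policy",  new float[] { 0.8f, 0.6f, 0f }),
    ("lunch menu",      new float[] { 0f, 0f, 1f }),
};

var hits = TopK(new float[] { 1f, 0f, 0f }, corpus, 2);
foreach (var (text, score) in hits)
    Console.WriteLine($"{score:F2}  {text}");
```

In this toy data the query vector equals the one stored for "vacation policy", so that entry ranks first, followed by "expense policy".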
## 📚 Examples
Check out the comprehensive examples in the `Examples/` directory:
- **[TrainAndSearch](Examples/RAGamuffin.Examples.TrainAndSearch/)**: Complete RAG pipeline with training and search
- **[SearchExistingVectorStore](Examples/RAGamuffin.Examples.SearchExistingVectorStore/)**: Query existing vector stores with metadata
- **[IncrementalTraining](Examples/RAGamuffin.Examples.IncrementalTraining/)**: Add new documents to existing collections
- **[RealTimeIngestion](Examples/RAGamuffin.Examples.RealTimeIngestion/)**: Stream text content in real-time
- **[MetadataRetrieval](Examples/RAGamuffin.Examples.MetadataRetrieval/)**: Work with document metadata and statistics
## 🔧 Configuration
### Embedding Models
RAGamuffin supports ONNX models for cross-platform compatibility. Recommended starter model:
- **Model**: `all-mpnet-base-v2` from HuggingFace
- **Download**: [Model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2/blob/main/onnx/model.onnx) | [Tokenizer](https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/tokenizer.json)
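Sentence-transformers models such as `all-mpnet-base-v2` are usually turned into a single sentence vector by mean-pooling the per-token outputs and then L2-normalizing the result. The sketch below illustrates that post-processing on toy numbers. Whether RAGamuffin's `OnnxEmbedder` performs exactly these steps is an assumption, and `MeanPool` and `L2Normalize` are hypothetical helper names.

```csharp
using System;

// Average the token vectors whose attention-mask entry is 1, ignoring padding.
float[] MeanPool(float[][] tokenEmbeddings, int[] attentionMask)
{
    int dim = tokenEmbeddings[0].Length;
    var pooled = new float[dim];
    int counted = 0;
    for (int t = 0; t < tokenEmbeddings.Length; t++)
    {
        if (attentionMask[t] == 0) continue; // skip padding tokens
        for (int d = 0; d < dim; d++) pooled[d] += tokenEmbeddings[t][d];
        counted++;
    }
    for (int d = 0; d < dim; d++) pooled[d] /= Math.Max(counted, 1);
    return pooled;
}

// Scale a vector to unit length so that a dot product equals cosine similarity.
float[] L2Normalize(float[] v)
{
    double sumSq = 0;
    foreach (float x in v) sumSq += x * x;
    float norm = (float)Math.Sqrt(sumSq);
    if (norm == 0f) return (float[])v.Clone();
    var result = new float[v.Length];
    for (int i = 0; i < v.Length; i++) result[i] = v[i] / norm;
    return result;
}

// Three tokens with 2-dimensional vectors; the last token is padding.
var tokens = new[]
{
    new float[] { 1f, 2f },
    new float[] { 5f, 6f },
    new float[] { 9f, 9f },
};
var mask = new[] { 1, 1, 0 };

float[] sentence = L2Normalize(MeanPool(tokens, mask));
Console.WriteLine(string.Join(", ", sentence));
```

Here the two unmasked tokens average to (3, 4), which normalizes to roughly (0.6, 0.8). The padding token does not affect the result.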
### Training Strategies
- **RetrainFromScratch**: Drop all existing data and retrain
- **IncrementalAdd**: Add new documents (skip if exists)
- **IncrementalUpdate**: Add new documents and update existing ones
- **ProcessOnly**: Only process documents, no vector operations
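To make the difference between the strategies concrete, the sketch below applies each one to a toy in-memory store keyed by document name. It mirrors only the one-line descriptions above and is not how RAGamuffin implements them. The `Apply` helper, the string strategy names, and the dictionary store are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Apply a strategy to a toy store and return the document ids that were written.
List<string> Apply(string strategy, Dictionary<string, string> target,
                   Dictionary<string, string> newDocs)
{
    var written = new List<string>();
    switch (strategy)
    {
        case "RetrainFromScratch":
            target.Clear(); // drop all existing data first
            foreach (var doc in newDocs) { target[doc.Key] = doc.Value; written.Add(doc.Key); }
            break;
        case "IncrementalAdd":
            foreach (var doc in newDocs)
                if (target.TryAdd(doc.Key, doc.Value)) written.Add(doc.Key); // skip if it exists
            break;
        case "IncrementalUpdate":
            foreach (var doc in newDocs) { target[doc.Key] = doc.Value; written.Add(doc.Key); } // add or overwrite
            break;
        case "ProcessOnly":
            break; // documents would be processed, but nothing is stored
        default:
            throw new ArgumentException($"Unknown strategy: {strategy}");
    }
    return written;
}

var existing = new Dictionary<string, string> { ["a.pdf"] = "old A", ["b.pdf"] = "old B" };
var incoming = new Dictionary<string, string> { ["b.pdf"] = "new B", ["c.pdf"] = "new C" };

var added = Apply("IncrementalAdd", existing, incoming);
Console.WriteLine($"written: {string.Join(", ", added)}; b.pdf = {existing["b.pdf"]}");
```

With `IncrementalAdd`, only `c.pdf` is written and `b.pdf` keeps its old content. `IncrementalUpdate` would overwrite `b.pdf` as well, and `RetrainFromScratch` would also remove `a.pdf`.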
### Chunking Options
```csharp
// Fragment: chain these calls onto the pipeline builder before .Build()

// PDF processing options
.WithPdfOptions(new PdfHybridParagraphIngestionOptions
{
    MinSize = 0,        // Minimum chunk size
    MaxSize = 800,      // Maximum chunk size
    Overlap = 400,      // Overlap between chunks
    UseMetadata = true  // Include document metadata
})

// Text processing options
.WithTextOptions(new TextHybridParagraphIngestionOptions
{
    MinSize = 500,      // Minimum chunk size
    MaxSize = 800,      // Maximum chunk size
    Overlap = 400,      // Overlap between chunks
    UseMetadata = true  // Include document metadata
})
```
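To see how `MaxSize` and `Overlap` interact, here is a minimal sliding-window chunker. It assumes sizes are measured in characters and ignores paragraph boundaries and `MinSize`, so it is only a rough illustration, not the library's hybrid paragraph algorithm. `ChunkWithOverlap` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;

// Split text into windows of at most maxSize characters, where each window
// starts (maxSize - overlap) characters after the previous one.
List<string> ChunkWithOverlap(string text, int maxSize, int overlap)
{
    if (maxSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxSize));
    if (overlap < 0 || overlap >= maxSize)
        throw new ArgumentOutOfRangeException(nameof(overlap), "Overlap must be in [0, maxSize).");

    var chunks = new List<string>();
    int step = maxSize - overlap;
    for (int start = 0; start < text.Length; start += step)
    {
        int length = Math.Min(maxSize, text.Length - start);
        chunks.Add(text.Substring(start, length));
        if (start + length >= text.Length) break; // this window reached the end
    }
    return chunks;
}

var sample = new string('x', 2000);
var pieces = ChunkWithOverlap(sample, maxSize: 800, overlap: 400);
Console.WriteLine($"{pieces.Count} chunks");
```

With `MaxSize = 800` and `Overlap = 400`, each window advances by 400 characters, so most of the text lands in two chunks. That improves recall near chunk boundaries at the cost of storing roughly twice as many embedded characters.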
## 🏗️ Architecture
RAGamuffin is built with a modular architecture:
- **Abstractions**: Clean interfaces for embedding, ingestion, and vector storage
- **Core**: Main pipeline logic and data models
- **Embedding**: ONNX-based embedding providers
- **Ingestion**: PDF and text processing engines
- **VectorStores**: SQLite vector database implementation
- **Builders**: Fluent API for pipeline configuration
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## 📄 License
This project is licensed under the Apache License 2.0 - see the [LICENSE.txt](LICENSE.txt) file for details.
## 🔗 Related Projects
- **[InstructSharp](https://github.com/jonathanfavorite/InstructSharp)**: LLM client library for .NET
- **[PdfPig](https://github.com/UglyToad/PdfPig)**: PDF processing library
- **[Microsoft.ML.OnnxRuntime](https://github.com/microsoft/onnxruntime)**: ONNX model inference
---
**RAGamuffin** - Making RAG pipelines simple and accessible for .NET developers.