https://github.com/jonathanfavorite/ragamuffin

A lightweight, cross-platform .NET library for building RAG (Retrieval-Augmented Generation) pipelines with local embedding models and SQLite vector storage. Perfect for developers who need privacy-focused, offline-capable document search and AI-powered question answering without external API dependencies.

Host: GitHub
URL: https://github.com/jonathanfavorite/ragamuffin
Owner: jonathanfavorite
License: apache-2.0
Created: 2025-06-27T01:22:35.000Z (about 1 year ago)
Default Branch: master
Last Pushed: 2025-07-08T01:25:12.000Z (12 months ago)
Last Synced: 2025-07-08T03:06:55.418Z (12 months ago)
Topics: ai, chunking, document-processing, dotnet, embedding-models, fluent-api, local-ai, metadata, ml, nlp, offline-ai, onnx, pdf-processing, privacy-focused, rag, retrieval-augmented-generation, semantic-search, sqlite, vector-database, vector-search
Language: C#
Homepage: https://www.nuget.org/packages/RAGamuffin
Size: 6.75 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE.txt


![RAGamuffin Banner](https://raw.githubusercontent.com/jonathanfavorite/RAGamuffin/master/assets/banner.jpg)

[![NuGet Version](https://img.shields.io/nuget/v/RAGamuffin?style=for-the-badge&color=brightgreen)](https://www.nuget.org/packages/RAGamuffin)  [![Build Status](https://img.shields.io/github/actions/workflow/status/jonathanfavorite/RAGamuffin/build.yml?style=for-the-badge)](https://github.com/jonathanfavorite/RAGamuffin/actions)  [![License](https://img.shields.io/github/license/jonathanfavorite/RAGamuffin?style=for-the-badge&color=blue)](LICENSE.txt)

A lightweight, cross-platform .NET library for building RAG (Retrieval-Augmented Generation) pipelines with local embedding models and SQLite vector storage.

## 🚀 Features

- **Local Embedding Models**: Use ONNX models for offline, privacy-focused embeddings

- **SQLite Vector Storage**: Lightweight, file-based vector database with no external dependencies

- **Multi-Format Support**: Process PDFs and text files with intelligent chunking

- **Flexible Training Strategies**: Retrain from scratch, incremental updates, or add-only modes

- **Real-time Ingestion**: Stream text content directly into your vector store

- **Metadata Preservation**: Maintain document context and metadata throughout the pipeline

- **Cross-Platform**: Works on Windows, macOS, and Linux with .NET 8.0+

## 🎯 Quick Start

### Installation

```bash
dotnet add package RAGamuffin
```

### Basic Usage

```csharp
using RAGamuffin.Builders;
using RAGamuffin.Core;
using RAGamuffin.Embedding;
using RAGamuffin.Enums;

// 1. Set up your embedding model (download from HuggingFace)
var embedder = new OnnxEmbedder("path/to/model.onnx", "path/to/tokenizer.json");

// 2. Configure your vector database
var vectorDb = new SqliteDatabaseModel("documents.db", "my_collection");

// 3. Build and train your pipeline
var pipeline = new IngestionTrainingBuilder()
    .WithEmbeddingModel(embedder)
    .WithVectorDatabase(vectorDb)
    .WithTrainingStrategy(TrainingStrategy.RetrainFromScratch)
    .WithTrainingFiles(new[] { "document.pdf" })
    .Build();

var ingestedItems = await pipeline.Train();

// 4. Search your documents
string[] results = await pipeline.SearchAndReturnTexts("What is the company policy?", 5);
```

### Real-time Text Ingestion

```csharp
// Stream text content directly into your vector store.
// Reuses the `pipeline` built in Basic Usage.
var textItems = new[]
{
    new TextItem("Meeting notes from Q1", "Q1 was successful with 15% growth..."),
    new TextItem("Product roadmap", "Next quarter we'll launch feature X...")
};

var (ingestedItems, model) = await pipeline.TrainWithText(textItems);
```

### Search Existing Vector Store

```csharp
// Search without retraining.
// Reuses the `embedder` created in Basic Usage to embed the query.
var vectorStore = new SqliteVectorStoreProvider("documents.db", "my_collection");
var searchResults = await vectorStore.SearchAsync("your query", embedder, 5);

// Get metadata
var metadata = await vectorStore.GetAllDocumentsMetadataAsync();
```
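Conceptually, a vector search such as `SearchAsync` embeds the query and ranks stored chunks by similarity to it. The sketch below illustrates that idea with brute-force cosine similarity over in-memory vectors. It is not RAGamuffin's implementation: `VectorSearchSketch`, `CosineSimilarity`, and `TopK` are hypothetical names, and the library's actual distance metric and storage layout are not described in this README.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of RAGamuffin: brute-force nearest-neighbour search.
public static class VectorSearchSketch
{
    // Cosine similarity of two equal-length vectors; returns 0 if either vector is all zeros.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Returns the texts of the k stored chunks most similar to the query vector, best first.
    public static string[] TopK(
        float[] query,
        IEnumerable<(string Text, float[] Vector)> chunks,
        int k)
    {
        return chunks
            .OrderByDescending(c => CosineSimilarity(query, c.Vector))
            .Take(k)
            .Select(c => c.Text)
            .ToArray();
    }
}
```

Calling `VectorSearchSketch.TopK(queryVector, chunks, 5)` mirrors the `5` passed to `SearchAsync` above, under the assumption that this argument is the number of results to return.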

## 📚 Examples

Check out the comprehensive examples in the `Examples/` directory:

- **[TrainAndSearch](Examples/RAGamuffin.Examples.TrainAndSearch/)**: Complete RAG pipeline with training and search

- **[SearchExistingVectorStore](Examples/RAGamuffin.Examples.SearchExistingVectorStore/)**: Query existing vector stores with metadata

- **[IncrementalTraining](Examples/RAGamuffin.Examples.IncrementalTraining/)**: Add new documents to existing collections

- **[RealTimeIngestion](Examples/RAGamuffin.Examples.RealTimeIngestion/)**: Stream text content in real-time

- **[MetadataRetrieval](Examples/RAGamuffin.Examples.MetadataRetrieval/)**: Work with document metadata and statistics

## 🔧 Configuration

### Embedding Models

RAGamuffin supports ONNX models for cross-platform compatibility. Recommended starter model:

- **Model**: `all-mpnet-base-v2` from Hugging Face

- **Download**: [Model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/onnx/model.onnx) | [Tokenizer](https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/tokenizer.json)
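ONNX exports of sentence-embedding models generally return one vector per token, and the `all-mpnet-base-v2` model card describes mean pooling over the attention mask followed by L2 normalization to obtain a single sentence vector. The sketch below shows that step in isolation. Whether `OnnxEmbedder` performs it internally is not stated in this README, so `PoolingSketch` and `MeanPoolAndNormalize` are hypothetical names used for illustration only.

```csharp
using System;

// Hypothetical helper, not part of RAGamuffin: turns per-token vectors into one sentence vector.
public static class PoolingSketch
{
    // tokenEmbeddings: one vector per token; attentionMask: 1 for real tokens, 0 for padding.
    public static float[] MeanPoolAndNormalize(float[][] tokenEmbeddings, int[] attentionMask)
    {
        if (tokenEmbeddings.Length == 0)
            throw new ArgumentException("At least one token is required.");
        if (tokenEmbeddings.Length != attentionMask.Length)
            throw new ArgumentException("Mask length must match the number of tokens.");

        int dim = tokenEmbeddings[0].Length;
        var pooled = new float[dim];
        int count = 0;

        // Sum the vectors of non-padding tokens only.
        for (int t = 0; t < tokenEmbeddings.Length; t++)
        {
            if (attentionMask[t] == 0) continue;
            count++;
            for (int d = 0; d < dim; d++)
                pooled[d] += tokenEmbeddings[t][d];
        }

        // Every token was padding: return the zero vector.
        if (count == 0) return pooled;

        // Average, then scale to unit length.
        double sumSquares = 0;
        for (int d = 0; d < dim; d++)
        {
            pooled[d] /= count;
            sumSquares += pooled[d] * pooled[d];
        }

        double norm = Math.Sqrt(sumSquares);
        if (norm > 0)
        {
            for (int d = 0; d < dim; d++)
                pooled[d] = (float)(pooled[d] / norm);
        }
        return pooled;
    }
}
```

With unit-length vectors, cosine similarity reduces to a dot product, which is one reason normalization is commonly applied before storing embeddings.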

### Training Strategies

- **RetrainFromScratch**: Drop all existing data and retrain

- **IncrementalAdd**: Add new documents (skip if exists)

- **IncrementalUpdate**: Add new documents and update existing ones

- **ProcessOnly**: Only process documents, no vector operations
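The four strategies above differ mainly in how they treat documents that already exist in the collection. The following sketch models that behaviour against a plain in-memory dictionary, as a way to reason about the differences. It is an illustration under stated assumptions, not the library's code: `SketchStrategy` and `StrategySketch.Apply` are hypothetical, and documents are assumed to be identified by a string id.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mirror of the four strategy names above; not the library's own enum.
public enum SketchStrategy { RetrainFromScratch, IncrementalAdd, IncrementalUpdate, ProcessOnly }

public static class StrategySketch
{
    // Applies incoming documents (id -> content) to an in-memory store according to the strategy.
    // Returns the ids that were written to the store.
    public static List<string> Apply(
        SketchStrategy strategy,
        Dictionary<string, string> store,
        IReadOnlyDictionary<string, string> incoming)
    {
        var written = new List<string>();

        // ProcessOnly: documents are processed, but the store is left untouched.
        if (strategy == SketchStrategy.ProcessOnly) return written;

        // RetrainFromScratch: drop everything before ingesting.
        if (strategy == SketchStrategy.RetrainFromScratch) store.Clear();

        foreach (var (id, content) in incoming)
        {
            bool exists = store.ContainsKey(id);

            // IncrementalAdd: skip documents that are already present.
            if (exists && strategy == SketchStrategy.IncrementalAdd) continue;

            // IncrementalUpdate and RetrainFromScratch: insert or overwrite.
            store[id] = content;
            written.Add(id);
        }
        return written;
    }
}
```

In this model, `IncrementalAdd` is the cheapest choice when source files never change, while `IncrementalUpdate` trades extra writes for keeping edited documents current.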

### Chunking Options

```csharp
// Both calls are builder methods: chain them before .Build().

// PDF processing options
.WithPdfOptions(new PdfHybridParagraphIngestionOptions
{
    MinSize = 0,        // Minimum chunk size
    MaxSize = 800,      // Maximum chunk size
    Overlap = 400,      // Overlap between chunks
    UseMetadata = true  // Include document metadata
})

// Text processing options
.WithTextOptions(new TextHybridParagraphIngestionOptions
{
    MinSize = 500,      // Minimum chunk size
    MaxSize = 800,      // Maximum chunk size
    Overlap = 400,      // Overlap between chunks
    UseMetadata = true  // Include document metadata
})
```
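To make the interaction between `MaxSize` and `Overlap` concrete, here is a minimal sliding-window chunker. It is a simplified sketch, not RAGamuffin's hybrid paragraph chunker: it measures sizes in characters (the README does not state the unit), ignores paragraph boundaries and `MinSize`, and `ChunkingSketch.Chunk` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not RAGamuffin's chunker: fixed-size character windows with overlap.
public static class ChunkingSketch
{
    public static List<string> Chunk(string text, int maxSize, int overlap)
    {
        if (maxSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxSize));
        if (overlap < 0 || overlap >= maxSize)
            throw new ArgumentOutOfRangeException(nameof(overlap), "Overlap must be in [0, maxSize).");

        var chunks = new List<string>();
        int step = maxSize - overlap; // how far the window advances each time

        for (int start = 0; start < text.Length; start += step)
        {
            int length = Math.Min(maxSize, text.Length - start);
            chunks.Add(text.Substring(start, length));

            // Stop once a window has reached the end of the text.
            if (start + length >= text.Length) break;
        }
        return chunks;
    }
}
```

Under these assumptions, `MaxSize = 800` with `Overlap = 400` advances the window by 400 characters per step, so most of the text ends up in two chunks. That improves recall for passages that straddle a boundary at the cost of roughly twice as many stored vectors.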

## 🏗️ Architecture

RAGamuffin is built with a modular architecture:

- **Abstractions**: Clean interfaces for embedding, ingestion, and vector storage

- **Core**: Main pipeline logic and data models

- **Embedding**: ONNX-based embedding providers

- **Ingestion**: PDF and text processing engines

- **VectorStores**: SQLite vector database implementation

- **Builders**: Fluent API for pipeline configuration

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📄 License

This project is licensed under the MIT License - see the [LICENSE.txt](LICENSE.txt) file for details.

## 🔗 Related Projects

- **[InstructSharp](https://github.com/jonathanfavorite/InstructSharp)**: LLM client library for .NET

- **[PdfPig](https://github.com/UglyToad/PdfPig)**: PDF processing library

- **[Microsoft.ML.OnnxRuntime](https://github.com/microsoft/onnxruntime)**: ONNX model inference

---

**RAGamuffin** - Making RAG pipelines simple and accessible for .NET developers.
