Projects in Awesome Lists tagged with chunking
A curated list of projects in awesome lists tagged with chunking.
https://github.com/chonkie-ai/chonkie
🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library
ai chunking etl nlp python rag retrieval semantic-segmentation text-chunking text-processing text-splitting vector-search
Last synced: 14 May 2025
https://github.com/jiesutd/NCRFpp
NCRF++, a Neural Sequence Labeling Toolkit. Easy to use for any sequence labeling task (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.
artificial-intelligence char-cnn char-rnn chunking cnn crf lstm lstm-crf named-entity-recognition natural-language-processing nbest ner neural-networks part-of-speech-tagger pytorch sequence-labeling
Last synced: 09 Apr 2025
https://github.com/jiesutd/ncrfpp
NCRF++, a Neural Sequence Labeling Toolkit. Easy to use for any sequence labeling task (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.
artificial-intelligence char-cnn char-rnn chunking cnn crf lstm lstm-crf named-entity-recognition natural-language-processing nbest ner neural-networks part-of-speech-tagger pytorch sequence-labeling
Last synced: 15 May 2025
https://github.com/bhavnicksm/chonkie
🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library
ai chunking rag retrieval-augmented-generation text-processing
Last synced: 31 Jul 2025
https://github.com/systemd/casync
Content-Addressable Data Synchronization Tool
archive chunking delivery download file-system http synchronization tar upload
Last synced: 15 May 2025
https://github.com/mirth/chonky
Fully neural approach for text chunking
ai chunking llms ml rag semantic-chunking text-splitter
Last synced: 14 Jan 2026
https://github.com/smooks/smooks
An extensible Java framework for building event-driven applications that break up XML and non-XML data into chunks for data integration
analytics chunking enterprise-integration etl event-driven java pipelines sax smooks stream-processing xml
Last synced: 13 May 2025
https://github.com/folbricht/desync
Alternative casync implementation
archive casync chunking golang synchronization
Last synced: 18 Apr 2026
https://github.com/isaacus-dev/semchunk
A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.
chunking isaacus nlp python semantic-chunking splitting text text-chunking text-splitting
Last synced: 15 May 2025
https://github.com/microsoft/rag-experiment-accelerator
The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and the RAG pattern.
acs azure chunking dense embedding evaluation experiment genai indexing information-retrieval llm openai rag sparse vectors
Last synced: 16 May 2025
https://github.com/lazyFrogLOL/llmdocparser
A package for parsing PDFs and analyzing their content using LLMs.
chunking document-analysis llm nlp ocr pdf-parser pdfparser rag text-chunking
Last synced: 01 Apr 2025
https://github.com/26hzhang/neural_sequence_labeling
A TensorFlow implementation of a Neural Sequence Labeling model, which is able to tackle sequence labeling tasks such as POS Tagging, Chunking, NER, Punctuation Restoration, etc.
chunking lstm-networks named-entity-recognition pos-tagger punctuation python3 sentence-boundary-detection sequence-labeling tensorflow
Last synced: 20 Aug 2025
https://github.com/swarmauri/swarmauri-sdk
a monorepo featuring modular microkernel frameworks and single purpose extensions
agents ai chunking factories llm-framework measures metrics modular monorepo nlp orchestration orchestration-framework parsing tooling tools vectors
Last synced: 20 May 2026
https://github.com/messkan/rag-chunk
A Python CLI to test, benchmark, and find the best RAG chunking strategy for your Markdown documents.
chunking document-chunking embedding-vectors ia langchain llm nlp python rag rag-pipeline retrieval-augmented-generation text-splitting vector-search
Last synced: 05 Mar 2026
https://github.com/jordicenzano/go-ts-segmenter
Live TS segmenter and HLS manifest creation in Go
chunk chunked chunking golang hls lhls transport-stream video
Last synced: 08 Jul 2025
https://github.com/jparkerweb/semantic-chunking
🍱 semantic-chunking ⇢ semantically create chunks from large document for passing to LLM workflows
chunking embeddings llm semantic-chunking text-chunking text-splitter text-splitting vector
Last synced: 01 May 2025
https://github.com/xtabbas/The-Ultimate-Boilerplate
webpack 2, react hotloader 3, react router v4, code splitting and more
boilerplate chunking hot-reloading react react-router-v4 reactrouter redux server-side-rendering webpack
Last synced: 06 Aug 2025
https://github.com/sammyjo20/laravel-chunkable-jobs
📑 Split Laravel jobs into multiple separate job chunks
chunking hacktoberfest jobs laravel php
Last synced: 26 Oct 2025
https://github.com/esastack/esa-restclient
An asynchronous event-driven HTTP client based on netty.
asynchronous chunking filter h2c haproxy http2 httpclient https interceptor netty retry
Last synced: 02 May 2025
https://github.com/ronomon/deduplication
Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in Javascript. Ships with extensive tests, a fuzz test and a benchmark.
chunking content-dependent deduplication nodejs
Last synced: 17 Aug 2025
https://github.com/neondatabase-labs/pgrag
Postgres extensions to support end-to-end Retrieval-Augmented Generation (RAG) pipelines
chunking embeddings pgrx postgresql rag
Last synced: 10 Oct 2025
https://github.com/drmingler/smart-llm-loader
smart-llm-loader is a lightweight yet powerful Python package that transforms any document into LLM-ready chunks. Spend less time on preprocessing headaches and more time building what matters. From RAG systems to chatbots to document Q&A, SmartLLMLoader handles the heavy lifting so you can focus on creating exceptional AI applications.
chatbot chunking claude gemini langchain llama-index markdown openai pdf-converter pdf-parser pdf-to-markdown rag
Last synced: 31 Jul 2025
https://github.com/speedyk-005/chunklet-py
One library to split them all: Sentence, Code, Docs. Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.
ai chunking chunks-algorithm chunks-processing code-chunking code-structure document-chunking natural-language-processing nlp rag text-splitting visualization
Last synced: 02 Mar 2026
https://github.com/bnosac/crfsuite
Labelling Sequential Data in Natural Language Processing with R - using CRFsuite
chunking conditional-random-fields crf crfsuite data-science intent-classification natural-language-processing ner nlp r r-package
Last synced: 15 Mar 2026
https://github.com/iscc/fastcdc-py
FastCDC implementation in Python https://pypi.org/project/fastcdc/
chunking chunking-algorithm content-dependent deduplication python
Last synced: 17 Feb 2026
https://github.com/danengelbrecht/longtail
Incremental asset delivery library
archive c chunking compression compression-library delivery download syncronization upload
Last synced: 03 Jul 2025
https://github.com/howardyclo/grammar-pattern
Extract and align grammar patterns from English sentences.
chunking grammar grammar-parser grammar-pattern grammar-rules shallow-parser
Last synced: 24 Oct 2025
https://github.com/DanEngelbrecht/longtail
Incremental asset delivery library
archive c chunking compression compression-library delivery download syncronization upload
Last synced: 06 Aug 2025
https://github.com/carlosplanchon/betterhtmlchunking
BetterHTMLChunking is a Python library for intelligent HTML segmentation. It builds a DOM tree from raw HTML and extracts content-rich regions of interest, making content analysis effortless. Great for LLM based processing.
ai chunking html llm splitting
Last synced: 15 Sep 2025
https://github.com/documentatom/documentatom
DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.
ai chunk chunking etl extraction extraction-transformation-and-loading parse parser semantic
Last synced: 31 Oct 2025
https://github.com/danengelbrecht/golongtail
Command line front end for longtail synchronization tool
archive chunking delivery download gcs gcs-bucket s3 s3-storage synchronization upload
Last synced: 06 Jul 2025
https://github.com/drittich/semanticslicer
🧠✂️ SemanticSlicer — A smart text chunker for LLM-ready documents.
ai azure-openai chat-gpt chatgpt chunker chunking embeddings gpt gpt-4 langchain llm openai text-chunking
Last synced: 21 Aug 2025
https://github.com/duriantaco/pykomodo
A Python-based parallel file chunking system designed for processing large codebases into LLM-friendly chunks.
Last synced: 13 Apr 2025
https://github.com/nftstorage/carbites
***Notice: This repository is no longer maintained.*** 🚗 🚙 🚕 Chunking for CAR files. Split a single CAR into multiple CARs.
car chunking cid ipld multiformats splitting
Last synced: 14 Jul 2025
https://github.com/zabuzard/fastcdc4j
Fast and efficient content-defined chunking for data deduplication. Java implementation of FastCDC as library.
cdc chunking content-defined-chunking data-deduplication fastcdc java library
Last synced: 05 Mar 2026
https://github.com/indyjo/cafs
Content-Addressable File System (used by BitWrk)
chunking deduplication download http rolling-hash synchronization upload
Last synced: 27 Dec 2025
https://github.com/remram44/cdchunking-rs
Content-Defined Chunking for Rust
chunk chunking rolling-hash-functions rust
Last synced: 17 Mar 2025
https://github.com/saltyrtc/chunked-dc-js
Binary chunking that can be reassembled out-of-order.
chunking javascript saltyrtc webrtc-datachannels
Last synced: 31 Jul 2025
https://github.com/jchunk-io/jchunk
JChunk is a lightweight and flexible library designed to provide multiple strategies for text chunking within Java applications
chunk chunking etl-pipeline java rag text-splitter text-splitting
Last synced: 10 Mar 2026
https://github.com/gregorbiswanger/semanticchunker.net
Embedding-driven, context-aware text chunking for Semantic Kernel and RAG workflows in .NET
ai chunking csharp dotnet embedding library llm rag semantic-kernel semanticchunker semantickernel slm text-chunking
Last synced: 12 Sep 2025
https://github.com/iprit/md-svg-vue
Material design icons by Google for Vue.js & Nuxt.js (server side support & inline svg with path)
bundling chunking icons inline-svg material-design server-side-rendering svg svg-icons vue vuejs2
Last synced: 13 Sep 2025
https://github.com/linuxscout/mishtar
Mishtar: Named and temporal entities chunker
arabic-language arabic-nlp chunking named-entity-recognition nlp temporal-entities-chunker
Last synced: 03 Aug 2025
https://github.com/yaroslav/inkmark
A very fast, feature-packed, AI-first Markdown (CommonMark/GFM) gem for Ruby, based on pulldown-cmark (Rust).
ai chunking commonmark commonmark-parsing llm markdown markdown-language markdown-parser markdown-to-html pulldown-cmark rag ruby rubyonrails rust
Last synced: 26 May 2026
https://github.com/skerkour/go-benchmarks
Comprehensive and reproducible benchmarks for Go developers and architects.
benchmark benchmarking benchmarks cdc chunking go golang hash hashing
Last synced: 12 Apr 2025
https://github.com/fd0/split
Split large files into smaller ones using deterministic Content Defined Chunking
Last synced: 18 Aug 2025
https://github.com/KernelPanic92/ngx-fastboot
ngx-fastboot is an Angular library designed to dynamically load configuration settings at runtime, optimizing application startup performance by offloading configurations to a separate compilation chunk.
angular boot chunk chunking configuration dynamic fastboot lazy npm performance providers typescript
Last synced: 05 Mar 2025
https://github.com/cckalen/intellichunk
Go Based Lightweight RAG / LLM Tool with CLI + API
ai apiserver chatbot chunking command-line-tool langchain llm
Last synced: 30 Dec 2025
https://github.com/kernelpanic92/ngx-fastboot
ngx-fastboot is an Angular library designed to dynamically load configuration settings at runtime, optimizing application startup performance by offloading configurations to a separate compilation chunk.
angular boot chunk chunking configuration dynamic fastboot lazy npm performance providers typescript
Last synced: 13 Oct 2025
https://github.com/gene-hightower/ghsmtp
Gene's SMTP server — receive Internet mail with less fuss
c-plus-plus chunking cpp cpp17 dkim dmarc rfc-5321 smtp smtp-client smtp-protocol smtp-server smtpd spf tls-support utf-8 utf8
Last synced: 24 Apr 2025
https://github.com/lelserslasers/minecraft
Minecraft clone with an infinite world generated from 3d perlin noise (no game engine)
3d chunking cpp infinite-world itch-io minecraft perlin-noise perlin-noise-3d raylib voxel voxel-engine
Last synced: 14 Oct 2025
https://github.com/ven0maus/flowvitae
Efficient library for managing 2D static and procedural grids in games.
2d cells chunked chunking chunks flowvitae game games generation grid infinite library memory-efficient monogame procedural procgen rendering sadconsole tiles
Last synced: 17 Mar 2025
https://github.com/dcarpintero/ai-engineering
AI Engineering: Annotated NBs to dive into Self-Attention, In-Context Learning, RAG, Knowledge-Graphs, Fine-Tuning, Model Optimization, and many more.
ai-engineering bert chunking embeddings fine-tuning generative-ai huggingface-transformers in-context-learning knowledge-graph langchain large-language-models llama3-1 model-quantization retrieval-augmented-generation self-attention transformer weights-and-biases
Last synced: 02 Mar 2026
https://github.com/adamfoneil/chunkupload
Library for implementing chunked uploads to Azure blob storage. Intended for use with DropzoneJS.
azure-storage chunking dropzonejs uploader
Last synced: 15 Jul 2025
https://github.com/jmaczan/bpe-tokenizer
Byte-Pair Encoding tokenizer for training large language models on huge datasets
bpe bpe-tokenizer byte-pair-encoding chunking deep-learning from-scratch large-language-models llm machine-learning python tokenizer
Last synced: 18 Sep 2025
https://github.com/lh0x00/docsifer
Docsifer is a powerful tool for converting various data formats into Markdown for applications such as indexing, text analysis, and more. It supports PDF, PowerPoint, Word, Excel, Images, Audio, HTML, and other text-based formats, and leverages LLMs to enhance performance.
analysis autogen chunking docsier documents emeddings indexing langchain llama-index markdown markitdown rag text-embeddings text-processing vector-database
Last synced: 23 Apr 2025
https://github.com/shelfio/array-chunk-by-size
Chunk array of objects by their size in JSON
arrays chunk chunking node-module npm-package splitting
Last synced: 25 Jun 2025
https://github.com/zchunk/zchunk-java
A java-native implementation of the zchunk file format
chunking compression datasaving file-transfer format
Last synced: 23 May 2026
https://github.com/gursv/url-summ
A URL summarizer, which summarizes the content of a URL with proper formatting. It uses 'sshleifer/distilbart-cnn-12-6', which is a distilled version of the BART model, specifically optimized for text summarization tasks, including CNN summarization.
ai beautifulsoup chunking formatted-text huggingface-models python3 smtp star-rating streamlit text-extraction text-summarization transformers url-summarization
Last synced: 23 Apr 2025
https://github.com/antoinelrnld/discord-rag
Easily create a RAG based on your Discord messages
ai artificial-intelligence bot chatgpt chunking discord discord-bot embedding genai generative-ai generative-artificial-intelligence langchain llm rag retrieval-augmented-generation vectorization
Last synced: 25 Jun 2025
https://github.com/mg98/ae-chunker-go
Go implementation of the AE chunking algorithm.
chunking chunking-algorithm go golang
Last synced: 25 Feb 2026
https://github.com/dcarpintero/generative-ai-101
Annotated Notebooks to dive into Self-Attention, In-Context Learning, RAG, Knowledge-Graphs, Fine-Tuning, Model Optimization, and many more.
bert chunking embeddings fine-tuning generative-ai huggingface-transformers in-context-learning knowledge-graph langchain large-language-models llama3-1 model-quantization retrieval-augmented-generation self-attention transformer weights-and-biases
Last synced: 14 Mar 2025
https://github.com/kathleenwest/filemanagerdemo
(File Manager – A Demo of a WCF Self-Hosted Service & Client "Tester" Windows Form Application Exchanging Files) This project presents a simple File Manager Service and Client Application demonstration. The File Manager is a self-hosted (service host) WCF application launched and managed with a simple console interface. The client “tester” has a simplified GUI user interface to quickly demo and test the service (Windows Form Application).
chunk chunking csharp csharp-code csharp-library file-management file-manager file-manager-application file-server file-sharing file-transfer file-upload filemanager filemanager-ui stream streaming wcf wcf-client wcf-service wcf-service-client-demo
Last synced: 25 Jul 2025
https://github.com/lynixtaxic/docsifer
Docsifer is a powerful tool for converting various data formats into Markdown for applications such as indexing, text analysis, and more. It supports PDF, PowerPoint, Word, Excel, Images, Audio, HTML, and other text-based formats, and leverages LLMs to enhance performance.
analysis autogen chunking docsier documents emeddings indexing langchain llama-index markdown markitdown rag text-embeddings text-processing vector-database
Last synced: 23 Apr 2025
https://github.com/akshayxml/google-file-system
Implemented Google File System from its research paper.
chunking distributed-systems file-sharing filesystem google-file-system python3 replication
Last synced: 17 May 2026
https://github.com/jonahwhaler/llm-agent-toolkit
LLM Agent Toolkit provides minimal, modular interfaces for core components in LLM-based applications.
agent chromadb chunking faiss llm modular-design ollama openai python tool-calling toolkit vision
Last synced: 05 May 2025
https://github.com/cemayan/async-chunk-reader
async-streams chunking streaming
Last synced: 23 Apr 2026
https://github.com/marlon360/nifty-uploader
⬆️ An easy file uploader for the Browser written in TypeScript
chunking javascript typescript uploader
Last synced: 12 Jul 2025
https://github.com/saltyrtc/chunked-dc-swift
Binary chunking that can be reassembled out-of-order.
Last synced: 15 May 2026
https://github.com/iprit/voxel-world
A three.js 3D world
blender chunking es6-javascript es7-async magicavoxel mmo-engine open-world-game skinned-animation three-js voxel
Last synced: 10 Jul 2025
https://github.com/xyproto/projectinfo
Given a directory of source code, find the project name, contributors, collect the source code and output it all in JSON chunks with an upper token limit
Last synced: 04 Jan 2026
https://github.com/simon-zerisenay/42_push_swap
Pushswap is a 42 project emphasizing efficient sorting by minimizing operations. Participants use a limited set of commands to manipulate stacks and achieve the desired sorted order, showcasing algorithm design and optimization skills while developing problem-solving abilities.
42 42pushswap c chunking cprogramming ecole42 linkedlist midpoint pushswap sorting-algorithms stacks struct
Last synced: 18 Oct 2025
https://github.com/atayahmet/blobify
A Javascript automation tool to convert data (file, image etc.) to blob object and vice-versa.
blob blob-files blob-image chunking data-chunk inmemory-cache
Last synced: 22 Mar 2025
https://github.com/craigwardman/chunkingredisclient
A C# library which implements various wrappers around the StackExchange.Redis client, specifically using Newtonsoft.Json serialisation; Such as streamed reading/writing and sliding expiration.
chunking csharp extensions json net-core newtonsoft-json redis redis-client stackexchange-redis wrapper-library
Last synced: 19 Aug 2025
https://github.com/chonkie-inc/mtcb
🤔 wondering if your chunks are good? 🦉 Judie is here to Judge and Evaluate your Chunks! ✨
ai benchmarking chunk chunking judge llm-evaluation observability rag
Last synced: 10 Mar 2026
https://github.com/print3m/chunkmap
ChunkMap is a command-line tool to split large Nmap scans into savable chunks.
chunking command-line-tool nmap nmap-automation nmap-script port-scanner portscanner python python-script
Last synced: 25 May 2026
https://github.com/isaka-james/chunks-to-file
A nodejs chunking system
chunk chunked-uploads chunking chunking-algorithm chunking-files chunks node-chunking nodejs nodejs-chunking
Last synced: 15 Jan 2026
https://github.com/mirpo/chopdoc
A tool to split documents into chunks for RAG and LLM applications
chunking data-engineering filtering gemini llm openai pipeline rag
Last synced: 02 May 2026
https://github.com/acj/file-chunker
Divide a file into evenly-sized chunks
chunking concurrency parallel text-processing
Last synced: 24 Feb 2026
https://github.com/parthapray/docling_rag_langchain_colab
This repo contains code for RAG using Docling in a Colab notebook with LangChain, Milvus, a Hugging Face embedding model and an LLM
all-minilm-l6-v2 chunking colab-notebook docling huggingface langchain large-language-models milvus pdf retrieval-augmented-generation sentence-transformers
Last synced: 18 May 2026
https://github.com/rse/chunking
Simple Task Chunking
chunking rate-limiting task throttling
Last synced: 16 Jul 2025
https://github.com/abitofhelp/optimized_adaptive_pipeline_rs
Adaptive Rust pipeline for high-throughput file processing—dynamic chunking, parallelism, AES/ChaCha encryption, backpressure, and Prometheus/tracing.
adaptive-concurrency backpressure chunking concurrency data-pipeline encryption file-processing metrics observability opentelemetry parallelism prometheus rust stream-processing tracing
Last synced: 05 Oct 2025
https://github.com/openvoiceos/quebra_frases
chunks strings into byte sized pieces
chunk chunking sentence-chunking tokenization tokenize tokenized tokenizer word-tokenizing
Last synced: 12 Mar 2026
https://github.com/stevewyl/chunk_segmentor
Word Segmenter with Noun Phrases, based on HanLP
Last synced: 15 May 2026
https://github.com/parthapray/docling_colab
This repo contains a Google Colab notebook for handling Docling for data extraction such as text, images, tables, etc.
chunk chunking colab-notebook docling docx embed extraction-data image lancedb markdown pdf pptx retrieval-augmented-generation table text transformers
Last synced: 16 May 2026
https://github.com/abitofhelp/adaptive_pipeline
Adaptive Rust pipeline for high-throughput file processing—dynamic chunking, parallelism, AES/ChaCha encryption, backpressure, and Prometheus/tracing.
adaptive-concurrency backpressure chunking concurrency data-pipeline encryption file-processing metrics observability opentelemetry parallelism prometheus rust stream-processing tracing
Last synced: 17 May 2026
https://github.com/saltyrtc/chunked-dc-java
Binary chunking that can be reassembled out-of-order.
Last synced: 26 Feb 2025
https://github.com/leo310/rag-chunking-evaluation
Assess the effectiveness of chunking strategies in RAG systems via a custom evaluation framework.
chunking evaluation-framework retrieval retrieval-augmented-generation
Last synced: 22 Jan 2026
https://github.com/kimtth/rag-multimodal-semantic-chunking
🖼️📄E2E Multi-modal Document Preprocessing for Search Indexing with Azure Document Intelligence
azure-document-intelligence chunking image-understanding rag-preparation workshop
Last synced: 05 Aug 2025
https://github.com/abeed04/rag-based-chat-with-pdf-using-llama3
Turn your PDFs into a conversation with Llama3's RAG-powered chat.
chunking faiss-vector-database googlegenerativeai groq langchain llama3 pycharm-community python-3 rag streamlit
Last synced: 09 Apr 2026
https://github.com/yuma-shintani/chunksize-checker
Calculate the number of total tokens, optimal chunk size and chunk overlap from any given document.
Last synced: 10 May 2026
https://github.com/i-partalas/industrial-rag-qna-benchmark
Benchmarking the performance of proprietary vs open-source LLMs in industrial QnA tasks using various RAG-based implementations and evaluation metrics.
azureopenai benchmarking chromadb chunking docker huggingface langchain large-language-models llms-benchmarking metrics openai pytorch retrieval-augmented-generation streamlit synthetic-dataset-generation
Last synced: 28 Jan 2026
https://github.com/hamolicious/chunky
A chunking system for game development
chunking chunks library pypi pypi-package python
Last synced: 24 Feb 2025
https://github.com/ayush585/smartchunk
SmartChunk is a lightweight, structure-aware semantic chunking toolkit designed to supercharge RAG (Retrieval-Augmented Generation) and LLM pipelines. Unlike naive splitters that break text arbitrarily, SmartChunk respects document structure (headings, lists, tables, code blocks) and semantic flow, ensuring cleaner, more coherent chunks.
agentic-workflow chunking chunking-algorithm cli llm nlp package pip rag semantic
Last synced: 07 Sep 2025
https://github.com/zircote/rlm-rs-plugin
Claude Code plugin for processing documents 100x larger than context limits using the Recursive Language Model pattern. Rust-powered chunking, hybrid semantic + BM25 search, and sub-LLM orchestration.
ai-agents bm25 chunking claude-code claude-code-plugin document-processing hybrid-search llm long-context recursive-language-model rlm rust semantic-search sqlite
Last synced: 08 Apr 2026
https://github.com/nihar3453/llm-transformers-and-rag
A hands-on suite for exploring and fine-tuning foundation models (Transformers, BERT, GPT-2, BART) and end-to-end RAG pipelines with attention visualizations, semantic search (ChromaDB/Weaviate), LangChain workflow demos.
bart bert chromadb chunking generative-ai hnsw huggingface-transformers langchain llms minigpt rag transformers vector-database weav
Last synced: 14 Apr 2026
https://github.com/ziffan/chunklab
ChunkLab is a powerful browser-based sandbox designed for developers to test, visualize, and validate text chunking pipeline configurations. Optimize your RAG (Retrieval-Augmented Generation) ingestion process with real-time feedback and detailed metrics.
ai chunking data-preprocessing developer-tools embeddings fastapi llm nlp playground python rag react regex sandbox text-processing tiktoken tokenization vector-database
Last synced: 02 May 2026
https://github.com/nathadriele/acmr-rag-rename-mbausp
Final course project for the MBA in Data Science and Analytics at USP/ESALQ, class of 2023. It develops an information retrieval system based on LLMs and RAG, applied to the RENAME list of essential medicines. The prototype uses embeddings, vector databases and LangChain, with evaluation performed using the RAGAS framework.
all-minilm-l6-v2 analytics chunking data-science gemma-2-9b-it genai groq langchain langchain-agent llama3 llm mixtral-8x7b pinecone postgresql rag ragas rename scraping streamlit usp
Last synced: 04 Apr 2026
https://github.com/jonathanfavorite/ragamuffin
A lightweight, cross-platform .NET library for building RAG (Retrieval-Augmented Generation) pipelines with local embedding models and SQLite vector storage. Perfect for developers who need privacy-focused, offline-capable document search and AI-powered question answering without external API dependencies.
ai chunking document-processing dotnet embedding-models fluent-api local-ai metadata ml nlp offline-ai onnx pdf-processing privacy-focused rag retrieval-augmented-generation semantic-search sqlite vector-database vector-search
Last synced: 02 Jun 2026