An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with chunking

A curated list of projects in awesome lists tagged with chunking.

https://github.com/chonkie-ai/chonkie

🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library

ai chunking etl nlp python rag retrieval semantic-segmentation text-chunking text-processing text-splitting vector-search

Last synced: 14 May 2025

https://github.com/jiesutd/NCRFpp

NCRF++, a Neural Sequence Labeling Toolkit. Easy to use for any sequence labeling task (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.

artificial-intelligence char-cnn char-rnn chunking cnn crf lstm lstm-crf named-entity-recognition natural-language-processing nbest ner neural-networks part-of-speech-tagger pytorch sequence-labeling

Last synced: 09 Apr 2025

https://github.com/jiesutd/ncrfpp

NCRF++, a Neural Sequence Labeling Toolkit. Easy to use for any sequence labeling task (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.

artificial-intelligence char-cnn char-rnn chunking cnn crf lstm lstm-crf named-entity-recognition natural-language-processing nbest ner neural-networks part-of-speech-tagger pytorch sequence-labeling

Last synced: 15 May 2025

https://github.com/bhavnicksm/chonkie

🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library

ai chunking rag retrieval-augmented-generation text-processing

Last synced: 31 Jul 2025

https://github.com/systemd/casync

Content-Addressable Data Synchronization Tool

archive chunking delivery download file-system http synchronization tar upload

Last synced: 15 May 2025

https://github.com/mirth/chonky

Fully neural approach for text chunking

ai chunking llms ml rag semantic-chunking text-splitter

Last synced: 14 Jan 2026

https://github.com/smooks/smooks

An extensible Java framework for building event-driven applications that break up XML and non-XML data into chunks for data integration

analytics chunking enterprise-integration etl event-driven java pipelines sax smooks stream-processing xml

Last synced: 13 May 2025

https://github.com/folbricht/desync

Alternative casync implementation

archive casync chunking golang synchronization

Last synced: 18 Apr 2026

https://github.com/isaacus-dev/semchunk

A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.

chunking isaacus nlp python semantic-chunking splitting text text-chunking text-splitting

Last synced: 15 May 2025

https://github.com/microsoft/rag-experiment-accelerator

The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and RAG pattern.

acs azure chunking dense embedding evaluation experiment genai indexing information-retrieval llm openai rag sparse vectors

Last synced: 16 May 2025

https://github.com/lazyFrogLOL/llmdocparser

A package for parsing PDFs and analyzing their content using LLMs.

chunking document-analysis llm nlp ocr pdf-parser pdfparser rag text-chunking

Last synced: 01 Apr 2025

https://github.com/26hzhang/neural_sequence_labeling

A TensorFlow implementation of a Neural Sequence Labeling model, which can tackle sequence labeling tasks such as POS Tagging, Chunking, NER, and Punctuation Restoration.

chunking lstm-networks named-entity-recognition pos-tagger punctuation python3 sentence-boundary-detection sequence-labeling tensorflow

Last synced: 20 Aug 2025

https://github.com/swarmauri/swarmauri-sdk

A monorepo featuring modular microkernel frameworks and single-purpose extensions

agents ai chunking factories llm-framework measures metrics modular monorepo nlp orchestration orchestration-framework parsing tooling tools vectors

Last synced: 20 May 2026

https://github.com/messkan/rag-chunk

A Python CLI to test, benchmark, and find the best RAG chunking strategy for your Markdown documents.

chunking document-chunking embedding-vectors ia langchain llm nlp python rag rag-pipeline retrieval-augmented-generation text-splitting vector-search

Last synced: 05 Mar 2026

https://github.com/jordicenzano/go-ts-segmenter

Live TS segmenter and HLS manifest creation in Go

chunk chunked chunking golang hls lhls transport-stream video

Last synced: 08 Jul 2025

https://github.com/jparkerweb/semantic-chunking

🍱 semantic-chunking ⇢ semantically create chunks from large document for passing to LLM workflows

chunking embeddings llm semantic-chunking text-chunking text-splitter text-splitting vector

Last synced: 01 May 2025

https://github.com/xtabbas/The-Ultimate-Boilerplate

webpack 2, react hotloader 3, react router v4, code splitting and more

boilerplate chunking hot-reloading react react-router-v4 reactrouter redux server-side-rendering webpack

Last synced: 06 Aug 2025

https://github.com/sammyjo20/laravel-chunkable-jobs

📑 Split Laravel jobs into multiple separate job chunks

chunking hacktoberfest jobs laravel php

Last synced: 26 Oct 2025

https://github.com/esastack/esa-restclient

An asynchronous event-driven HTTP client based on netty.

asynchronous chunking filter h2c haproxy http2 httpclient https interceptor netty retry

Last synced: 02 May 2025

https://github.com/ronomon/deduplication

Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in Javascript. Ships with extensive tests, a fuzz test and a benchmark.

chunking content-dependent deduplication nodejs

Last synced: 17 Aug 2025

https://github.com/neondatabase-labs/pgrag

Postgres extensions to support end-to-end Retrieval-Augmented Generation (RAG) pipelines

chunking embeddings pgrx postgresql rag

Last synced: 10 Oct 2025

https://github.com/drmingler/smart-llm-loader

smart-llm-loader is a lightweight yet powerful Python package that transforms any document into LLM-ready chunks. Spend less time on preprocessing headaches and more time building what matters. From RAG systems to chatbots to document Q&A, SmartLLMLoader handles the heavy lifting so you can focus on creating exceptional AI applications.

chatbot chunking claude gemini langchain llama-index markdown openai pdf-converter pdf-parser pdf-to-markdown rag

Last synced: 31 Jul 2025

https://github.com/speedyk-005/chunklet-py

One library to split them all: Sentence, Code, Docs. Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.

ai chunking chunks-algorithm chunks-processing code-chunking code-structure document-chunking natural-language-processing nlp rag text-splitting visualization

Last synced: 02 Mar 2026

https://github.com/bnosac/crfsuite

Labelling Sequential Data in Natural Language Processing with R - using CRFsuite

chunking conditional-random-fields crf crfsuite data-science intent-classification natural-language-processing ner nlp r r-package

Last synced: 15 Mar 2026

https://github.com/iscc/fastcdc-py

FastCDC implementation in Python https://pypi.org/project/fastcdc/

chunking chunking-algorithm content-dependent deduplication python

Last synced: 17 Feb 2026

https://github.com/howardyclo/grammar-pattern

Extract and align grammar patterns from English sentences.

chunking grammar grammar-parser grammar-pattern grammar-rules shallow-parser

Last synced: 24 Oct 2025

https://github.com/carlosplanchon/betterhtmlchunking

BetterHTMLChunking is a Python library for intelligent HTML segmentation. It builds a DOM tree from raw HTML and extracts content-rich regions of interest, making content analysis effortless. Great for LLM based processing.

ai chunking html llm splitting

Last synced: 15 Sep 2025

https://github.com/documentatom/documentatom

DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.

ai chunk chunking etl extraction extraction-transformation-and-loading parse parser semantic

Last synced: 31 Oct 2025

https://github.com/danengelbrecht/golongtail

Command line front end for longtail synchronization tool

archive chunking delivery download gcs gcs-bucket s3 s3-storage synchronization upload

Last synced: 06 Jul 2025

https://github.com/drittich/semanticslicer

🧠✂️ SemanticSlicer — A smart text chunker for LLM-ready documents.

ai azure-openai chat-gpt chatgpt chunker chunking embeddings gpt gpt-4 langchain llm openai text-chunking

Last synced: 21 Aug 2025

https://github.com/duriantaco/pykomodo

A Python-based parallel file chunking system designed for processing large codebases into LLM-friendly chunks.

chunking llm python python3

Last synced: 13 Apr 2025

https://github.com/nftstorage/carbites

***Notice: This repository is no longer maintained.*** 🚗 🚙 🚕 Chunking for CAR files. Split a single CAR into multiple CARs.

car chunking cid ipld multiformats splitting

Last synced: 14 Jul 2025

https://github.com/zabuzard/fastcdc4j

Fast and efficient content-defined chunking for data deduplication. Java implementation of FastCDC as library.

cdc chunking content-defined-chunking data-deduplication fastcdc java library

Last synced: 05 Mar 2026

https://github.com/indyjo/cafs

Content-Addressable File System (used by BitWrk)

chunking deduplication download http rolling-hash synchronization upload

Last synced: 27 Dec 2025

https://github.com/remram44/cdchunking-rs

Content-Defined Chunking for Rust

chunk chunking rolling-hash-functions rust

Last synced: 17 Mar 2025

https://github.com/saltyrtc/chunked-dc-js

Binary chunking that can be reassembled out-of-order.

chunking javascript saltyrtc webrtc-datachannels

Last synced: 31 Jul 2025

https://github.com/jchunk-io/jchunk

JChunk is a lightweight and flexible library designed to provide multiple strategies for text chunking within Java applications

chunk chunking etl-pipeline java rag text-splitter text-splitting

Last synced: 10 Mar 2026

https://github.com/gregorbiswanger/semanticchunker.net

Embedding-driven, context-aware text chunking for Semantic Kernel and RAG workflows in .NET

ai chunking csharp dotnet embedding library llm rag semantic-kernel semanticchunker semantickernel slm text-chunking

Last synced: 12 Sep 2025

https://github.com/iprit/md-svg-vue

Material design icons by Google for Vue.js & Nuxt.js (server side support & inline svg with path)

bundling chunking icons inline-svg material-design server-side-rendering svg svg-icons vue vuejs2

Last synced: 13 Sep 2025

https://github.com/yaroslav/inkmark

A very fast, feature-packed, AI-first Markdown (CommonMark/GFM) gem for Ruby, based on pulldown-cmark (Rust).

ai chunking commonmark commonmark-parsing llm markdown markdown-language markdown-parser markdown-to-html pulldown-cmark rag ruby rubyonrails rust

Last synced: 26 May 2026

https://github.com/skerkour/go-benchmarks

Comprehensive and reproducible benchmarks for Go developers and architects.

benchmark benchmarking benchmarks cdc chunking go golang hash hashing

Last synced: 12 Apr 2025

https://github.com/fd0/split

Split large files into smaller ones using deterministic Content Defined Chunking

cdc chunking data split

Last synced: 18 Aug 2025

https://github.com/KernelPanic92/ngx-fastboot

ngx-fastboot is an Angular library designed to dynamically load configuration settings at runtime, optimizing application startup performance by offloading configurations to a separate compilation chunk.

angular boot chunk chunking configuration dynamic fastboot lazy npm performance providers typescript

Last synced: 05 Mar 2025

https://github.com/cckalen/intellichunk

Go Based Lightweight RAG / LLM Tool with CLI + API

ai apiserver chatbot chunking command-line-tool langchain llm

Last synced: 30 Dec 2025

https://github.com/kernelpanic92/ngx-fastboot

ngx-fastboot is an Angular library designed to dynamically load configuration settings at runtime, optimizing application startup performance by offloading configurations to a separate compilation chunk.

angular boot chunk chunking configuration dynamic fastboot lazy npm performance providers typescript

Last synced: 13 Oct 2025

https://github.com/gene-hightower/ghsmtp

Gene's SMTP server — receive Internet mail with less fuss

c-plus-plus chunking cpp cpp17 dkim dmarc rfc-5321 smtp smtp-client smtp-protocol smtp-server smtpd spf tls-support utf-8 utf8

Last synced: 24 Apr 2025

https://github.com/lelserslasers/minecraft

Minecraft clone with an infinite world generated from 3d perlin noise (no game engine)

3d chunking cpp infinite-world itch-io minecraft perlin-noise perlin-noise-3d raylib voxel voxel-engine

Last synced: 14 Oct 2025

https://github.com/adamfoneil/chunkupload

Library for implementing chunked uploads to Azure blob storage. Intended for use with DropzoneJS.

azure-storage chunking dropzonejs uploader

Last synced: 15 Jul 2025

https://github.com/jmaczan/bpe-tokenizer

Byte-Pair Encoding tokenizer for training large language models on huge datasets

bpe bpe-tokenizer byte-pair-encoding chunking deep-learning from-scratch large-language-models llm machine-learning python tokenizer

Last synced: 18 Sep 2025

https://github.com/lh0x00/docsifer

Docsifer is a powerful tool for converting various data formats into Markdown for applications such as indexing, text analysis, and more. It supports PDF, PowerPoint, Word, Excel, Images, Audio, HTML, and other text-based formats, and leverages LLMs to enhance performance.

analysis autogen chunking docsier documents emeddings indexing langchain llama-index markdown markitdown rag text-embeddings text-processing vector-database

Last synced: 23 Apr 2025

https://github.com/shelfio/array-chunk-by-size

Chunk array of objects by their size in JSON

arrays chunk chunking node-module npm-package splitting

Last synced: 25 Jun 2025

https://github.com/zchunk/zchunk-java

A java-native implementation of the zchunk file format

chunking compression datasaving file-transfer format

Last synced: 23 May 2026

https://github.com/gursv/url-summ

A URL summarizer, which summarizes the content of a URL with proper formatting. It uses 'sshleifer/distilbart-cnn-12-6', which is a distilled version of the BART model, specifically optimized for text summarization tasks, including CNN summarization.

ai beautifulsoup chunking formatted-text huggingface-models python3 smtp star-rating streamlit text-extraction text-summarization transformers url-summarization

Last synced: 23 Apr 2025

https://github.com/mg98/ae-chunker-go

Go implementation of the AE chunking algorithm.

chunking chunking-algorithm go golang

Last synced: 25 Feb 2026

https://github.com/kathleenwest/filemanagerdemo

(File Manager – A Demo of a WCF Self-Hosted Service & Client "Tester" Windows Form Application Exchanging Files) This project presents a simple File Manager Service and Client Application demonstration. The File Manager is a self-hosted (service host) WCF application launched and managed with a simple console interface. The client “tester” has a simplified GUI user interface to quickly demo and test the service (Windows Form Application).

chunk chunking csharp csharp-code csharp-library file-management file-manager file-manager-application file-server file-sharing file-transfer file-upload filemanager filemanager-ui stream streaming wcf wcf-client wcf-service wcf-service-client-demo

Last synced: 25 Jul 2025

https://github.com/lynixtaxic/docsifer

Docsifer is a powerful tool for converting various data formats into Markdown for applications such as indexing, text analysis, and more. It supports PDF, PowerPoint, Word, Excel, Images, Audio, HTML, and other text-based formats, and leverages LLMs to enhance performance.

analysis autogen chunking docsier documents emeddings indexing langchain llama-index markdown markitdown rag text-embeddings text-processing vector-database

Last synced: 23 Apr 2025

https://github.com/akshayxml/google-file-system

Implemented Google File System from its research paper.

chunking distributed-systems file-sharing filesystem google-file-system python3 replication

Last synced: 17 May 2026

https://github.com/jonahwhaler/llm-agent-toolkit

LLM Agent Toolkit provides minimal, modular interfaces for core components in LLM-based applications.

agent chromadb chunking faiss llm modular-design ollama openai python tool-calling toolkit vision

Last synced: 05 May 2025

https://github.com/marlon360/nifty-uploader

⬆️ An easy file uploader for the Browser written in TypeScript

chunking javascript typescript uploader

Last synced: 12 Jul 2025

https://github.com/saltyrtc/chunked-dc-swift

Binary chunking that can be reassembled out-of-order.

chunking network-programming

Last synced: 15 May 2026

https://github.com/xyproto/projectinfo

Given a directory of source code, find the project name, contributors, collect the source code and output it all in JSON chunks with an upper token limit

chunking go project-info

Last synced: 04 Jan 2026

https://github.com/simon-zerisenay/42_push_swap

Pushswap is a 42 project emphasizing efficient sorting by minimizing operations. Participants use a limited set of commands to manipulate stacks and achieve the desired sorted order, showcasing algorithm design and optimization skills while developing problem-solving abilities.

42 42pushswap c chunking cprogramming ecole42 linkedlist midpoint pushswap sorting-algorithms stacks struct

Last synced: 18 Oct 2025

https://github.com/atayahmet/blobify

A Javascript automation tool to convert data (file, image etc.) to blob object and vice-versa.

blob blob-files blob-image chunking data-chunk inmemory-cache

Last synced: 22 Mar 2025

https://github.com/craigwardman/chunkingredisclient

A C# library which implements various wrappers around the StackExchange.Redis client, specifically using Newtonsoft.Json serialisation; Such as streamed reading/writing and sliding expiration.

chunking csharp extensions json net-core newtonsoft-json redis redis-client stackexchange-redis wrapper-library

Last synced: 19 Aug 2025

https://github.com/chonkie-inc/mtcb

🤔 wondering if your chunks are good? 🦉 Judie is here to Judge and Evaluate your Chunks! ✨

ai benchmarking chunk chunking judge llm-evaluation observability rag

Last synced: 10 Mar 2026

https://github.com/print3m/chunkmap

ChunkMap is a command-line tool to split large Nmap scans into savable chunks.

chunking command-line-tool nmap nmap-automation nmap-script port-scanner portscanner python python-script

Last synced: 25 May 2026

https://github.com/mirpo/chopdoc

A tool to split documents into chunks for RAG and LLM applications

chunking data-engineering filtering gemini llm openai pipeline rag

Last synced: 02 May 2026

https://github.com/acj/file-chunker

Divide a file into evenly-sized chunks

chunking concurrency parallel text-processing

Last synced: 24 Feb 2026

https://github.com/parthapray/docling_rag_langchain_colab

This repo contains code for RAG using Docling in a Colab notebook with LangChain, Milvus, a Hugging Face embedding model, and an LLM

all-minilm-l6-v2 chunking colab-notebook docling huggingface langchain large-language-models milvus pdf retrieval-augmented-generation sentence-transformers

Last synced: 18 May 2026

https://github.com/rse/chunking

Simple Task Chunking

chunking rate-limiting task throttling

Last synced: 16 Jul 2025

https://github.com/abitofhelp/optimized_adaptive_pipeline_rs

Adaptive Rust pipeline for high-throughput file processing—dynamic chunking, parallelism, AES/ChaCha encryption, backpressure, and Prometheus/tracing.

adaptive-concurrency backpressure chunking concurrency data-pipeline encryption file-processing metrics observability opentelemetry parallelism prometheus rust stream-processing tracing

Last synced: 05 Oct 2025

https://github.com/stevewyl/chunk_segmentor

Word Segmentor with Noun Phrase based on HanLP

chunking keras segmentation

Last synced: 15 May 2026

https://github.com/parthapray/docling_colab

This repo contains a Google Colab notebook for handling Docling for extraction of data such as text, images, tables, etc.

chunk chunking colab-notebook docling docx embed extraction-data image lancedb markdown pdf pptx retrieval-augmented-generation table text transformers

Last synced: 16 May 2026

https://github.com/abitofhelp/adaptive_pipeline

Adaptive Rust pipeline for high-throughput file processing—dynamic chunking, parallelism, AES/ChaCha encryption, backpressure, and Prometheus/tracing.

adaptive-concurrency backpressure chunking concurrency data-pipeline encryption file-processing metrics observability opentelemetry parallelism prometheus rust stream-processing tracing

Last synced: 17 May 2026

https://github.com/sanix-darker/split

A FROM SCRATCH module able to decompose and recompose a file based on a map-JSON-schema built using md5 and Base64.

block chunk chunking chunks map split splitting

Last synced: 19 Jan 2026

https://github.com/saltyrtc/chunked-dc-java

Binary chunking that can be reassembled out-of-order.

chunking java saltyrtc

Last synced: 26 Feb 2025

https://github.com/leo310/rag-chunking-evaluation

Assess the effectiveness of chunking strategies in RAG systems via a custom evaluation framework.

chunking evaluation-framework retrieval retrieval-augmented-generation

Last synced: 22 Jan 2026

https://github.com/duit-foundation/chunk_norris

That’s not a kick… THIS is a kick! A simple pure Dart library for working with chunked JSON

chunking dart flutter json sse

Last synced: 18 Apr 2026

https://github.com/kimtth/rag-multimodal-semantic-chunking

🖼️📄E2E Multi-modal Document Preprocessing for Search Indexing with Azure Document Intelligence

azure-document-intelligence chunking image-understanding rag-preparation workshop

Last synced: 05 Aug 2025

https://github.com/yuma-shintani/chunksize-checker

Calculate the number of total tokens, optimal chunk size and chunk overlap from any given document.

chunking electron rag

Last synced: 10 May 2026

https://github.com/i-partalas/industrial-rag-qna-benchmark

Benchmarking the performance of proprietary vs open-source LLMs in industrial QnA tasks using various RAG-based implementations and evaluation metrics.

azureopenai benchmarking chromadb chunking docker huggingface langchain large-language-models llms-benchmarking metrics openai pytorch retrieval-augmented-generation streamlit synthetic-dataset-generation

Last synced: 28 Jan 2026

https://github.com/hamolicious/chunky

A chunking system for game development

chunking chunks library pypi pypi-package python

Last synced: 24 Feb 2025

https://github.com/ayush585/smartchunk

SmartChunk is a lightweight, structure-aware semantic chunking toolkit designed to supercharge RAG (Retrieval-Augmented Generation) and LLM pipelines. Unlike naive splitters that break text arbitrarily, SmartChunk respects document structure (headings, lists, tables, code blocks) and semantic flow, ensuring cleaner, more coherent chunks.

agentic-workflow chunking chunking-algorithm cli llm nlp package pip rag semantic

Last synced: 07 Sep 2025

https://github.com/zircote/rlm-rs-plugin

Claude Code plugin for processing documents 100x larger than context limits using the Recursive Language Model pattern. Rust-powered chunking, hybrid semantic + BM25 search, and sub-LLM orchestration.

ai-agents bm25 chunking claude-code claude-code-plugin document-processing hybrid-search llm long-context recursive-language-model rlm rust semantic-search sqlite

Last synced: 08 Apr 2026

https://github.com/nihar3453/llm-transformers-and-rag

A hands-on suite for exploring and fine-tuning foundation models (Transformers, BERT, GPT-2, BART) and end-to-end RAG pipelines with attention visualizations, semantic search (ChromaDB/Weaviate), LangChain workflow demos.

bart bert chromadb chunking generative-ai hnsw huggingface-transformers langchain llms minigpt rag transformers vector-database weav

Last synced: 14 Apr 2026

https://github.com/ziffan/chunklab

ChunkLab is a powerful browser-based sandbox designed for developers to test, visualize, and validate text chunking pipeline configurations. Optimize your RAG (Retrieval-Augmented Generation) ingestion process with real-time feedback and detailed metrics.

ai chunking data-preprocessing developer-tools embeddings fastapi llm nlp playground python rag react regex sandbox text-processing tiktoken tokenization vector-database

Last synced: 02 May 2026

https://github.com/nathadriele/acmr-rag-rename-mbausp

Final course project (Trabalho de Conclusão de Curso) for the MBA in Data Science and Analytics at USP/ESALQ, class of 2023. It develops an information retrieval system based on LLMs and RAG, applied to the RENAME list of essential medicines. The prototype uses embeddings, vector databases, and LangChain, with evaluation carried out using the RAGAS framework.

all-minilm-l6-v2 analytics chunking data-science gemma-2-9b-it genai groq langchain langchain-agent llama3 llm mixtral-8x7b pinecone postgresql rag ragas rename scraping streamlit usp

Last synced: 04 Apr 2026

https://github.com/jonathanfavorite/ragamuffin

A lightweight, cross-platform .NET library for building RAG (Retrieval-Augmented Generation) pipelines with local embedding models and SQLite vector storage. Perfect for developers who need privacy-focused, offline-capable document search and AI-powered question answering without external API dependencies.

ai chunking document-processing dotnet embedding-models fluent-api local-ai metadata ml nlp offline-ai onnx pdf-processing privacy-focused rag retrieval-augmented-generation semantic-search sqlite vector-database vector-search

Last synced: 02 Jun 2026