https://github.com/mirpo/chopdoc
A tool to split documents into chunks for RAG and LLM applications
chunking llm rag
- Host: GitHub
- URL: https://github.com/mirpo/chopdoc
- Owner: mirpo
- License: mit
- Created: 2025-01-25T18:43:12.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-03-06T19:26:43.000Z (about 2 months ago)
- Last Synced: 2025-03-06T20:23:22.519Z (about 2 months ago)
- Topics: chunking, llm, rag
- Language: Go
- Homepage:
- Size: 96.7 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# chopdoc
A command-line tool for splitting documents into chunks, optimized for RAG (Retrieval-Augmented Generation) and LLM applications.
## Features
- Supports chunking methods: characters, words, sentences, recursive, markdown.
- Configurable chunk size and overlap
- Text cleaning and normalization
- JSONL output format
- Supported formats: txt (or any plain text)
## Installation
[Homebrew](https://brew.sh/):
```shell
brew tap mirpo/homebrew-tools
brew install chopdoc
```
Using `go install`:
```shell
go install github.com/mirpo/chopdoc@latest
```
### Local Build
```shell
git clone https://github.com/mirpo/chopdoc.git
cd chopdoc
make build
```
## Usage
```bash
chopdoc -input pg_essay.txt -output chunks.jsonl -size 1000 -clean aggressive
chopdoc -input pg_essay.txt -output chunks.jsonl -size 1000 -overlap 100
chopdoc -input pg_essay.txt -output chunks.jsonl -size 1000 -overlap 100 -method char -clean aggressive
chopdoc -input pg_essay.txt -output chunks.jsonl -size 1000 -overlap 100 -method word
chopdoc -input pg_essay.txt -output chunks.jsonl -size 10 -overlap 1 -method sentence
chopdoc -input pg_essay.txt -output chunks.jsonl -size 100 -overlap 0 -method recursive
chopdoc -input pg_essay.txt -output chunks.jsonl -method markdown -strip-headers
chopdoc -input pg_essay.txt -output chunks.jsonl -method markdown -headers 1-2 -add-metadata
```
chopdoc can be piped:
```bash
cat pg_essay.txt | chopdoc -size 1 -method sentence
cat pg_essay.txt | chopdoc -size 1 -method sentence > piped.jsonl
cat pg_essay.txt | chopdoc -size 1 -method sentence -output output_as_arg.jsonl
```
### Options
```shell
  -add-metadata
        Include header metadata in output (default false, markdown method only)
  -clean string
        Cleaning mode: none, normal, aggressive (default "none")
  -headers string
        Header levels to use for markdown method (e.g. 1-6, 2-4) (default "1-6")
  -input string
        Input file path
  -method string
        Chunking method: char, word, sentence, recursive, markdown (default "char")
  -output string
        Output file path (must end with .jsonl)
  -overlap int
        Overlap size in characters
  -size int
        Chunk size in characters (default 1000)
  -strip-headers
        Remove headers from content (default false, markdown method only)
  -version
        Get current version of chopdoc
```
### Output Format
Each chunk is written as a JSON line:
```json
{"chunk": "content here"}
```
## Contributing
1. Fork the repository
2. Create your feature branch
3. Run tests: `go test ./...`
4. Submit a pull request
## License
MIT