https://github.com/mirpo/chopdoc
A tool to split documents into chunks for RAG and LLM applications
chunking data-engineering filtering gemini llm openai pipeline rag
- Host: GitHub
- URL: https://github.com/mirpo/chopdoc
- Owner: mirpo
- License: mit
- Created: 2025-01-25T18:43:12.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2026-04-19T13:37:34.000Z (about 2 months ago)
- Last Synced: 2026-04-19T15:33:01.750Z (about 2 months ago)
- Topics: chunking, data-engineering, filtering, gemini, llm, openai, pipeline, rag
- Language: Go
- Homepage:
- Size: 117 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# chopdoc
A command-line tool for splitting documents into chunks, optimized for RAG (Retrieval-Augmented Generation) and LLM applications.
## Features
- Chunking methods (`-method`): `char`, `word`, `sentence`, `recursive`, `markdown`
- Configurable chunk size and overlap
- Text cleaning and normalization
- JSONL output format
- Supported formats: txt (or any plain text)
## Installation
Using [Homebrew](https://brew.sh/):
```shell
brew tap mirpo/homebrew-tools
brew install chopdoc
```
Using `go install`:
```shell
go install github.com/mirpo/chopdoc@latest
```
### Local Build
```shell
git clone https://github.com/mirpo/chopdoc.git
cd chopdoc
make build
```
## Usage
```bash
chopdoc -input pg_essay.txt -output chunks.jsonl -size 1000 -clean aggressive
chopdoc -input pg_essay.txt -output chunks.jsonl -size 1000 -overlap 100
chopdoc -input pg_essay.txt -output chunks.jsonl -size 1000 -overlap 100 -method char -clean aggressive
chopdoc -input pg_essay.txt -output chunks.jsonl -size 1000 -overlap 100 -method word
chopdoc -input pg_essay.txt -output chunks.jsonl -size 10 -overlap 1 -method sentence
chopdoc -input pg_essay.txt -output chunks.jsonl -size 100 -overlap 0 -method recursive
chopdoc -input pg_essay.txt -output chunks.jsonl -method markdown -strip-headers
chopdoc -input pg_essay.txt -output chunks.jsonl -method markdown -headers 1-2 -add-metadata
```
chopdoc can be piped:
```bash
cat pg_essay.txt | chopdoc -size 1 -method sentence
cat pg_essay.txt | chopdoc -size 1 -method sentence > piped.jsonl
cat pg_essay.txt | chopdoc -size 1 -method sentence -output output_as_arg.jsonl
```
### Options
```shell
  -add-metadata
        Include header metadata in output (default false, markdown method only)
  -clean string
        Cleaning mode: none, normal, aggressive (default "none")
  -headers string
        Header levels to use for markdown method (e.g. 1-6, 2-4) (default "1-6")
  -input string
        Input file path
  -method string
        Default chunking method: char (default "char")
  -output string
        Output file path (must end with .jsonl)
  -overlap int
        Overlap size in characters
  -size int
        Chunk size in characters (default 1000)
  -strip-headers
        Remove headers from content (default false, markdown method only)
  -version
        Get current version of chopdoc
```
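To make the interaction between `-size` and `-overlap` concrete, here is a minimal Go sketch of fixed-size character chunking with overlap. It is an illustration under assumptions, not chopdoc's actual implementation: the function name `chunkChars` is hypothetical, and the sketch assumes each chunk starts `size - overlap` characters after the previous one and that the overlap must be smaller than the size.

```go
package main

import "fmt"

// chunkChars is a hypothetical helper, not part of chopdoc.
// It splits text into windows of at most size characters (runes);
// consecutive windows share overlap characters.
func chunkChars(text string, size, overlap int) []string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil // this sketch treats such parameters as invalid
	}
	runes := []rune(text) // count characters, not bytes
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the last window reached the end of the text
		}
	}
	return chunks
}

func main() {
	for i, c := range chunkChars("abcdefghij", 4, 1) {
		fmt.Printf("%d: %q\n", i, c)
	}
}
```

With a size of 4 and an overlap of 1, the ten-character input yields `"abcd"`, `"defg"`, and `"ghij"`: each chunk repeats the last character of the one before it.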
### Output Format
Each chunk is written as a JSON line:
```json
{"chunk": "content here"}
```
## Contributing
1. Fork the repository
2. Create your feature branch
3. Run tests: `go test ./...`
4. Submit a pull request
## License
MIT