https://github.com/preciz/chunx
Elixir library for splitting text into meaningful chunks using various strategies
- Host: GitHub
- URL: https://github.com/preciz/chunx
- Owner: preciz
- License: MIT
- Created: 2024-12-05T05:43:00.000Z (6 months ago)
- Default Branch: master
- Last Pushed: 2025-01-29T18:18:05.000Z (4 months ago)
- Last Synced: 2025-05-11T11:49:53.662Z (24 days ago)
- Language: Elixir
- Size: 31.3 KB
- Stars: 10
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Chunx
[](https://github.com/preciz/chunx/actions/workflows/test.yml)
Chunx is an Elixir library for splitting text into meaningful chunks using various strategies. It's particularly useful for processing large texts for LLMs, semantic search, and other NLP tasks.
## Credit
This library is based on [chonkie-ai/chonkie](https://github.com/chonkie-ai/chonkie).
## Features
- Multiple chunking strategies:
  - Token-based chunking
  - Word-based chunking
  - Sentence-based chunking
  - Semantic chunking with embeddings
- Configurable options for each strategy
- Support for overlapping chunks
- Token count tracking
- Embedding support

## Installation
Add `chunx` to your list of dependencies in `mix.exs`:
```elixir
def deps do
  [
    {:chunx, github: "preciz/chunx"}
  ]
end
```

## Usage
### Token-based Chunking
```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
{:ok, chunks} = Chunx.Chunker.Token.chunk("Your text here", tokenizer, chunk_size: 512)
```

### Word-based Chunking
```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
{:ok, chunks} = Chunx.Chunker.Word.chunk("Your text here", tokenizer, chunk_size: 512)
```

### Sentence-based Chunking
```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
{:ok, chunks} = Chunx.Chunker.Sentence.chunk("Your text here", tokenizer)
```

### Semantic Chunking
```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")

# The embedding function must return a list of Nx.Tensor.t()
embedding_fn = fn texts ->
  # Your embedding function here
end

{:ok, chunks} = Chunx.Chunker.Semantic.chunk("Your text here", tokenizer, embedding_fn)
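# For illustration only (a hypothetical stand-in, not part of Chunx): an
# embedding function that returns one fixed 384-dimensional Nx tensor per
# text, useful for exercising the pipeline without a real embedding model.
illustrative_embedding_fn = fn texts ->
  Enum.map(texts, fn _text -> Nx.broadcast(0.0, {384}) end)
end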
```

## Configuration
Each chunking strategy accepts various options to customize the chunking behavior:
- `chunk_size`: Maximum number of tokens per chunk
- `chunk_overlap`: Number of tokens or percentage to overlap between chunks
- `min_sentences`: Minimum number of sentences per chunk (for sentence-based)
- `threshold`: Similarity threshold for semantic chunking
- And more...

See the documentation for each chunker module for detailed configuration options.
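Options are passed as a keyword list in the final argument. A sketch combining several of the options above (the option names come from the list above; the values are illustrative, not recommended defaults):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")

{:ok, chunks} =
  Chunx.Chunker.Sentence.chunk("Your text here", tokenizer,
    chunk_size: 512,
    chunk_overlap: 64,
    min_sentences: 2
  )
```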
## Testing
```shell
# Run the test suite
mix test
```

## License
[MIT License](LICENSE)