https://github.com/preciz/chunx

Elixir library for splitting text into meaningful chunks using various strategies

Host: GitHub
URL: https://github.com/preciz/chunx
Owner: preciz
License: mit
Created: 2024-12-05T05:43:00.000Z (6 months ago)
Default Branch: master
Last Pushed: 2025-01-29T18:18:05.000Z (4 months ago)
Last Synced: 2025-05-11T11:49:53.662Z (24 days ago)
Language: Elixir
Homepage:
Size: 31.3 KB
Stars: 10
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Chunx

[![test](https://github.com/preciz/chunx/actions/workflows/test.yml/badge.svg)](https://github.com/preciz/chunx/actions/workflows/test.yml)

Chunx is an Elixir library for splitting text into meaningful chunks using various strategies. It's particularly useful for processing large texts for LLMs, semantic search, and other NLP tasks.

## Credit

This library is based on [chonkie-ai/chonkie](https://github.com/chonkie-ai/chonkie).

## Features

- Multiple chunking strategies:
  - Token-based chunking
  - Word-based chunking
  - Sentence-based chunking
  - Semantic chunking with embeddings
- Configurable options for each strategy
- Support for overlapping chunks
- Token count tracking
- Embedding support

## Installation

Add `chunx` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:chunx, github: "preciz/chunx"}
  ]
end
```

## Usage

### Token-based Chunking

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
{:ok, chunks} = Chunx.Chunker.Token.chunk("Your text here", tokenizer, chunk_size: 512)
```

### Word-based Chunking

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
{:ok, chunks} = Chunx.Chunker.Word.chunk("Your text here", tokenizer, chunk_size: 512)
```

### Sentence-based Chunking

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
{:ok, chunks} = Chunx.Chunker.Sentence.chunk("Your text here", tokenizer)
```

### Semantic Chunking

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")

# The embedding function must return a list of Nx.Tensor.t()
embedding_fn = fn texts ->
  # Your embedding function here
end

{:ok, chunks} = Chunx.Chunker.Semantic.chunk("Your text here", tokenizer, embedding_fn)
```
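To give a feel for how a similarity `threshold` can drive semantic chunking, here is a minimal sketch in plain Elixir. It is not Chunx's implementation: the `ThresholdSketch` module and its functions are hypothetical, plain lists of floats stand in for `Nx.Tensor.t()` embeddings, and the rule (start a new group when the cosine similarity between neighbouring sentences drops below the threshold) is one common approach assumed here for illustration.

```elixir
defmodule ThresholdSketch do
  # Hypothetical illustration only; not part of Chunx.

  # Cosine similarity of two equal-length, non-zero vectors given as lists of floats.
  def cosine(a, b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end)) end
    dot / (norm.(a) * norm.(b))
  end

  # Groups sentences, starting a new group whenever the similarity between
  # a sentence and its predecessor falls below `threshold`.
  def group(sentences, embeddings, threshold) do
    case Enum.zip(sentences, embeddings) do
      [] ->
        []

      [{first, first_emb} | rest] ->
        {groups, current, _prev} =
          Enum.reduce(rest, {[], [first], first_emb}, fn {sentence, emb}, {groups, current, prev} ->
            if cosine(prev, emb) >= threshold do
              {groups, [sentence | current], emb}
            else
              {[Enum.reverse(current) | groups], [sentence], emb}
            end
          end)

        Enum.reverse([Enum.reverse(current) | groups])
    end
  end
end

sentences = ["Cats purr.", "Kittens meow.", "Stocks fell."]
# Made-up 2-dimensional embeddings, chosen so the first two sentences are close.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

IO.inspect(ThresholdSketch.group(sentences, embeddings, 0.5))
# => [["Cats purr.", "Kittens meow."], ["Stocks fell."]]
```

A higher threshold splits more aggressively, and a lower one merges more sentences into each group. See the `Chunx.Chunker.Semantic` module documentation for the behavior the library actually implements.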

## Configuration

Each chunking strategy accepts various options to customize the chunking behavior:

- `chunk_size`: Maximum number of tokens per chunk
- `chunk_overlap`: Number of tokens or percentage to overlap between chunks
- `min_sentences`: Minimum number of sentences per chunk (for sentence-based)
- `threshold`: Similarity threshold for semantic chunking
- And more...

See the documentation for each chunker module for detailed configuration options.
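As a rough illustration of how `chunk_size` and `chunk_overlap` interact, the sketch below slides a fixed-size window over a list of tokens. It is a standalone example, not Chunx's code: the `OverlapSketch` module and its functions are hypothetical, and treating a float overlap as a fraction of `chunk_size` is an assumption about how a percentage might be resolved.

```elixir
defmodule OverlapSketch do
  # Hypothetical illustration only; not part of Chunx.

  # Resolves an overlap given as a token count (integer) or, by assumption,
  # as a fraction of chunk_size (float in [0.0, 1.0)).
  def resolve_overlap(_chunk_size, overlap) when is_integer(overlap) and overlap >= 0,
    do: overlap

  def resolve_overlap(chunk_size, fraction)
      when is_float(fraction) and fraction >= 0.0 and fraction < 1.0,
      do: trunc(chunk_size * fraction)

  # Splits `tokens` into windows of at most `chunk_size` items, where each
  # window repeats the last `chunk_overlap` items of the previous one.
  def windows(tokens, chunk_size, chunk_overlap)
      when is_list(tokens) and chunk_size > 0 and chunk_overlap >= 0 and
             chunk_overlap < chunk_size do
    do_windows(tokens, chunk_size, chunk_size - chunk_overlap, [])
  end

  defp do_windows([], _size, _step, acc), do: Enum.reverse(acc)

  defp do_windows(tokens, size, step, acc) do
    case Enum.split(tokens, size) do
      # Nothing left beyond this window: it is the final chunk.
      {window, []} -> Enum.reverse([window | acc])
      # Otherwise advance by `step`, so `size - step` items are shared.
      {window, _rest} -> do_windows(Enum.drop(tokens, step), size, step, [window | acc])
    end
  end
end

IO.inspect(OverlapSketch.windows(Enum.to_list(1..10), 4, 1))
# => [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]]

IO.inspect(OverlapSketch.resolve_overlap(512, 0.25))
# => 128
```

The window advances by `chunk_size - chunk_overlap` items each step, which is why the overlap must be smaller than the chunk size.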

## Testing

```sh
# Run the test suite
mix test
```

## License

[MIT License](LICENSE)
