https://github.com/murrellgroup/huggingfacetokenizers.jl
Julia wrapper of a Python wrapper of a Rust library
- Host: GitHub
- URL: https://github.com/murrellgroup/huggingfacetokenizers.jl
- Owner: MurrellGroup
- License: MIT
- Created: 2024-11-22T21:08:07.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-15T09:19:54.000Z (over 1 year ago)
- Last Synced: 2025-01-15T11:14:17.651Z (over 1 year ago)
- Language: Julia
- Homepage:
- Size: 17.6 KB
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# HuggingFaceTokenizers.jl
[Build Status](https://github.com/MurrellGroup/HuggingFaceTokenizers.jl/actions/workflows/CI.yml?query=branch%3Amain)
[Coverage](https://codecov.io/gh/MurrellGroup/HuggingFaceTokenizers.jl)
Rudimentary Julia bindings for [🤗 Tokenizers](https://github.com/huggingface/tokenizers), providing fast and easy-to-use tokenization through Python interop.
## Installation
From the Julia REPL, enter Pkg mode with `]` and add the package by name:
```
add HuggingFaceTokenizers
```
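In scripts or CI jobs where the interactive Pkg mode is not available, the same installation can be done through Julia's `Pkg` API. This is a minimal sketch, assuming the package resolves by name as in the command above:

```julia
using Pkg

# Equivalent to `add HuggingFaceTokenizers` in Pkg mode
Pkg.add("HuggingFaceTokenizers")

# Loading the package confirms that the installation worked
using HuggingFaceTokenizers
```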
## Usage
### Loading a Tokenizer
You can load a tokenizer either from a pre-trained model or from a saved file:
```julia
using HuggingFaceTokenizers
# Load a pre-trained tokenizer
tokenizer = from_pretrained(Tokenizer, "bert-base-uncased")
# Alternatively, specify a revision and an auth token
tokenizer = from_pretrained(Tokenizer, "bert-base-uncased", "main", nothing)
# Or load from a file
tokenizer = from_file(Tokenizer, "path/to/tokenizer.json")
```
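To keep an auth token out of source code, it can be read from the environment and passed in the same positional slot. The sketch below assumes the token lives in an environment variable named `HF_TOKEN` (a naming choice made here, not something this package requires) and that the wrapper accepts a string wherever it accepts `nothing`:

```julia
using HuggingFaceTokenizers

# Assumption: the token is stored in an environment variable named HF_TOKEN.
# `get` falls back to `nothing` when the variable is unset, which matches
# the `nothing` passed in the example above.
token = get(ENV, "HF_TOKEN", nothing)

# Same positional arguments as above: identifier, revision, auth token
tokenizer = from_pretrained(Tokenizer, "bert-base-uncased", "main", token)
```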
### Basic Operations
#### Single Text Processing
```julia
# Encode a single text
text = "Hello, how are you?"
result = encode(tokenizer, text)
println("Tokens: ", result.tokens)
println("IDs: ", result.ids)
# Decode back to text
decoded_text = decode(tokenizer, result.ids)
println("Decoded: ", decoded_text)
```
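Decoding does not necessarily reproduce the input exactly, since a tokenizer may normalize text (for example, an uncased model may lowercase it). The following sketch uses only the calls shown above to check the round trip rather than assume it:

```julia
using HuggingFaceTokenizers

tokenizer = from_pretrained(Tokenizer, "bert-base-uncased")
text = "Hello, how are you?"
result = encode(tokenizer, text)

# Tokens and IDs should line up one-to-one
@assert length(result.tokens) == length(result.ids)

# Compare the decoded text against the original input
decoded = decode(tokenizer, result.ids)
if decoded == text
    println("Exact round trip")
else
    println("Round trip changed the text: ", decoded)
end
```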
#### Batch Processing
```julia
# Encode multiple texts at once
texts = ["Hello, how are you?", "I'm doing great!"]
batch_results = encode_batch(tokenizer, texts)
# Each result contains tokens and ids
for (i, result) in enumerate(batch_results)
    println("Text $i:")
    println("  Tokens: ", result.tokens)
    println("  IDs: ", result.ids)
end
# Decode multiple sequences at once
ids_batch = [result.ids for result in batch_results]
decoded_texts = decode_batch(tokenizer, ids_batch)
```
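Texts of different lengths usually produce ID vectors of different lengths. If downstream code needs a rectangular array, the vectors can be padded in plain Julia. `pad_ids` below is a hypothetical helper, not part of HuggingFaceTokenizers.jl, and the IDs and the padding value `0` are placeholders; the correct padding ID depends on the tokenizer's vocabulary.

```julia
# Hypothetical helper, not part of HuggingFaceTokenizers.jl.
# Pads each ID vector to the longest length and returns a matrix
# with one column per sequence.
function pad_ids(ids_batch::AbstractVector{<:AbstractVector{<:Integer}}; pad_id::Integer = 0)
    maxlen = maximum(length, ids_batch; init = 0)
    padded = fill(Int(pad_id), maxlen, length(ids_batch))
    for (col, ids) in enumerate(ids_batch)
        padded[1:length(ids), col] .= ids
    end
    return padded
end

# Placeholder IDs standing in for `[result.ids for result in batch_results]`
ids_batch = [[1, 2, 3], [4, 5, 6, 7, 8]]
padded = pad_ids(ids_batch)
println(size(padded))
```

Storing one sequence per column keeps each sequence contiguous in memory, because Julia arrays are column-major.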
### Saving a Tokenizer
```julia
# Save the tokenizer to a file
save(tokenizer, "path/to/save/tokenizer.json")
```
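A saved tokenizer can be checked by loading it back with `from_file` and comparing its output against the original. This sketch uses only the functions shown in this README plus `mktempdir` and `joinpath` from Julia's standard library; the file name is arbitrary.

```julia
using HuggingFaceTokenizers

tokenizer = from_pretrained(Tokenizer, "bert-base-uncased")

# Work in a temporary directory that is removed when the block exits
mktempdir() do dir
    path = joinpath(dir, "tokenizer.json")
    save(tokenizer, path)

    # Reload from disk and confirm both tokenizers produce the same IDs
    reloaded = from_file(Tokenizer, path)
    text = "Hello, how are you?"
    @assert encode(reloaded, text).ids == encode(tokenizer, text).ids
    println("Reloaded tokenizer matches the original")
end
```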