https://github.com/anush008/tokenizers

Multi-arch bindings for @huggingface/tokenizers.
- Host: GitHub
- URL: https://github.com/anush008/tokenizers
- Owner: Anush008
- License: MIT
- Created: 2023-09-16T14:21:39.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-09-17T07:33:59.000Z (over 1 year ago)
- Last Synced: 2024-10-27T22:51:37.543Z (8 months ago)
- Topics: huggingface, tokenizers
- Language: Rust
- Homepage:
- Size: 893 KB
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# @anush008/tokenizers
The official Node bindings are in limbo, with limited support for Node versions and architectures. This package offers multi-arch bindings for [@huggingface/tokenizers](https://github.com/huggingface/tokenizers), with Node v20.x supported.
## Supports:
> * Windows x86_64
> * Linux x86_64
> * macOS aarch64/x86_64

## Installation
```bash
npm install @anush008/tokenizers
```

## Features
- Train new vocabularies and tokenize using 4 pre-made tokenizers (BERT WordPiece and the 3 most common BPE versions).
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignments tracking. It's always possible to get the part of the
original sentence that corresponds to a given token.
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs (see the sketch below).
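Truncation and padding can be configured up front on the `Tokenizer`. The following is a minimal sketch, assuming these bindings mirror the upstream @huggingface/tokenizers Node API; the `setTruncation`/`setPadding` method and option names are taken from the upstream bindings and are not verified against this package:

```ts
import { Tokenizer } from "@anush008/tokenizers";

// Load a serialized tokenizer from disk.
const tokenizer = await Tokenizer.fromFile("tokenizer.json");

// Truncate every sequence to at most 128 tokens and pad shorter
// ones up to that length (method and option names assumed from the
// upstream @huggingface/tokenizers Node bindings).
tokenizer.setTruncation(128);
tokenizer.setPadding({ maxLength: 128 });

// Special tokens such as [CLS]/[SEP] are added by the tokenizer's
// post-processor during encoding.
const encoded = await tokenizer.encode("Who is John?");
console.log(encoded.getTokens());
console.log(encoded.getAttentionMask());
```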
## Basic example

```ts
import { Tokenizer } from "@anush008/tokenizers";

const tokenizer = await Tokenizer.fromFile("tokenizer.json");
const wpEncoded = await tokenizer.encode("Who is John?");

console.log(wpEncoded.getLength());
console.log(wpEncoded.getTokens());
console.log(wpEncoded.getIds());
console.log(wpEncoded.getAttentionMask());
console.log(wpEncoded.getOffsets());
console.log(wpEncoded.getOverflowing());
console.log(wpEncoded.getSpecialTokensMask());
console.log(wpEncoded.getTypeIds());
console.log(wpEncoded.getWordIds());
```
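Going the other way, token ids can be mapped back to text. A hedged sketch, assuming these bindings expose the same `decode` method as the upstream @huggingface/tokenizers Node bindings:

```ts
import { Tokenizer } from "@anush008/tokenizers";

const tokenizer = await Tokenizer.fromFile("tokenizer.json");
const encoded = await tokenizer.encode("Who is John?");

// Round-trip the ids back into a string; the second argument skips
// special tokens like [CLS]/[SEP] (signature assumed from the
// upstream Node bindings).
const text = await tokenizer.decode(encoded.getIds(), true);
console.log(text);
```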
## License

[MIT](LICENSE)