# @anush008/tokenizers

The official Node bindings are in limbo, with support for only a limited set of Node versions and architectures. This package provides multi-arch bindings for [@huggingface/tokenizers](https://github.com/huggingface/tokenizers), with support for Node v20.x.

## Supported platforms

- Windows x86_64
- Linux x86_64
- macOS aarch64/x86_64

## Installation

```bash
npm install @anush008/tokenizers
```

## Features

- Train new vocabularies and tokenize, using 4 pre-made tokenizers (BERT WordPiece and the 3 most common BPE versions).
- Extremely fast (both training and tokenization), thanks to the Rust implementation: tokenizing a GB of text takes less than 20 seconds on a server CPU.
- Easy to use, but also extremely versatile.
- Designed for both research and production.
- Normalization comes with alignment tracking: it is always possible to recover the part of the original sentence that corresponds to a given token.
- Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs (see the sketch below).
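
As a sketch of the pre-processing step, the snippet below configures truncation and padding before encoding. The `setTruncation` and `setPadding` calls are assumptions, mirroring the upstream @huggingface/tokenizers Node bindings; check this package's type definitions for the exact names and signatures.

```ts
import { Tokenizer } from "@anush008/tokenizers";

// Load a tokenizer serialized with the Python/Rust library.
const tokenizer = await Tokenizer.fromFile("tokenizer.json");

// Assumed API, mirroring the upstream bindings:
// truncate every sequence to at most 128 tokens...
tokenizer.setTruncation(128);
// ...and pad shorter sequences up to that same length.
tokenizer.setPadding({ maxLength: 128 });

const encoded = await tokenizer.encode("Who is John?");
console.log(encoded.getLength()); // 128 after padding
```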

## Basic example

```ts
import { Tokenizer } from "@anush008/tokenizers";

// Load a tokenizer serialized with the Python/Rust library.
const tokenizer = await Tokenizer.fromFile("tokenizer.json");
const wpEncoded = await tokenizer.encode("Who is John?");

// The Encoding object exposes the usual accessors.
console.log(wpEncoded.getLength());
console.log(wpEncoded.getTokens());
console.log(wpEncoded.getIds());
console.log(wpEncoded.getAttentionMask());
console.log(wpEncoded.getOffsets());
console.log(wpEncoded.getOverflowing());
console.log(wpEncoded.getSpecialTokensMask());
console.log(wpEncoded.getTypeIds());
console.log(wpEncoded.getWordIds());
```
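
To map token ids back to text, the bindings should expose a `decode` method like the upstream library's; the call below is a sketch under that assumption.

```ts
// Assumed to mirror the upstream decode(ids, skipSpecialTokens) signature.
const text = await tokenizer.decode(wpEncoded.getIds(), true);
console.log(text); // exact output depends on the tokenizer's normalizer
```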

## License

[MIT](LICENSE)