https://github.com/manticoresoftware/php-ext-tokenizer
A tokenizer library for PHP to use HuggingFace tokenizers
- Host: GitHub
- URL: https://github.com/manticoresoftware/php-ext-tokenizer
- Owner: manticoresoftware
- Created: 2024-03-23T07:59:12.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2024-03-23T08:15:51.000Z (about 2 years ago)
- Last Synced: 2025-01-16T11:37:10.779Z (over 1 year ago)
- Language: Rust
- Size: 13.7 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# php-ext-tokenizer
This extension exposes the HuggingFace tokenizer library to PHP, so you can tokenize any text using a tokenizer definition supplied as a JSON file.
## How to Use
Here's an example of how to use it:
```php
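<?php
$tokenizer = new Manticore\Ext\Tokenizer('tokenizer.json'); // assumed: the constructor takes the path to the tokenizer JSON file
var_dump($tokenizer->tokenize("Hello world"));
var_dump($tokenizer->encode("Hello world"));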
```
This prints the tokens and their encoded IDs for the given tokenizer configuration and input text:
```text
array(2) {
[0]=>
string(5) "hello"
[1]=>
string(5) "world"
}
array(2) {
[0]=>
int(7592)
[1]=>
int(2088)
}
```
## How to Build
You need `cargo`, the Rust build tool, installed to build the extension.
```bash
cargo build --release
```
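After the build is complete, load the resulting shared library into PHP with the `extension` directive, as with any other extension. The following check confirms that the class is available: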
```bash
$ php -d extension=target/release/libphp_ext_tokenizer.so -r 'var_dump(class_exists("Manticore\Ext\Tokenizer"));'
bool(true)
```
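To load the extension on every run instead of passing `-d` each time, the same `extension` directive can go in `php.ini`. This is a minimal sketch: the absolute path below is a placeholder, not taken from the project, and must point at wherever your build placed `libphp_ext_tokenizer.so`.

```ini
; Placeholder path (assumption): replace with the real location of the built library.
extension=/path/to/php-ext-tokenizer/target/release/libphp_ext_tokenizer.so
```

Running the `class_exists` check above without the `-d` flag should then also report `bool(true)`, provided the PHP CLI reads this `php.ini`.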