https://github.com/ahrefs/ocaml-sentencepiece
OCaml bindings to SentencePiece
- Host: GitHub
- URL: https://github.com/ahrefs/ocaml-sentencepiece
- Owner: ahrefs
- Created: 2023-10-06T11:52:05.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-11-22T09:39:10.000Z (about 1 year ago)
- Last Synced: 2024-12-16T19:17:42.400Z (about 1 month ago)
- Language: OCaml
- Size: 142 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# OCaml bindings to SentencePiece
[SentencePiece](https://github.com/google/sentencepiece) is an unsupervised text tokenizer.
Tested on [v0.1.99](https://github.com/google/sentencepiece/releases/tag/v0.1.99).
## Set up
```
sudo apt install libsentencepiece-dev
```
You also need a sentencepiece model ([example](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/resolve/main/sentencepiece.bpe.model)).
## Example
```ocaml
utop # open Sentencepiece;;
utop # #install_printer Processor.pp_int64_array1;;
utop # let p = Processor.load_model "sentencepiece.bpe.model" |> Result.get_ok;;
val p : Processor.t = <abstr>
utop # Processor.encode_int64_ids p "Hey there!";;
- : (int64, Bigarray.int64_elt, Bigarray.c_layout) Bigarray.Array1.t =
[|1; 28239; 2684; 37; 2; |]
utop # Processor.encode_pieces p "Hey there!";;
- : string list = ["<s>"; "▁Hey"; "▁there"; "!"; "</s>"]
utop # Processor.(decode_pieces p @@ encode_pieces p "Hey there!");;
- : string = "Hey there!"
```
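To make the round trip in the last two calls easier to follow, here is a rough pure-OCaml sketch of what decoding a list of pieces amounts to conceptually. It is not the library's implementation and does not call the bindings: `decode_pieces_sketch`, `replace_marker` and `is_control` are hypothetical names, and the sketch assumes the control pieces are `<s>` and `</s>` and that `▁` (U+2581) marks a preceding space.

```ocaml
(* Hypothetical helper, NOT part of ocaml-sentencepiece: a pure-OCaml
   approximation of how a list of pieces maps back to text. *)

(* U+2581, the SentencePiece whitespace marker, as UTF-8 bytes. *)
let space_marker = "\xE2\x96\x81"

(* Control pieces assumed for this sketch; the real set depends on the model. *)
let is_control piece = piece = "<s>" || piece = "</s>"

(* Replace every occurrence of [space_marker] in [s] with a plain space. *)
let replace_marker s =
  let buf = Buffer.create (String.length s) in
  let n = String.length s and m = String.length space_marker in
  let rec go i =
    if i >= n then ()
    else if i + m <= n && String.sub s i m = space_marker then begin
      Buffer.add_char buf ' ';
      go (i + m)
    end else begin
      Buffer.add_char buf s.[i];
      go (i + 1)
    end
  in
  go 0;
  Buffer.contents buf

(* Drop control pieces, join the rest, turn markers into spaces, and
   remove the single leading space produced by the first marker. *)
let decode_pieces_sketch pieces =
  let text =
    pieces
    |> List.filter (fun p -> not (is_control p))
    |> String.concat ""
    |> replace_marker
  in
  if String.length text > 0 && text.[0] = ' ' then
    String.sub text 1 (String.length text - 1)
  else text

let () =
  (* prints "Hey there!" *)
  print_endline
    (decode_pieces_sketch [ "<s>"; "▁Hey"; "▁there"; "!"; "</s>" ])
```

For real use, prefer `Processor.decode_pieces`, which relies on the loaded model rather than on these hard-coded assumptions.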