https://github.com/lepisma/tokenizers.el
Fast tokenizers for Emacs Lisp backed by Huggingface’s rust library
https://github.com/lepisma/tokenizers.el
emacs-lisp rust tokenizers
Last synced: 3 months ago
JSON representation
Fast tokenizers for Emacs Lisp backed by Huggingface’s rust library
- Host: GitHub
- URL: https://github.com/lepisma/tokenizers.el
- Owner: lepisma
- Created: 2024-12-31T13:09:24.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-01-11T18:29:13.000Z (5 months ago)
- Last Synced: 2025-01-11T19:35:14.148Z (5 months ago)
- Topics: emacs-lisp, rust, tokenizers
- Language: Rust
- Homepage:
- Size: 16.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.org
Awesome Lists containing this project
README
#+TITLE: tokenizers.el
Fast tokenizers for Emacs Lisp backed by [[https://github.com/huggingface/tokenizers][Huggingface's rust library]].
To start with, this will support inference from pretrained models since that's
what needed in [[https://github.com/lepisma/onnx.el][onnx.el]].* Installation
There is a dynamic module that needs you to have cargo installed. This module is
automatically compiled when you install the package using something like this:#+begin_src emacs-lisp
(use-package tokenizers
:vc (:fetcher github :repo lepisma/tokenizers.el)
:demand t)
#+end_srcNote that this method only handles creating ~.so~ files which means it won't work
on non-Linux systems yet. For other systems, manually inspect the ~Makefile~ and
compile the module yourself.* Usage
#+begin_src emacs-lisp
(require 'tokenizers)
(setq tk (tokenizers-from-pretrained "sentence-transformers/all-MiniLM-L6-v2"));; Returns a list of three vectors, token-ids, type-ids, and attention-mask
;; Last argument tells whether to add special tokens
(tokenizers-encode tk "Test sentence with some words" t);; Returns a list of three 2d vectors (batch size is the first dimension),
;; token-ids, type-ids, and attention-mask
(tokenizers-encode-batch tk ["This is an example sentence" "Each sentence is converted"] t)
#+end_src