An open API service indexing awesome lists of open source software.

https://github.com/lepisma/tokenizers.el

Fast tokenizers for Emacs Lisp backed by Huggingface’s rust library
https://github.com/lepisma/tokenizers.el

emacs-lisp rust tokenizers

Last synced: 3 months ago
JSON representation

Fast tokenizers for Emacs Lisp backed by Huggingface’s rust library

Awesome Lists containing this project

README

        

#+TITLE: tokenizers.el

Fast tokenizers for Emacs Lisp backed by [[https://github.com/huggingface/tokenizers][Huggingface's rust library]].

To start with, this will support inference from pretrained models since that's
what needed in [[https://github.com/lepisma/onnx.el][onnx.el]].

* Installation
There is a dynamic module that needs you to have cargo installed. This module is
automatically compiled when you install the package using something like this:

#+begin_src emacs-lisp
(use-package tokenizers
:vc (:fetcher github :repo lepisma/tokenizers.el)
:demand t)
#+end_src

Note that this method only handles creating ~.so~ files which means it won't work
on non-Linux systems yet. For other systems, manually inspect the ~Makefile~ and
compile the module yourself.

* Usage
#+begin_src emacs-lisp
(require 'tokenizers)
(setq tk (tokenizers-from-pretrained "sentence-transformers/all-MiniLM-L6-v2"))

;; Returns a list of three vectors, token-ids, type-ids, and attention-mask
;; Last argument tells whether to add special tokens
(tokenizers-encode tk "Test sentence with some words" t)

;; Returns a list of three 2d vectors (batch size is the first dimension),
;; token-ids, type-ids, and attention-mask
(tokenizers-encode-batch tk ["This is an example sentence" "Each sentence is converted"] t)
#+end_src