
# slimt

**slimt** (_slɪm tiː_) is an inference frontend for
[tiny](https://github.com/browsermt/students/tree/master/deen/ende.student.tiny11)
[models](https://github.com/browsermt/students) trained as part of the
[Bergamot project](https://browser.mt/).

[bergamot-translator](https://github.com/browsermt/bergamot-translator/) builds
on top of [marian-dev](https://github.com/marian-nmt/marian-dev) and uses the
inference code-path from marian-dev. While marian is a capable neural-network
library with a focus on machine translation, the bells and whistles that come
with it (e.g. autograd, support for multiple sequence-to-sequence
architectures, beam search) are not necessary to run inference on client
machines. For some use cases, such as an input-method engine doing translation
(see [lemonade](https://github.com/jerinphilip/lemonade)), single-threaded
operation coexisting with other processes on the system suffices. This is the
motivation for this transplant repository: there is not much novel here except
ease of use. This repository is simply the _tiny_ part of marian, reusing code
where possible.

This effort is inspired by contemporary efforts like
[ggerganov/ggml](https://github.com/ggerganov/ggml) and
[karpathy/llama2.c](https://github.com/karpathy/llama2.c). _tiny_ models roughly
follow the [transformer architecture](https://arxiv.org/abs/1706.03762), with
[Simpler Simple Recurrent Units](https://aclanthology.org/D19-5632/) (SSRU) in
the decoder. The same models are used in Mozilla Firefox's [offline translation
addon](https://addons.mozilla.org/en-US/firefox/addon/firefox-translations/).
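As a rough illustration of the SSRU recurrence described in the paper linked above: SSRU drops the SRU's reset gate and replaces its tanh with the identity, leaving only a forget gate followed by a ReLU. The sketch below is a minimal, made-up scalar (per-dimension) version in plain Python; real implementations operate on vectors with learned weight matrices, and the weights here are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ssru_step(x, c_prev, wf=1.0, bf=0.0, w=0.5):
    """One per-dimension SSRU step: forget gate, no reset gate, no tanh."""
    f = sigmoid(wf * x + bf)              # forget gate
    c = f * c_prev + (1.0 - f) * (w * x)  # cell update (identity, not tanh)
    o = max(c, 0.0)                       # ReLU output
    return o, c

# Toy run over a short input sequence.
c, outputs = 0.0, []
for x in [0.5, -1.0, 2.0]:
    o, c = ssru_step(x, c)
    outputs.append(o)
print(outputs)
```

Because there is no matrix-matrix dependence inside the recurrence, each step is cheap, which is what makes SSRU decoders attractive for CPU inference.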

Both `tiny` and `base` models have 6 encoder layers and 2 decoder layers and,
for most language pairs, a vocabulary of 32000 (with tied embeddings). The
following table briefly summarizes some architectural differences between
`tiny` and `base` models:

| Variant | emb | ffn | params | f32 | i8 |
| ------- | --- | --- | ------ | ----- | ---- |
| `base` | 512 | 2048 | 39.0M | 149MB | 38MB |
| `tiny` | 256 | 1536 | 15.7M | 61MB | 17MB |

The `i8` models, quantized to 8-bit and as small as 17MB, are used to provide
translation for Mozilla Firefox's offline translation addon, among other
things.
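The sizes in the table are roughly parameters × bytes per weight; a quick back-of-the-envelope check, using the parameter counts from the table above (the on-disk files come out slightly different, presumably because they also bundle headers and quantization metadata):

```python
# Approximate on-disk size: parameters × bytes per weight, in MiB.
params = {"base": 39.0e6, "tiny": 15.7e6}

def size_mib(n_params, bytes_per_weight):
    return n_params * bytes_per_weight / 2**20

for name, n in params.items():
    print(f"{name}: f32 ~ {size_mib(n, 4):.0f} MiB, i8 ~ {size_mib(n, 1):.0f} MiB")
```

This is also why 8-bit quantization gives close to a 4x size reduction over `f32`.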

More information on the models is available in the following papers:

* [From Research to Production and Back: Ludicrously Fast Neural Machine Translation](https://aclanthology.org/D19-5632)
* [Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task](https://aclanthology.org/2020.ngt-1.26/)

The large list of dependencies from bergamot-translator has currently been
reduced to:

* For `int8_t` matrix multiplication - [intgemm](https://github.com/kpu/intgemm)
(`x86_64`), [ruy](https://github.com/google/ruy) (`aarch64`), or
[xsimd](https://github.com/xtensor-stack/xsimd) via
[gemmology](https://github.com/mozilla/gemmology).
* For vocabulary - [sentencepiece](https://github.com/browsermt/sentencepiece).
* For sentence-splitting using regular expressions -
[PCRE2](https://github.com/PCRE2Project/pcre2).
* For `sgemm` - whatever BLAS provider is found via CMake (OpenBLAS, Intel
oneAPI MKL, cblas). Feel free to provide
[hints](https://cmake.org/cmake/help/latest/module/FindBLAS.html#blas-lapack-vendors).
* [CLI11](https://github.com/CLIUtils/CLI11/) (only a dependency for the command line).

Source code is made public where basic functionality (text translation) works
for English-German tiny models. Parity in features and speed with marian and
bergamot-translator (where relevant) is a work in progress. Eventual support
for `base` models is planned. Contributions are welcome and appreciated.

## Getting started

Clone with submodules.

```bash
git clone --recursive https://github.com/jerinphilip/slimt.git
```

Configure and build. `slimt` is still experimenting with CMake and
dependencies. The following, which is being prepared for Linux distribution
packaging, should work at the moment:

```bash
# Configure to use xsimd via gemmology
ARGS=(
# Use gemmology
-DWITH_GEMMOLOGY=ON

# On x86_64 machines, the following enable faster matrix-multiplication
# backends using SIMD. All of these can co-exist; the best one for the
# CPU detected at runtime is dispatched to.
-DUSE_AVX512=ON -DUSE_AVX2=ON -DUSE_SSSE3=ON -DUSE_SSE2=ON

# For aarch64 or armv7+neon, comment out the x86_64 flags above and
# uncomment the line below.
# -DUSE_NEON=ON

# Use sentencepiece installed from the system.
-DUSE_BUILTIN_SENTENCEPIECE=OFF

# Exports slimtConfig.cmake (CMake) and slimt.pc (pkg-config).
-DSLIMT_PACKAGE=ON

# Customize installation prefix if need be.
-DCMAKE_INSTALL_PREFIX=/usr/local
)

cmake -B build -S $PWD -DCMAKE_BUILD_TYPE=Release "${ARGS[@]}"
cmake --build build --target all

# Requires sudo since /usr/local is usually writable only by root.
sudo cmake --build build --target install
```

The above expects the packages `sentencepiece`, `xsimd`, and a BLAS provider
to come from the system's package manager. Examples for some distributions:

```bash
# Debian-based systems
sudo apt-get install -y libxsimd-dev libsentencepiece-dev libopenblas-dev

# ArchLinux
pacman -S openblas xsimd
yay -S sentencepiece-git
```

A successful build generates two executables, `slimt-cli` and `slimt-test`,
for command-line usage and testing respectively. Angle-bracketed values below
are placeholders for paths to your model files.

```bash
build/bin/slimt-cli \
  --root <model-directory> \
  --model <model.bin> \
  --vocabulary <vocab.spm> \
  --shortlist <shortlist.bin>

build/slimt-test
```
This is still very much a work in progress, towards making
[lemonade](https://github.com/jerinphilip/lemonade) available in distributions.
Help is much appreciated; please get in touch if you can contribute.

### Python

Python bindings to the C++ code are available. The bindings add a layer to
download models and use them via the command-line entry point `slimt` (the
core slimt library contains only the inference code).

```bash
python3 -m venv env
source env/bin/activate
python3 -m pip install wheel
python3 setup.py bdist_wheel
python3 -m pip install dist/*.whl

# Download en-de-tiny and de-en-tiny models.
slimt download -m en-de-tiny
slimt download -m de-en-tiny
```
An example of the built wheel running on Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/12wFMVwOTzOyRjoeWtett2DTDhwNAbvBZ?usp=sharing)

You may pass custom CMake variables via the `CMAKE_ARGS` environment variable.

```bash
CMAKE_ARGS='-D...' python3 setup.py bdist_wheel
```