zml/zml
Any model. Any hardware. Zero compromise. Built with @ziglang / @openxla / MLIR / @bazelbuild
https://github.com/zml/zml
- Host: GitHub
- URL: https://github.com/zml/zml
- Owner: zml
- License: apache-2.0
- Created: 2024-09-17T09:13:32.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2025-01-09T16:19:34.000Z (12 days ago)
- Last Synced: 2025-01-09T20:17:00.878Z (12 days ago)
- Topics: ai, bazel, hpc, inference, xla, zig
- Language: Zig
- Homepage: https://docs.zml.ai
- Size: 1.25 MB
- Stars: 1,824
- Watchers: 25
- Forks: 66
- Open Issues: 24
- Metadata Files:
  - Readme: README.md
  - Contributing: CONTRIBUTING.md
  - License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - zml/zml
- Awesome-LLMOps - zml
README
[ZML]: https://zml.ai/
[Getting Started]: #getting-started
[Documentation]: https://docs.zml.ai
[Contributing]: ./CONTRIBUTING.md
[Discord]: https://discord.gg/6y72SN2E7H

# Bonjour 👋
At ZML, we are creating exciting AI products on top of our high-performance
AI inference stack. Our stack is built for production, using the amazing
[Zig](https://ziglang.org) language, [MLIR](https://mlir.llvm.org), and the
power of [Bazel](https://bazel.build).

---
# We're happy to share!
We're very happy to share our inference stack with the World and hope it allows
you, too, to build cool and exciting AI projects.

To give you a glimpse of what you can do with ZML, here is an early demo:
It shows a prototype running a LLaMA2 model sharded across one NVIDIA RTX 4090,
one AMD 6800XT, and one Google Cloud TPU v2. All accelerators were hosted in
different locations, with activations being passed over a VPN.

All processes used the same model code, cross-compiled on a Mac and copied onto
the servers.

For more inspiration, see also the examples below or check out the
[examples](./examples) folder.

# Getting started
## Prerequisites
We use `bazel` to build ZML and its dependencies. The only prerequisite is
`bazel`, which we recommend downloading through `bazelisk`, a version manager
for `bazel`.

**Please note:** if you do not wish to install `bazel` system-wide, we provide
[examples/bazel.sh](examples/bazel.sh), which downloads it to your home folder
and runs it from there.
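For instance, once you have checked out the repository, the wrapper stands in
for a system-wide `bazel` (a minimal sketch; `//mnist` is one of the example
targets described below):

```
cd examples
./bazel.sh run -c opt //mnist
```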
**Install Bazel** (recommended):

### macOS
```
brew install bazelisk
```

### Linux
```
curl -L -o /usr/local/bin/bazel 'https://github.com/bazelbuild/bazelisk/releases/download/v1.20.0/bazelisk-linux-amd64'
chmod +x /usr/local/bin/bazel
```
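With either method, a quick sanity check is to ask `bazel` for its version
(assuming it is on your `PATH`; with the wrapper, use `./bazel.sh version`
instead):

```
bazel version
```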
## Run a pre-packaged model

We have implemented a variety of example models in ZML. See our reference
implementations in the
[examples](https://github.com/zml/zml/tree/master/examples/) folder.

### MNIST
The [classic](https://en.wikipedia.org/wiki/MNIST_database) handwritten digits
recognition task. The model is tasked to recognize a handwritten digit, which
has been converted to a 28x28 pixel monochrome image. `Bazel` will download a
pre-trained model, and the test dataset. The program will load the model,
compile it, and classify a randomly picked example from the test dataset.

On the command line:
```
cd examples
bazel run -c opt //mnist

# or
./bazel.sh run -c opt //mnist
```

### TinyLlama, Stories 15M
Our LLM examples start with a small model trained specifically on a dataset of
simple children's stories. This model was trained by [Andrej
Karpathy](https://x.com/karpathy); you can read more about it on his
[GitHub](https://github.com/karpathy/llama2.c).

```
cd examples
bazel run -c opt //llama:TinyLlama-Stories-15M
bazel run -c opt //llama:TinyLlama-Stories-15M -- --prompt="Once upon a time, there was a cute little dragon"
```

### OpenLLaMA 3B
```
cd examples
bazel run -c opt //llama:OpenLLaMA-3B
bazel run -c opt //llama:OpenLLaMA-3B -- --prompt="Once upon a time,"
```

### Meta Llama 3.1 8B
This model has restrictions, see
[here](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). It **requires
approval from Meta on Huggingface**, which can take a few hours to be granted.

While waiting, you can already generate an access token to log into Huggingface
from `bazel`; see [here](./docs/huggingface-access-token.md).
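As the comments in the snippet below note, the token is read from
`$HOME/.cache/huggingface/token` or from the `HUGGINGFACE_TOKEN` environment
variable; a minimal setup sketch, assuming `huggingface-cli` is installed:

```
# writes the token to $HOME/.cache/huggingface/token
huggingface-cli login

# or export it just for the current shell
export HUGGINGFACE_TOKEN=<your-token>
```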
Once you've been granted access, you're ready to download a gated model like
`Meta-Llama-3.1-8B-Instruct`!

```
# requires token in $HOME/.cache/huggingface/token, as created by the
# `huggingface-cli login` command, or the `HUGGINGFACE_TOKEN` environment variable.
cd examples
bazel run -c opt //llama:Llama-3.1-8B-Instruct
bazel run -c opt //llama:Llama-3.1-8B-Instruct -- --prompt="Once upon a time,"
```

You can also try Llama-3.1-70B-Instruct if you have enough memory.
### Meta Llama 3.2 1B
Like the 8B model above, this model also requires approval. See
[here](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) for access requirements.

```
cd examples
bazel run -c opt //llama:Llama-3.2-1B-Instruct
bazel run -c opt //llama:Llama-3.2-1B-Instruct -- --prompt="Once upon a time,"
```

For a larger 3.2 model, you can also try Llama-3.2-3B-Instruct.
## Running Models on GPU / TPU
You can compile models for accelerator runtimes by appending one or more of the
following arguments to the command line when compiling / running a model:

- NVIDIA CUDA: `--@zml//runtimes:cuda=true`
- AMD RoCM: `--@zml//runtimes:rocm=true`
- Google TPU: `--@zml//runtimes:tpu=true`
- AWS Trainium/Inferentia 2: `--@zml//runtimes:neuron=true`
- **AVOID CPU:** `--@zml//runtimes:cpu=false`

The latter, avoiding compilation for CPU, cuts down compilation time; see the
combined sketch after the example below.
So, to run the OpenLLama model from above on your host sporting an NVIDIA GPU,
run the following:

```
cd examples
bazel run -c opt //llama:OpenLLaMA-3B \
--@zml//runtimes:cuda=true \
-- --prompt="Once upon a time,"
```
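The runtime flags combine freely, so you can enable an accelerator and skip the
CPU target in one invocation; a sketch using only the flags listed above:

```
cd examples
bazel run -c opt //llama:OpenLLaMA-3B \
  --@zml//runtimes:cuda=true \
  --@zml//runtimes:cpu=false \
  -- --prompt="Once upon a time,"
```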
## Run Tests

```
bazel test //zml:test
```

# A taste of ZML
## MNIST
```zig
const std = @import("std");
const zml = @import("zml");

/// Model definition
const Mnist = struct {
    fc1: Layer,
    fc2: Layer,

    const Layer = struct {
        weight: zml.Tensor,
        bias: zml.Tensor,

        pub fn forward(self: Layer, input: zml.Tensor) zml.Tensor {
            return self.weight.matmul(input).add(self.bias).relu();
        }
    };

    /// just two linear layers + relu activation
    pub fn forward(self: Mnist, input: zml.Tensor) zml.Tensor {
        std.log.info("Compiling for target: {s}", .{@tagName(input.getContext().target())});
        var x = input.flattenAll().convert(.f32);
        const layers: []const Layer = &.{ self.fc1, self.fc2 };
        for (layers) |layer| {
            x = zml.call(layer, .forward, .{x});
        }
        return x.argMax(0, .u8).indices;
    }
};
```

## Tagged Tensors
```zig
const Sdpa = struct {
    pub fn forward(_: Sdpa, ctx: *zml.Context, q_: zml.Tensor, k_: zml.Tensor, v_: zml.Tensor) zml.Tensor {
        // Name the axes: batch, heads, query/key sequence length, head dimension.
        const q = q_.withTags(.{ .b, .h, .q, .hd });
        const k = k_.withTags(.{ .b, .h, .k, .hd });
        const v = v_.withTags(.{ .b, .h, .k, .hd });
        // Build a causal mask sized by the tagged q/k dimensions.
        const attn_mask = zml.nn.causalAttnMask(ctx, .{ .q = q.dim(.q), .k = k.dim(.k) }, q.dtype(), null);
        return zml.nn.sdpa(ctx, q, k, v, .{ .attn_mask = attn_mask });
    }
};
```

# Where to go next:
You might want to check out more [examples](./examples), read through the
[documentation directly on GitHub](./docs/README.md), or, for the full rendering
experience, browse the
[online documentation with included API reference](https://docs.zml.ai).

# Contributing
See [here][Contributing].
# License
ZML is licensed under the [Apache 2.0 license](./LICENSE).
# Thanks to our contributors