https://github.com/ai-hypercomputer/jetstream
JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).
- Host: GitHub
- URL: https://github.com/ai-hypercomputer/jetstream
- Owner: AI-Hypercomputer
- License: apache-2.0
- Created: 2024-03-01T00:24:07.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-21T23:34:44.000Z (about 1 month ago)
- Last Synced: 2025-03-22T02:42:31.098Z (about 1 month ago)
- Topics: gemma, gpt, gpu, inference, jax, large-language-models, llama, llama2, llm, llm-inference, llmops, mlops, model-serving, pytorch, tpu, transformer
- Language: Python
- Homepage:
- Size: 5.32 MB
- Stars: 299
- Watchers: 18
- Forks: 37
- Open Issues: 22
- Metadata Files:
  - Readme: README.md
  - Contributing: CONTRIBUTING.md
  - License: LICENSE
  - Codeowners: .github/CODEOWNERS
  - Authors: AUTHORS
README
[Unit Tests](https://github.com/google/JetStream/actions/workflows/unit_tests.yaml?query=branch:main)
[PyPI version](https://badge.fury.io/py/google-jetstream)
[PyPI downloads](https://pypi.org/project/google-jetstream/)
[Contributing](CONTRIBUTING.md)

# JetStream is a throughput and memory optimized engine for LLM inference on XLA devices.
## About
JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).
## JetStream Engine Implementation
Currently, there are two reference engine implementations available -- one for JAX models and another for PyTorch models. A sketch of the engine contract they share appears after the links below.
### JAX
- Git: https://github.com/google/maxtext
- README: https://github.com/google/JetStream/blob/main/docs/online-inference-with-maxtext-engine.md

### PyTorch
- Git: https://github.com/google/jetstream-pytorch
- README: https://github.com/google/jetstream-pytorch/blob/main/README.md
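For orientation, here is a minimal sketch of what a JetStream-style engine contract looks like. The class name, method names, and signatures below are illustrative assumptions for exposition, not the actual `jetstream.engine` API; consult the engine implementations linked above for the real interface.

```
# Illustrative sketch only -- names and signatures are assumptions,
# not the actual jetstream.engine API.
import abc
from typing import Any


class InferenceEngine(abc.ABC):
    """Hypothetical engine contract: prefill a prompt, then decode tokens."""

    @abc.abstractmethod
    def load_params(self) -> Any:
        """Load model weights onto the XLA device(s)."""

    @abc.abstractmethod
    def prefill(self, params: Any, tokens: list[int]) -> Any:
        """Run the prompt through the model once; return a decode state."""

    @abc.abstractmethod
    def generate(self, params: Any, decode_state: Any) -> tuple[Any, int]:
        """Produce one token per call, returning updated state and a token id."""
```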
## Documentation

- [Online Inference with MaxText on v5e Cloud TPU VM](https://cloud.google.com/tpu/docs/tutorials/LLM/jetstream) [[README](https://github.com/google/JetStream/blob/main/docs/online-inference-with-maxtext-engine.md)]
- [Online Inference with PyTorch on v5e Cloud TPU VM](https://cloud.google.com/tpu/docs/tutorials/LLM/jetstream-pytorch) [[README](https://github.com/google/jetstream-pytorch/tree/main?tab=readme-ov-file#jetstream-pytorch)]
- [Serve Gemma using TPUs on GKE with JetStream](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-tpu-jetstream)
- [Benchmark JetStream Server](https://github.com/google/JetStream/blob/main/benchmarks/README.md)
- [Observability in JetStream Server](https://github.com/google/JetStream/blob/main/docs/observability-prometheus-metrics-in-jetstream-server.md)
- [Profiling in JetStream Server](https://github.com/google/JetStream/blob/main/docs/profiling-with-jax-profiler-and-tensorboard.md)
- [JetStream Standalone Local Setup](#jetstream-standalone-local-setup)

# JetStream Standalone Local Setup
## Getting Started
### Setup
```
make install-deps
```

### Run local server & Testing
Use the following commands to run a server locally:
```
# Start a server
python -m jetstream.core.implementations.mock.server

# Test local mock server
python -m jetstream.tools.requester

# Load test local mock server
python -m jetstream.tools.load_tester
```
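If you want to drive the server from your own code rather than the bundled tools, the server speaks gRPC. The snippet below is a hedged sketch: it assumes the generated stubs are importable as `jetstream.core.proto.jetstream_pb2` / `jetstream_pb2_grpc`, that the service is named `Orchestrator` with a server-streaming `Decode` RPC, and that the local server listens on port 9000. Verify all of these against the proto definition in the repo before relying on them.

```
# Hedged sketch of a raw gRPC client for the local mock server.
# Assumptions (check against jetstream/core/proto/jetstream.proto):
#   - generated modules jetstream_pb2 / jetstream_pb2_grpc are importable
#   - service Orchestrator exposes a server-streaming Decode RPC
#   - the mock server listens on localhost:9000
import grpc

from jetstream.core.proto import jetstream_pb2
from jetstream.core.proto import jetstream_pb2_grpc


def main() -> None:
    channel = grpc.insecure_channel("localhost:9000")
    stub = jetstream_pb2_grpc.OrchestratorStub(channel)
    request = jetstream_pb2.DecodeRequest()  # fields omitted; see the proto
    for response in stub.Decode(request):    # streamed tokens arrive here
        print(response)


if __name__ == "__main__":
    main()
```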
### Test core modules
```
# Test JetStream core orchestrator
python -m unittest -v jetstream.tests.core.test_orchestrator

# Test JetStream core server library
python -m unittest -v jetstream.tests.core.test_server

# Test mock JetStream engine implementation
python -m unittest -v jetstream.tests.engine.test_mock_engine

# Test mock JetStream token utils
python -m unittest -v jetstream.tests.engine.test_token_utils
python -m unittest -v jetstream.tests.engine.test_utils
```
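To run the whole suite in one command, standard unittest discovery should also work, assuming the test modules keep the `test_*.py` naming shown above and `jetstream/tests` is an importable package:

```
# Run all tests via discovery (assumes standard test naming and package layout)
python -m unittest discover -v -s jetstream/tests -p "test_*.py"
```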