https://github.com/juliensimon/sagemaker-inference-container-cpu
An Amazon SageMaker Container for Hugging Face Inference on Graviton and Intel CPUs
- Host: GitHub
- URL: https://github.com/juliensimon/sagemaker-inference-container-cpu
- Owner: juliensimon
- License: other
- Created: 2025-08-12T09:08:21.000Z (10 months ago)
- Default Branch: master
- Last Pushed: 2025-10-06T07:52:28.000Z (9 months ago)
- Last Synced: 2026-02-08T02:21:37.740Z (4 months ago)
- Topics: amd64, arm64, aws, docker, docker-compose, graviton, helm, huggingface, inference, intel, kubernetes-deployment, llamacpp, local-deployment, python, sagemaker
- Language: Python
- Homepage:
- Size: 118 KB
- Stars: 11
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README-multiarch.md
- License: LICENSE
# Multi-Architecture AFM-4.5B Inference Container
This repository provides a multi-architecture Docker container for running the AFM-4.5B model on both ARM64 and AMD64 platforms with architecture-specific optimizations.
## Repository Structure
```
sagemaker-inference-container-graviton/
├── docker/
│   ├── arm64/                      # ARM64-specific configurations
│   │   ├── Dockerfile              # ARM64-optimized Dockerfile
│   │   └── docker-compose.yml      # ARM64-specific compose file
│   ├── amd64/                      # AMD64/Intel-specific configurations
│   │   ├── Dockerfile              # AMD64-optimized Dockerfile
│   │   └── docker-compose.yml      # AMD64-specific compose file
│   └── multiarch/                  # Multi-architecture configurations
│       ├── Dockerfile              # Multi-arch Dockerfile
│       └── docker-compose.yml      # Multi-arch compose file
├── scripts/
│   ├── build-multiarch.sh          # Build for all architectures
│   ├── build-arm64.sh              # Build for ARM64 only
│   ├── build-amd64.sh              # Build for AMD64 only
│   └── detect-architecture.sh      # Auto-detect and configure
├── config/
│   ├── arm64/                      # ARM64-specific build configs
│   ├── amd64/                      # AMD64-specific build configs
│   └── common/                     # Shared configurations
├── docs/
│   ├── arm64-setup.md              # ARM64 setup guide
│   ├── amd64-setup.md              # AMD64 setup guide
│   └── multiarch-deployment.md     # Multi-arch deployment guide
├── app/                            # Shared application code
└── tests/                          # Architecture-specific tests
```
## Quick Start
### 1. Auto-Detect Your Architecture
```bash
# This will automatically configure everything for your platform
source scripts/detect-architecture.sh
```
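For readers who want to see what architecture detection involves, here is a minimal Python sketch. It is not the contents of `scripts/detect-architecture.sh`, which this README does not show. The function names `normalize_arch` and `compose_file_for` are hypothetical, and the assumption that the script exports `ARCH_NAME` and `COMPOSE_FILE` is inferred only from how those variables are used in the steps that follow.

```python
import platform

# Assumed mapping from raw machine names (as printed by `uname -m`)
# to the arm64/amd64 directory names used in this repository.
_ALIASES = {
    "aarch64": "arm64",
    "arm64": "arm64",
    "x86_64": "amd64",
    "amd64": "amd64",
}


def normalize_arch(machine: str) -> str:
    """Return 'arm64' or 'amd64' for a raw machine string."""
    try:
        return _ALIASES[machine.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine!r}") from None


def compose_file_for(arch_name: str) -> str:
    """Build the per-architecture compose path shown in the repository tree."""
    return f"docker/{arch_name}/docker-compose.yml"


if __name__ == "__main__":
    arch_name = normalize_arch(platform.machine())
    print(f"ARCH_NAME={arch_name}")
    print(f"COMPOSE_FILE={compose_file_for(arch_name)}")
```

Raising on unknown machine names, rather than guessing a default, keeps a misdetected platform from silently selecting the wrong Dockerfile.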
### 2. Build for Your Platform
```bash
# Build for your detected architecture
./scripts/build-$ARCH_NAME.sh
# Or build for all architectures
./scripts/build-multiarch.sh
```
### 3. Run the Service
```bash
# First run (download, convert, quantize)
docker-compose -f $COMPOSE_FILE --profile first-run up --build afm-first-run
# Subsequent runs (fast startup)
docker-compose -f $COMPOSE_FILE --profile fast up afm-fast
```
## Prerequisites
- Docker and Docker Compose installed
- Hugging Face access token with access to AFM-4.5B (gated model)
- Sufficient disk space (~15 GB for the full model plus conversions)
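The prerequisites above can be checked before starting a long first run. The following is a rough sketch, not part of the repository: `check_prerequisites` is a hypothetical helper, and reading the token from an `HF_TOKEN` environment variable is an assumption, since this README does not say how the container receives the token.

```python
import os
import shutil

REQUIRED_BYTES = 15 * 10**9  # ~15 GB, the figure quoted in the prerequisites


def check_prerequisites(free_bytes, token, docker_path):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if docker_path is None:
        problems.append("docker not found on PATH")
    if not token:
        problems.append("Hugging Face token is not set")
    if free_bytes < REQUIRED_BYTES:
        problems.append(
            f"only {free_bytes / 1e9:.1f} GB free, about 15 GB needed"
        )
    return problems


if __name__ == "__main__":
    found = check_prerequisites(
        free_bytes=shutil.disk_usage(".").free,
        token=os.environ.get("HF_TOKEN"),  # assumed variable name
        docker_path=shutil.which("docker"),
    )
    for problem in found:
        print("missing:", problem)
```

Keeping the check itself free of I/O makes it easy to exercise with made-up values.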
## Build Options
### Single Architecture Build
```bash
# ARM64 only
./scripts/build-arm64.sh
# AMD64 only
./scripts/build-amd64.sh
```
## Deployment Options
### Option 1: Auto-Detection (Recommended)
```bash
source scripts/detect-architecture.sh
docker-compose -f $COMPOSE_FILE --profile fast up afm-fast
```
### Option 2: Manual Selection
```bash
# ARM64
docker-compose -f docker/arm64/docker-compose.yml --profile fast up afm-fast
# AMD64
docker-compose -f docker/amd64/docker-compose.yml --profile fast up afm-fast
```
## Performance Comparison
| Metric | ARM64 | AMD64 | Notes |
|--------|-------|-------|-------|
| Build Time | ~15-20 min | ~10-15 min | AMD64 typically faster |
| Startup Time | ~30-45s | ~25-35s | Depends on hardware |
| Inference Speed | ~12-20 tokens/s | ~15-25 tokens/s | CPU-dependent |
| Memory Usage | ~8GB | ~8GB | Similar across platforms |
| Power Efficiency | Better | Good | ARM64 more efficient |
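To put the inference speeds in the table into perspective, the short sketch below converts a token rate into generation time for a given response length. It uses only the ranges from the table, ignores prompt processing and network overhead, and the helper names are hypothetical.

```python
# Tokens-per-second ranges (slow, fast) copied from the table above.
RANGES = {"ARM64": (12, 20), "AMD64": (15, 25)}


def generation_seconds(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a steady decoding rate."""
    return num_tokens / tokens_per_second


def latency_range(num_tokens, arch):
    """Return (best_case, worst_case) seconds for the given architecture."""
    slow, fast = RANGES[arch]
    return (
        generation_seconds(num_tokens, fast),
        generation_seconds(num_tokens, slow),
    )


if __name__ == "__main__":
    for arch in RANGES:
        best, worst = latency_range(50, arch)
        print(f"{arch}: {best:.1f}s to {worst:.1f}s for 50 tokens")
```

For a 50-token reply this gives roughly 2.5 to 4.2 seconds on ARM64 and 2.0 to 3.3 seconds on AMD64.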
## Testing
### Health Check
```bash
curl http://localhost:8080/ping
```
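Because startup takes tens of seconds, a script that depends on the service may want to poll `/ping` until it answers. The sketch below is one way to do that with the Python standard library. The function names, retry count, and delay are illustrative assumptions, and treating HTTP 200 as "healthy" follows the usual SageMaker container convention rather than anything stated in this README.

```python
import http.client
import time
import urllib.request


def ping(url="http://localhost:8080/ping", timeout=2.0):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed reply.
        return False


def wait_until_healthy(probe=ping, attempts=30, delay=2.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    print("healthy" if wait_until_healthy() else "service did not come up")
```

Passing the probe and sleep functions as parameters lets the retry loop be tested without a running container.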
### API Test
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
```
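The same request can be sent from Python using only the standard library. This is a minimal sketch mirroring the `curl` call above. `build_chat_request` and `chat` are hypothetical helper names, and the response is returned as parsed JSON without assuming its structure, since this README does not document the response schema.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080",
                       max_tokens=50, temperature=0.7):
    """Build a POST request equivalent to the curl example."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120, **kwargs):
    """Send the request and return the decoded JSON body as-is."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(json.dumps(chat("Hello, how are you?"), indent=2))
```

The generous timeout reflects CPU inference speeds; adjust it to the response lengths you request.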
## Documentation
- [ARM64 Setup Guide](docs/arm64-setup.md) - Detailed ARM64 setup and optimization
- [AMD64 Setup Guide](docs/amd64-setup.md) - Detailed AMD64 setup and optimization
- [Original Docker Compose Guide](README-docker-compose.md) - Original setup guide
## Troubleshooting
### Common Issues
1. **Build failures**: Ensure you have the correct Docker platform support
2. **Performance issues**: Check thread count and memory allocation
3. **Model loading errors**: Verify sufficient disk space and memory
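When checking thread count and memory, it helps to see what the process is actually allowed to use. The sketch below is a Linux-oriented illustration with hypothetical helper names. It reports usable CPUs and available memory, which can be compared against the roughly 8 GB memory usage listed in the performance table. It does not read any of this project's own configuration.

```python
import os


def usable_cpus():
    """CPUs this process may run on (honors affinity/cpuset limits on Linux).

    Note: this does not reflect CFS quota limits such as `docker run --cpus`.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_gib(meminfo_path="/proc/meminfo"):
    """Available memory in GiB, from the MemAvailable field (reported in kB)."""
    with open(meminfo_path) as fh:
        return parse_meminfo(fh.read())["MemAvailable"] / 1024**2


if __name__ == "__main__":
    print(f"usable CPUs: {usable_cpus()}")
    print(f"available memory: {available_gib():.1f} GiB")
```

Running it inside the container and on the host shows whether Docker resource limits are narrower than expected.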
### Debug Commands
```bash
# Check architecture
uname -m
# Check Docker platform
docker version
# Check container logs
docker-compose -f $COMPOSE_FILE logs afm-fast
# Check resource usage
docker stats
```
## Contributing
When contributing to this multi-architecture setup:
1. **Test on both platforms**: Ensure changes work on ARM64 and AMD64
2. **Update documentation**: Keep architecture-specific guides current
3. **Add tests**: Include tests for both architectures
4. **Performance testing**: Benchmark changes on both platforms
## License
This project is licensed under the same terms as the original repository.
## Acknowledgments
- Original AFM-4.5B model by Arcee AI
- llama.cpp for the inference engine