https://github.com/juliensimon/sagemaker-inference-container-cpu

An Amazon SageMaker Container for Hugging Face Inference on Graviton and Intel CPUs

# Multi-Architecture AFM-4.5B Inference Container

This repository provides a multi-architecture Docker container for running the AFM-4.5B model on both ARM64 and AMD64 platforms with architecture-specific optimizations.

## 🏗️ Repository Structure

```
sagemaker-inference-container-graviton/
├── docker/
│   ├── arm64/                      # ARM64-specific configurations
│   │   ├── Dockerfile              # ARM64-optimized Dockerfile
│   │   └── docker-compose.yml      # ARM64-specific compose file
│   ├── amd64/                      # AMD64/Intel-specific configurations
│   │   ├── Dockerfile              # AMD64-optimized Dockerfile
│   │   └── docker-compose.yml      # AMD64-specific compose file
│   └── multiarch/                  # Multi-architecture configurations
│       ├── Dockerfile              # Multi-arch Dockerfile
│       └── docker-compose.yml      # Multi-arch compose file
├── scripts/
│   ├── build-multiarch.sh          # Build for all architectures
│   ├── build-arm64.sh              # Build for ARM64 only
│   ├── build-amd64.sh              # Build for AMD64 only
│   └── detect-architecture.sh      # Auto-detect and configure
├── config/
│   ├── arm64/                      # ARM64-specific build configs
│   ├── amd64/                      # AMD64-specific build configs
│   └── common/                     # Shared configurations
├── docs/
│   ├── arm64-setup.md              # ARM64 setup guide
│   ├── amd64-setup.md              # AMD64 setup guide
│   └── multiarch-deployment.md     # Multi-arch deployment guide
├── app/                            # Shared application code
└── tests/                          # Architecture-specific tests
```

## 🚀 Quick Start

### 1. Auto-Detect Your Architecture

```bash
# Detects your platform and sets the variables used below (ARCH_NAME, COMPOSE_FILE)
source scripts/detect-architecture.sh
```
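
The detection script itself is not reproduced in this README. As a rough sketch of what such a step could look like, the snippet below assumes it maps the output of `uname -m` to `arm64` or `amd64` and exports `ARCH_NAME` and `COMPOSE_FILE` (variable names inferred from the commands in this guide). The function `normalize_arch` is hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: not the repository's detect-architecture.sh.

# Map a raw machine name (as printed by `uname -m`) to arm64 or amd64.
normalize_arch() {
  case "$1" in
    aarch64|arm64) echo "arm64" ;;
    x86_64|amd64)  echo "amd64" ;;
    *)             echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Export the variables that the build and run commands expect (assumed names).
if ARCH_NAME="$(normalize_arch "$(uname -m)")"; then
  export ARCH_NAME
  export COMPOSE_FILE="docker/${ARCH_NAME}/docker-compose.yml"
  echo "Detected ${ARCH_NAME}; using ${COMPOSE_FILE}"
fi
```

Because the script is meant to be sourced, the sketch never calls `exit`, which would close the calling shell.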

### 2. Build for Your Platform

```bash
# Build for your detected architecture
./scripts/build-$ARCH_NAME.sh

# Or build for all architectures
./scripts/build-multiarch.sh
```

### 3. Run the Service

```bash
# First run (download, convert, quantize)
docker-compose -f $COMPOSE_FILE --profile first-run up --build afm-first-run

# Subsequent runs (fast startup)
docker-compose -f $COMPOSE_FILE --profile fast up afm-fast
```

## 📋 Prerequisites

- Docker and Docker Compose installed
- Hugging Face token with access to AFM-4.5B (gated model)
- Sufficient disk space (~15GB for the full model plus the converted and quantized files)
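
A quick preflight check can catch missing prerequisites before a long first run. The sketch below is illustrative only: the helper names are hypothetical, and `HF_TOKEN` is an assumed name for the environment variable holding the Hugging Face token (check the compose files for the name they actually read).

```bash
#!/usr/bin/env bash
# Hypothetical preflight check; variable names and thresholds are assumptions.

# Succeed if the filesystem holding directory $1 has at least $2 GB free.
has_free_gb() {
  local dir="$1" need_gb="$2" avail_kb
  # With -P, the fourth column of the second line is the available space in KB.
  avail_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
  [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}

# Report every missing prerequisite, then return non-zero if any check failed.
preflight() {
  local failed=0
  command -v docker >/dev/null 2>&1 || { echo "docker not found" >&2; failed=1; }
  [ -n "${HF_TOKEN:-}" ] || { echo "HF_TOKEN is not set" >&2; failed=1; }
  has_free_gb . 15 || { echo "less than 15 GB free in $(pwd)" >&2; failed=1; }
  return "$failed"
}
```

Running `preflight` from the repository root prints one line per failed check and returns a non-zero status if anything is missing.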

## 🔧 Build Options

### Single Architecture Build
```bash
# ARM64 only
./scripts/build-arm64.sh

# AMD64 only
./scripts/build-amd64.sh
```
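
The contents of the build scripts are not shown here. For orientation, a per-architecture build could be expressed with `docker buildx build` roughly as follows. This is a sketch under assumptions: the image tag `afm-inference:<arch>` is a placeholder, the repository root is assumed to be the build context, and the real scripts may pass different options.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; the repository's build scripts may differ.

# Print the buildx command for one architecture (arm64 or amd64).
build_command() {
  local arch="$1" tag="${2:-afm-inference:$1}"   # image tag is a placeholder
  echo docker buildx build \
    --platform "linux/${arch}" \
    --file "docker/${arch}/Dockerfile" \
    --tag "${tag}" \
    --load .
}

# Dry run: show the command instead of executing it.
build_command arm64
```

Printing the command first makes it easy to review the platform and Dockerfile path before starting a build that may take 10-20 minutes.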

## 🚀 Deployment Options

### Option 1: Auto-Detection (Recommended)
```bash
source scripts/detect-architecture.sh
docker-compose -f $COMPOSE_FILE --profile fast up afm-fast
```

### Option 2: Manual Selection
```bash
# ARM64
docker-compose -f docker/arm64/docker-compose.yml --profile fast up afm-fast

# AMD64
docker-compose -f docker/amd64/docker-compose.yml --profile fast up afm-fast
```

## 📊 Performance Comparison

| Metric | ARM64 | AMD64 | Notes |
|--------|-------|-------|-------|
| Build Time | ~15-20 min | ~10-15 min | AMD64 typically faster |
| Startup Time | ~30-45s | ~25-35s | Depends on hardware |
| Inference Speed | ~12-20 tokens/s | ~15-25 tokens/s | CPU-dependent |
| Memory Usage | ~8GB | ~8GB | Similar across platforms |
| Power Efficiency | Better | Good | ARM64 more efficient |
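
To put the throughput figures in perspective, the time to generate a response is roughly the token count divided by the tokens/s rate. The helper below (a hypothetical name, for illustration only) applies that to a 50-token reply and ignores prompt processing and startup time.

```bash
#!/usr/bin/env bash
# Seconds needed to generate N tokens at a given throughput (tokens/s).
generation_seconds() {
  # LC_ALL=C keeps "." as the decimal separator regardless of locale.
  LC_ALL=C awk -v tokens="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", tokens / rate }'
}

# 50 tokens at the lower and upper bounds from the table above.
echo "ARM64: $(generation_seconds 50 20)-$(generation_seconds 50 12) s"
echo "AMD64: $(generation_seconds 50 25)-$(generation_seconds 50 15) s"
```

With the table's figures this gives about 2.5-4.2 s on ARM64 and 2.0-3.3 s on AMD64 for a 50-token reply.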

## 🧪 Testing

### Health Check
```bash
curl http://localhost:8080/ping
```
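
Since startup can take tens of seconds, a single `curl` right after `docker-compose up` may fail even though the service is healthy. A minimal polling sketch, with a hypothetical function name and arbitrary default retry values, might look like this:

```bash
#!/usr/bin/env bash
# Poll a URL until it answers with a successful status or the attempts run out.
# Arguments: URL (default: the /ping endpoint), attempts (30), delay in seconds (2).
wait_for_ping() {
  local url="${1:-http://localhost:8080/ping}" attempts="${2:-30}" delay="${3:-2}" i
  for (( i = 1; i <= attempts; i++ )); do
    if curl --silent --fail --output /dev/null --max-time 5 "$url"; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "no response from ${url} after ${attempts} attempts" >&2
  return 1
}
```

Calling `wait_for_ping` with no arguments waits up to about a minute, which covers the startup times listed in the performance table.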

### API Test
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
```
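
Hand-written JSON inside single quotes breaks as soon as the prompt contains quotes or newlines. One possible workaround, assuming `jq` is installed, is to let it build the request body. The function names `chat_payload` and `chat` are hypothetical, and the fields mirror the request shown above.

```bash
#!/usr/bin/env bash
# Sketch only: requires jq; the function names are hypothetical.

# Build the request body, letting jq handle JSON escaping of the prompt.
# Arguments: prompt text, max_tokens (default 50).
chat_payload() {
  jq -cn --arg content "$1" --argjson max_tokens "${2:-50}" \
    '{messages: [{role: "user", content: $content}],
      max_tokens: $max_tokens, temperature: 0.7}'
}

# Send the prompt to the local endpoint used in the API test.
chat() {
  chat_payload "$@" | curl --silent --fail -X POST \
    http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    --data @-
}
```

For example, `chat 'Summarize "llama.cpp" in one sentence' 80` sends a prompt containing double quotes without manual escaping.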

## 📚 Documentation

- [ARM64 Setup Guide](docs/arm64-setup.md) - Detailed ARM64 setup and optimization
- [AMD64 Setup Guide](docs/amd64-setup.md) - Detailed AMD64 setup and optimization
- [Original Docker Compose Guide](README-docker-compose.md) - Original setup guide

## 🔍 Troubleshooting

### Common Issues

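1. **Build failures**: Confirm that your Docker installation can build for the target platform; building for a non-native architecture needs Buildx with QEMU emulation
2. **Performance issues**: Check how many CPU threads and how much memory the container is given (the model uses ~8GB)
3. **Model loading errors**: Verify that enough disk space (~15GB) and memory (~8GB) are available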

### Debug Commands

```bash
# Check architecture
uname -m

# Check Docker platform
docker version

# Check container logs
docker-compose -f $COMPOSE_FILE logs afm-fast

# Check resource usage
docker stats
```
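
For the thread count and memory checks mentioned under Common Issues, a small host-side helper can complement `docker stats`. The sketch below is Linux-only because it reads `/proc/meminfo`, and its function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; Linux-only because they read /proc/meminfo.

# Number of CPU threads available to this process.
cpu_threads() {
  nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
}

# Value of a /proc/meminfo field (for example MemTotal) in whole GiB, rounded down.
mem_gib() {
  awk -v key="$1" '$1 == (key ":") { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

echo "threads:       $(cpu_threads)"
echo "mem total:     $(mem_gib MemTotal) GiB"
echo "mem available: $(mem_gib MemAvailable) GiB"
```

If the available memory is close to or below the ~8GB listed in the performance table, model loading errors or slow inference may follow.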

## 🤝 Contributing

When contributing to this multi-architecture setup:

1. **Test on both platforms**: Ensure changes work on ARM64 and AMD64
2. **Update documentation**: Keep architecture-specific guides current
3. **Add tests**: Include tests for both architectures
4. **Performance testing**: Benchmark changes on both platforms

## 📄 License

This project is licensed under the same terms as the original repository.

## 🙏 Acknowledgments

- Original AFM-4.5B model by Arcee AI
- llama.cpp for the inference engine