https://github.com/harsh-arora-4114/deploy-llm-application-using-docker
Built a containerized LLM application using FastAPI, Hugging Face Transformers, and Docker for scalable text generation. Developed REST APIs and an interactive web interface, enabling seamless deployment and inference across local and cloud environments.
https://github.com/harsh-arora-4114/deploy-llm-application-using-docker
api-rest docker fastapi huggingface-transformers pytorch
Last synced: 18 days ago
JSON representation
Built a containerized LLM application using FastAPI, Hugging Face Transformers, and Docker for scalable text generation. Developed REST APIs and an interactive web interface, enabling seamless deployment and inference across local and cloud environments.
- Host: GitHub
- URL: https://github.com/harsh-arora-4114/deploy-llm-application-using-docker
- Owner: Harsh-Arora-4114
- License: mit
- Created: 2026-06-06T03:19:28.000Z (24 days ago)
- Default Branch: main
- Last Pushed: 2026-06-06T05:08:12.000Z (24 days ago)
- Last Synced: 2026-06-06T06:13:11.371Z (24 days ago)
- Topics: api-rest, docker, fastapi, huggingface-transformers, pytorch
- Language: Python
- Homepage:
- Size: 260 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Deploying a Large Language Model (LLM) Application Using Docker
## Overview
This project focuses on building and deploying a containerized Large Language Model (LLM) inference application using FastAPI, Hugging Face Transformers, and Docker. The system provides a lightweight and scalable platform for running text-generation models through REST APIs and an interactive web interface.
The application integrates:
* Hugging Face Transformers for text generation
* FastAPI for high-performance API serving
* Docker for containerized deployment
* Interactive HTML/CSS/JavaScript frontend
* Health monitoring and model management APIs
* Streaming and standard text generation endpoints
The application allows users to:
* Generate AI-powered text responses
* Customize generation parameters
* Access model information and health status
* Deploy consistently across environments
* Run locally or on cloud platforms such as Hugging Face Spaces
This is a complete AI Deployment + MLOps project suitable for research, academic work, portfolio building, and industry applications.
---
## Objectives
* Build a production-ready LLM inference server
* Deploy AI models using Docker containers
* Provide REST APIs for model interaction
* Develop an interactive web-based chat interface
* Enable scalable and portable AI deployment
* Integrate cloud-ready deployment workflows
---
## Key Features
### LLM Inference
* Text generation using Hugging Face Transformers
* Configurable generation parameters
* Supports different language models through environment variables
### FastAPI Backend
* High-performance asynchronous API framework
* Health monitoring endpoint
* JSON-based request/response handling
### Interactive Web Interface
* Modern responsive UI
* Real-time prompt submission
* Display generated outputs and performance metrics
### Docker Deployment
* Fully containerized architecture
* Consistent deployment across platforms
* Simplified dependency management
### Streaming Support
* Server-Sent Events (SSE) implementation
* Token-by-token response streaming
### API Testing
* Automated health check validation
* Text generation endpoint testing
---
## Technologies Used
| Category | Tools / Libraries |
| -------------------- | ------------------------- |
| Programming Language | Python |
| Backend Framework | FastAPI |
| Machine Learning | Hugging Face Transformers |
| Deep Learning | PyTorch |
| Containerization | Docker |
| API Validation | Pydantic |
| Frontend | HTML, CSS, JavaScript |
| Deployment | Hugging Face Spaces |
| Testing | Python Test Scripts |
---
## Project Architecture
### 1. User Interface
* User submits prompts through web interface
* Configures generation parameters
* Views generated responses
### 2. API Layer
* FastAPI handles incoming requests
* Validates request payloads
* Routes requests to inference engine
### 3. Model Inference
* Hugging Face model loads during startup
* Generates responses based on prompt inputs
* Supports configurable sampling strategies
### 4. Response Processing
* Measures latency
* Calculates generated token count
* Returns structured JSON responses
### 5. Container Deployment
* Application packaged into Docker container
* Portable deployment across systems
* Supports cloud deployment environments
---
## API Endpoints
### GET /health
Returns:
* Server status
* Model information
* Running device (CPU/GPU)
### POST /generate
Generates text responses.
Example Request:
```json
{
"prompt": "Explain machine learning",
"max_new_tokens": 200,
"temperature": 0.7,
"top_p": 0.9
}
```
### POST /generate/stream
Provides streaming text generation using Server-Sent Events (SSE).
---
## How to Run
### 1. Clone Repository
```bash
git clone https://github.com/yourusername/llm-docker-app.git
cd llm-docker-app
```
### 2. Build Docker Image
```bash
docker build -t llm-app .
```
### 3. Run Container
```bash
docker run -p 7860:7860 llm-app
```
### 4. Open Application
```text
http://localhost:7860
```
---
## Alternative Using Docker Compose
```bash
docker-compose up --build
```
---
## Model Customization
Use any Hugging Face causal language model:
```bash
docker run -p 7860:7860 \
-e MODEL_NAME=gpt2-medium \
llm-app
```
Popular models:
* GPT-2
* GPT-2 Medium
* Facebook OPT-125M
* Facebook OPT-350M
* DialoGPT Small
* EleutherAI Pythia
---
## Sample Outputs
### User Prompt
```text
Explain Docker in simple terms.
```
### Generated Response
```text
Docker is a platform that allows developers to package applications and their dependencies into containers, making deployment consistent across different environments.
```
---
## Future Scope
* Multi-turn conversational memory
* Vector database integration
* Retrieval-Augmented Generation (RAG)
* User authentication system
* GPU acceleration optimization
* Kubernetes deployment support
* Model quantization and optimization
* Monitoring and logging dashboards
---
## Applications
* AI Chatbots
* Enterprise Knowledge Assistants
* Educational AI Systems
* Research Prototypes
* Cloud-based AI Services
* MLOps and AI Deployment Demonstrations
---
## License
This project is licensed under the MIT License.
---
## Author
**Developed by Harsh Arora**