https://github.com/parthapray/llm_dynamic_load_unload
This repo contains code for dynamically loading and unloading LLMs on a local device
- Host: GitHub
- URL: https://github.com/parthapray/llm_dynamic_load_unload
- Owner: ParthaPRay
- License: mit
- Created: 2025-01-06T07:26:00.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-06T16:13:28.000Z (over 1 year ago)
- Last Synced: 2025-02-27T01:16:28.436Z (over 1 year ago)
- Topics: edge, flask, large-language-models, load, ollama, raspberrypi4, unload
- Language: Python
- Homepage:
- Size: 21.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# llm_dynamic_load_unload
This repo contains code for dynamically loading and unloading LLMs on a local edge device.
The script provides a Flask-based system that manages LLM tasks by continuously loading and unloading models on a local edge device (a Raspberry Pi 4B) through Ollama, with a focus on efficient model usage and logging. Setup instructions are followed by a breakdown of its main components.
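1. Install Ollama: https://github.com/ollama/ollama
2. Start a model with `ollama run <model_name>` (the example request further down uses `qwen2.5:0.5b-instruct`).
3. Always run the code inside a virtual environment.
4. Install the dependencies: `pip install -r requirements.txt`
5. Run `llm_basic_scheduling_switch_7.py` in one terminal.
6. Run `test.py` in a second terminal.
7. Check the CSV logs (`llm_metrics.csv`).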
### **Key Features**
1. **Model Management**:
   - Loads and unloads models based on usage.
   - Maintains an active model to reduce loading times.
   - Automatically unloads models idle beyond a configurable `MODEL_TIMEOUT`.
2. **Task Queue**:
   - Tasks (e.g., arithmetic operations) are added to a queue and processed sequentially.
   - Ensures tasks are assigned to appropriate models.
3. **System Resource Monitoring**:
   - Tracks CPU usage, memory usage, and system load averages using `psutil`.
   - Logs these metrics for performance analysis.
4. **Logging**:
   - Logs detailed task metrics, including timestamps, task latencies, resource usage, and model states, into a CSV file (`llm_metrics.csv`).
   - Supports debugging with optional console logs.
5. **REST API**:
   - Provides an endpoint (`/perform_task`) to add tasks via HTTP POST requests.
   - Accepts JSON payloads specifying the task type, model, and prompt.
6. **Concurrency**:
   - Processes tasks and manages models concurrently using threads and thread-safe mechanisms like `Lock`.
7. **Error Handling**:
   - Handles failed model loading or task execution gracefully.
   - Logs failures and returns appropriate error messages to clients.
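
The idle-timeout behaviour described under Model Management can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `ModelManager` class, its method names, and the injectable `load_fn`, `unload_fn`, and `clock` parameters are all hypothetical, chosen so the logic can be exercised without a running Ollama server.

```python
import threading
import time
from typing import Callable, Dict, List


class ModelManager:
    """Hypothetical helper: tracks loaded models and unloads idle ones."""

    def __init__(
        self,
        load_fn: Callable[[str], None],
        unload_fn: Callable[[str], None],
        timeout_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load_fn = load_fn        # assumed to talk to the model backend
        self._unload_fn = unload_fn    # assumed to free the model's memory
        self._timeout_s = timeout_s    # plays the role of MODEL_TIMEOUT
        self._clock = clock
        self._last_used: Dict[str, float] = {}
        self._lock = threading.Lock()  # guards _last_used across threads

    def ensure_loaded(self, name: str) -> bool:
        """Load the model if needed; return True only when a load happened."""
        with self._lock:
            needs_load = name not in self._last_used
            if needs_load:
                self._load_fn(name)
            self._last_used[name] = self._clock()  # refresh the idle timer
            return needs_load

    def unload_idle(self) -> List[str]:
        """Unload every model idle for longer than the timeout."""
        with self._lock:
            now = self._clock()
            idle = [
                name
                for name, used_at in self._last_used.items()
                if now - used_at > self._timeout_s
            ]
            for name in idle:
                self._unload_fn(name)
                del self._last_used[name]
            return idle
```

Passing the clock in as a parameter is a design choice made for this sketch: it lets the timeout be tested with a fake clock instead of real waiting. In a real deployment `load_fn` and `unload_fn` would wrap whatever calls the script makes to Ollama.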
---
### **Important Components**
1. **Constants**:
   - `BASE_URL`: Base URL for model interactions.
   - `MODEL_TIMEOUT`: Duration (in seconds) after which idle models are unloaded.
   - `DEBUG`: Enables debug logs for troubleshooting.
   - `LOG_FILE`: Path to the CSV log file.
2. **Key Functions**:
   - `debug_log`: Logs messages if debugging is enabled.
   - `log_to_csv`: Writes task and system performance metrics to a CSV file.
   - `monitor_resources`: Captures system resource usage metrics.
   - `load_model`: Loads a specified model and tracks its state.
   - `unload_idle_models`: Unloads models that haven't been used recently.
   - `process_tasks`: Main loop to process queued tasks and log their performance.
3. **REST Endpoint**:
   - `/perform_task`: Accepts task details and adds them to the queue.
4. **Background Thread**:
   - A daemon thread (`processor_thread`) runs the `process_tasks` function continuously.
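
How a queue, a daemon worker thread, and CSV logging fit together can be sketched as below. The function names `process_tasks` and `log_to_csv` are borrowed from the list above, but their signatures, the `LOG_FIELDS` columns, the `stop_event` parameter, and the `run_demo` helper are assumptions made for illustration. The actual script may log more columns and structure its loop differently.

```python
import csv
import io
import queue
import threading
import time

# Assumed column set; the real llm_metrics.csv likely records more metrics.
LOG_FIELDS = ["timestamp", "task_type", "model_name", "latency_s", "status"]


def log_to_csv(stream, row):
    """Append one metrics row to an already open CSV stream."""
    csv.DictWriter(stream, fieldnames=LOG_FIELDS).writerow(row)


def process_tasks(task_queue, run_task, stream, stop_event):
    """Take tasks off the queue one at a time, run them, log the outcome."""
    while not stop_event.is_set():
        try:
            task = task_queue.get(timeout=0.1)
        except queue.Empty:
            continue  # nothing queued yet; check the stop flag again
        start = time.perf_counter()
        try:
            run_task(task)
            status = "ok"
        except Exception as exc:  # a failing task must not kill the worker
            status = f"error: {exc}"
        log_to_csv(stream, {
            "timestamp": time.time(),
            "task_type": task["task_type"],
            "model_name": task["model_name"],
            "latency_s": round(time.perf_counter() - start, 6),
            "status": status,
        })
        task_queue.task_done()


def run_demo():
    """Run two fake tasks through a daemon worker and return the log rows."""
    tasks = queue.Queue()
    log = io.StringIO()  # stands in for the CSV file on disk
    stop = threading.Event()

    def fake_run(task):
        if not task["prompt"]:
            raise ValueError("empty prompt")

    worker = threading.Thread(
        target=process_tasks, args=(tasks, fake_run, log, stop), daemon=True
    )
    worker.start()
    base = {"task_type": "arithmetic", "model_name": "qwen2.5:0.5b-instruct"}
    tasks.put({**base, "prompt": "What is 2+2?"})
    tasks.put({**base, "prompt": ""})
    tasks.join()  # returns once both tasks have been logged
    stop.set()
    worker.join(timeout=1)
    return list(csv.DictReader(io.StringIO(log.getvalue()), fieldnames=LOG_FIELDS))
```

Calling `run_demo()` returns two rows: one with status `ok` and one recording the error raised for the empty prompt, which shows that a failed task is logged without stopping the worker.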
---
### **Example Use Case**
1. **Add a Task**:
   A client sends a POST request to `/perform_task`, for example:
   `curl -X POST http://localhost:5000/perform_task -H "Content-Type: application/json" -d '{"task_type": "arithmetic", "model_name": "qwen2.5:0.5b-instruct", "prompt": "What is 2+2?"}'`
2. **Process Task**:
   - The task is added to the queue.
   - The system ensures the model is loaded, executes the task, and logs the result.
3. **Resource Monitoring**:
   - During task execution, CPU, memory, and load averages are monitored.
4. **Unload Idle Models**:
   - Models unused for longer than `MODEL_TIMEOUT` seconds are unloaded to conserve resources.
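
The same request can be built from Python with only the standard library. This is a sketch: the `build_task_request` helper is hypothetical, and the host and port are taken from the `curl` example above. The shape of the server's response is not documented here, so the sending step is left commented out and simply prints whatever comes back.

```python
import json
import urllib.request


def build_task_request(task_type, model_name, prompt,
                       base_url="http://localhost:5000"):
    """Build (but do not send) a POST request for the /perform_task endpoint."""
    payload = json.dumps({
        "task_type": task_type,
        "model_name": model_name,
        "prompt": prompt,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/perform_task",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_task_request(
    "arithmetic", "qwen2.5:0.5b-instruct", "What is 2+2?"
)

# To send it, the Flask server must already be running:
# with urllib.request.urlopen(request, timeout=120) as response:
#     print(response.read().decode("utf-8"))
```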
This system suits resource-constrained deployments that need to serve several models for different tasks while monitoring and logging system performance.