https://github.com/idsia/mlprod
Machine Learning in Production
distributed-systems machine-learning mlops
- Host: GitHub
- URL: https://github.com/idsia/mlprod
- Owner: IDSIA
- Created: 2023-02-09T10:38:52.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-02-10T08:59:28.000Z (over 3 years ago)
- Last Synced: 2025-04-05T09:23:14.635Z (about 1 year ago)
- Topics: distributed-systems, machine-learning, mlops
- Language: Jupyter Notebook
- Homepage: https://machine-learning-in-production.readthedocs.io/en/latest/index.html
- Size: 1.06 MB
- Stars: 5
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Machine Learning in Production
## Run the Docker containers
### Setup
In order to work as intended, the docker-compose stack requires some setup:
* A docker network named `www`. Use the following command to create it:
```
docker network create www
```
* A [Traefik](https://doc.traefik.io/traefik/) service running on the `www` network.
_Traefik_ routes requests for web subdomains to services running in Docker containers. We use it only for this purpose, although it can also perform other tasks.
To create this service, see the file `extra/docker-compose.traefik.yaml`.
* A `.env` file, which must be created first. This file is not included in the repository because it is server-dependent.
Its content is the following:
```
DOMAIN=
CELERY_BROKER_URL=pyamqp://rabbitmq/
CELERY_BACKEND_URL=redis://redis/
CELERY_QUEUE=
DATABASE_SCHEMA=mlpdb
DATABASE_USER=mlp
DATABASE_PASS=mlp
DATABASE_HOST=database
DATABASE_URL=postgresql://${DATABASE_USER}:${DATABASE_PASS}@${DATABASE_HOST}/${DATABASE_SCHEMA}
GRAFANA_ADMIN_PASS=grafana
```
Remember that these passwords are stored in plain text, unencrypted. This is **not** a safe solution.
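The `DATABASE_URL` line is composed from the four `DATABASE_*` values above it. As a minimal sketch, assuming the `${...}` references are expanded when the file is loaded, the resulting connection string can be reproduced with Python's `string.Template`. The dictionary below simply repeats the example values and is not read from a real `.env` file:

```python
from string import Template

# Values copied from the example .env content above (not loaded from disk).
env = {
    "DATABASE_SCHEMA": "mlpdb",
    "DATABASE_USER": "mlp",
    "DATABASE_PASS": "mlp",
    "DATABASE_HOST": "database",
}

# Same pattern as the DATABASE_URL line; Template expands ${NAME} placeholders.
url_template = Template(
    "postgresql://${DATABASE_USER}:${DATABASE_PASS}@${DATABASE_HOST}/${DATABASE_SCHEMA}"
)

database_url = url_template.substitute(env)
print(database_url)  # postgresql://mlp:mlp@database/mlpdb
```

Here `database` is the host name of the database service as seen from inside the Docker network, so this URL is not expected to resolve from the host machine.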
### Execute the Docker stack
Then, to launch the stack with Docker Compose, execute the following command from the root directory of this repository:
```
docker-compose up -d
```
## Generate data
This proof-of-concept software uses synthetic data generated by sampling from some distributions. To generate the data, run the following command; it will populate the `/dataset` folder with TSV (Tab Separated Value) files.
```
python dataset_generator.py
```
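The idea of sampling distributions and writing the result as TSV can be illustrated with a minimal sketch. It is not the content of `dataset_generator.py`: the column names (`id`, `age`, `clicks`), the chosen distributions, and the helper names `generate_rows` and `write_tsv` are all hypothetical and only show the general pattern.

```python
import csv
import random
import tempfile
from pathlib import Path


def generate_rows(n_rows, seed=0):
    """Yield synthetic rows by sampling two simple distributions (hypothetical columns)."""
    rng = random.Random(seed)
    for i in range(n_rows):
        age = rng.gauss(40.0, 10.0)   # hypothetical normally distributed feature
        clicks = rng.randint(0, 20)   # hypothetical uniform integer feature
        yield {"id": i, "age": round(age, 2), "clicks": clicks}


def write_tsv(path, rows):
    """Write rows to a tab-separated file with a header line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "age", "clicks"], delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


# Demo: write five rows into a temporary folder and read them back.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "dataset" / "sample.tsv"
    write_tsv(out, generate_rows(5))
    lines = out.read_text().splitlines()

print(lines[0])  # header line: id, age, clicks separated by tabs
```

Using a seeded `random.Random` instance keeps the generated data reproducible between runs, which is convenient when comparing model behavior over time.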
## Generate traffic
To simulate the use of the application by external users, use the script `traffic_generator.py`.
The basic command, which runs with default parameters, is:
```
python traffic_generator.py
```
Some parameters can be used to control the behavior of the users:
* `--config` is the path to a configuration file. A configuration file is a `.tsv` (Tab Separated Value) file that contains all the parameters for the `UserData` and `UserLabeller` behavior. See the files `config/user.tsv` and `config/user_noise.tsv` for some examples.
* `-p` is the number of parallel threads to run. Each thread contacts the application independently.
* `-d` is the probability that the user sends a response. If set to 1.0, there is always a response. If set to 0.0, the user never sends a response.
* `-tmin` and `-tmax` control the waiting time, expressed in seconds. For less than a second use decimals (e.g. 100 ms is written as 0.1).
`-tmin` is the minimum amount of time to wait after a request to the application.
`-tmax` is the maximum amount of time to wait after a request to the application. The wait is randomly chosen between the `-tmin` and `-tmax` values. Higher values mean slower generation of new data. The larger the difference between these two parameters, the higher the variance of the waiting time.
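How these parameters interact can be sketched as follows. This is not the code of `traffic_generator.py`: the default values, the helper names (`build_parser`, `next_wait`, `sends_response`), and the choice of a uniform distribution for the wait are assumptions made for illustration only.

```python
import argparse
import random


def build_parser():
    """Build a parser with the flags described above (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Sketch of a traffic generator CLI")
    parser.add_argument("--config", default=None, help="path to a .tsv configuration file")
    parser.add_argument("-p", type=int, default=1, help="number of parallel threads")
    parser.add_argument("-d", type=float, default=1.0, help="probability of sending a response")
    parser.add_argument("-tmin", type=float, default=0.1, help="minimum wait in seconds")
    parser.add_argument("-tmax", type=float, default=1.0, help="maximum wait in seconds")
    return parser


def next_wait(rng, tmin, tmax):
    """Draw a waiting time between tmin and tmax seconds (uniform is an assumption)."""
    return rng.uniform(tmin, tmax)


def sends_response(rng, probability):
    """Return True with the given probability: always for 1.0, never for 0.0."""
    return rng.random() < probability


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-p", "4", "-d", "0.5", "-tmin", "0.1", "-tmax", "2.0"])
rng = random.Random(42)
waits = [next_wait(rng, args.tmin, args.tmax) for _ in range(1000)]
print(min(waits) >= args.tmin, max(waits) <= args.tmax)
```

Under the uniform assumption, the variance of the wait is `(tmax - tmin) ** 2 / 12`, which is why a wider gap between the two parameters produces more irregular traffic.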
## Development
To develop this application, a [Python virtual environment](https://docs.python.org/3/tutorial/venv.html) is highly recommended. If a development machine with Docker is not available, it is possible to use the three `requirements` files to create a fully working environment:
* `requirements.api.txt` contains all the packages for the API service,
* `requirements.worker.txt` contains all the packages for the Celery worker service,
* `requirements.txt` contains extra packages and utilities required by scripts or for the development.
To create a virtual environment using Python's `venv` module, use the following command:
```
python -m venv MLPenv
```
Then remember to **activate** the environment before launching the scripts:
```
source ./MLPenv/bin/activate
```
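Forgetting to activate the environment is easy to miss. As a small sketch, a script could check at start-up whether it runs inside a virtual environment by comparing `sys.prefix` with `sys.base_prefix`. The helper name `in_virtualenv` is hypothetical and is not part of this repository:

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside an environment created by venv."""
    # Inside a venv, sys.prefix points to the environment while
    # sys.base_prefix still points to the base Python installation.
    return sys.prefix != sys.base_prefix


if not in_virtualenv():
    print("Warning: no virtual environment is active", file=sys.stderr)
```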
## References
### FastAPI and database interaction
* [SQL (Relational) Databases](https://fastapi.tiangolo.com/tutorial/sql-databases/)
* [Python ML in Production - Part 1: FastAPI + Celery with Docker](https://denisbrogg.hashnode.dev/python-ml-in-production-part-1-fastapi-celery-with-docker)
* [First Steps with Celery](https://docs.celeryq.dev/en/stable/getting-started/first-steps-with-celery.html)
* [Next Steps](https://docs.celeryq.dev/en/stable/getting-started/next-steps.html)
* [Serving ML Models in Production with FastAPI and Celery](https://towardsdatascience.com/deploying-ml-models-in-production-with-fastapi-and-celery-7063e539a5db)
* [Multi-stage builds #2: Python specifics](https://pythonspeed.com/articles/multi-stage-docker-python/#solution2-virtualenv)
* [SQLAlchemy ORM — a more “Pythonic” way of interacting with your database](https://medium.com/dataexplorations/sqlalchemy-orm-a-more-pythonic-way-of-interacting-with-your-database-935b57fd2d4d)
* [Events: startup - shutdown](https://fastapi.tiangolo.com/advanced/events/)
### Metrics with Prometheus
* [Overview | Prometheus](https://prometheus.io/docs/introduction/overview/)
* [Instrumentation | Prometheus](https://prometheus.io/docs/practices/instrumentation/#counter-vs-gauge-summary-vs-histogram)
* [prometheus/client_python | GitHub](https://github.com/prometheus/client_python)
* [kozhushman/prometheusrock | GitHub](https://github.com/kozhushman/prometheusrock)
### Grafana
* [Provision Grafana](https://grafana.com/docs/grafana/latest/administration/provisioning/)
* [Data Source on Startup](https://community.grafana.com/t/data-source-on-startup/8618/2)
* [Authentication API](https://grafana.com/docs/grafana/latest/developers/http_api/auth/)
## Disclaimer
This software was built as a proof of concept and as support material for the course _Machine Learning in Production_.
It is not intended to be used in a real production system, although some current best practices have been followed in its implementation.