## Machine Learning Engineering

This repository contains the code and documentation for developing and deploying machine learning models while adhering to engineering best practices.

## Environment Setup

### Virtual Environment

- Navigate to the project directory:

```bash
cd <path-to-project>/ml-engineering
```

- Create and activate the conda environment:

```bash
conda env create --file deploy/conda/linux_py312.yml
conda activate mle
```

- Manage dependencies:
  - Install additional dependencies using conda or pip as needed.
  - Update the environment file: `conda env export --name mle > deploy/conda/linux_py312.yml`
  - Deactivate the environment: `conda deactivate`
  - Remove the environment (if necessary): `conda remove --name mle --all`

## Development Workflow

### Research & Development

- Reference code: `/ml-engineering/reference/nonstandardcode`
- Working notebooks: `/ml-engineering/notebooks/working`

### Script Development

Scripts are derived from working notebooks in `/ml-engineering/notebooks/working`.

### Setting PYTHONPATH

Ensure the directory containing `housing_value` is in PYTHONPATH:

```bash
conda env config vars set PYTHONPATH=$(pwd)/src
conda deactivate
conda activate mle
echo $PYTHONPATH
```
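
**Note:** Variables set with `conda env config vars set` only take effect when the environment is activated, which is why it is deactivated and reactivated above.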

### Integrated Features in Scripts

- Argument Parsing: Uses `argparse` for command-line arguments.
- Configuration Management: Implements `configparser` with `setup.cfg`.
- Logging: Incorporates `logging` for execution tracking and debugging.
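
For orientation, here is a minimal sketch of how these three pieces can fit together. The option names and the `[paths]` config section are illustrative assumptions, not the repository's exact code:

```python
# Illustrative sketch only: option names and the [paths] section are assumptions.
import argparse
import configparser
import logging


def main() -> None:
    # Configuration management: read defaults from setup.cfg
    config = configparser.ConfigParser()
    config.read("setup.cfg")
    default_output = config.get("paths", "output_dir", fallback="data/processed")

    # Argument parsing: command-line flags override config-file defaults
    parser = argparse.ArgumentParser(description="Ingest housing data.")
    parser.add_argument("--output-dir", default=default_output)
    parser.add_argument("--log-level", default="INFO")
    args = parser.parse_args()

    # Logging: execution tracking and debugging
    logging.basicConfig(level=getattr(logging, args.log_level.upper()))
    logging.getLogger(__name__).info("Writing ingested data to %s", args.output_dir)


if __name__ == "__main__":
    main()
```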

### Code Quality Tools

Install required tools:

```bash
sudo apt install black isort flake8
```

| Tool   | Description    | Usage           |
|--------|----------------|-----------------|
| Black  | Code formatter | `black <path>`  |
| isort  | Import sorter  | `isort <path>`  |
| flake8 | Linter         | `flake8 <path>` |

**Note:** Configurations are specified in `setup.cfg` and `.vscode/settings.json` (for VS Code users).

### Maintaining Code Quality

```bash
chmod +x shell/src_quality.sh
./shell/src_quality.sh
```

### Script Execution

View available options for each script using the `--help` flag:

```bash
python src/housing_value/ingest_data.py --help
python src/housing_value/train.py --help
python src/housing_value/score.py --help
```

## Testing

Install pytest:

```bash
sudo apt install python3-pytest
```

**Note:** Configurations are specified in `setup.cfg`.
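
For reference, a pytest section in `setup.cfg` takes the following shape (the options shown are illustrative; the repository's file is authoritative):

```ini
; Hypothetical example; consult the repository's setup.cfg for actual values.
[tool:pytest]
testpaths = tests
addopts = -v
```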

Maintain test code quality:

```bash
chmod +x shell/tests_quality.sh
./shell/tests_quality.sh
```

Run the full test suite, or point pytest at a specific test file:

```bash
pytest
pytest <path-to-test-file>
```
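
A minimal test might look like the following (file and function names are illustrative, not the repository's actual tests):

```python
# tests/test_split_size.py -- hypothetical example test
def test_default_split_size_is_a_valid_fraction():
    split_size = 0.2  # default train/test split used by the pipeline
    assert 0.0 < split_size < 1.0
```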

## Documentation

This project uses Sphinx to generate its documentation.

### Prerequisites

1. Install the package:
- Option 1: Editable mode (configured via `pyproject.toml`); produces an `egg-info` folder.

```bash
pip install -e .
```

- Option 2: Build and install; produces an `egg-info` folder as well as a `dist` folder containing the `.tar.gz` and `.whl` files.

```bash
python3 -m pip install --upgrade build
python3 -m build
pip install dist/housing_value-0.0.0-py3-none-any.whl
```

2. Install Sphinx and the packages needed to build the documentation:

```bash
sudo apt install python3-sphinx
pip install sphinx sphinx-rtd-theme matplotlib
pip install sphinxcontrib-napoleon
```

### Generating Documentation

1. Navigate to the docs directory:

```bash
cd docs
```

2. Check configuration files:
- Make sure a `Makefile` is created (answer yes when `sphinx-quickstart` prompts for one in the next step).

3. Generate Sphinx project:

```bash
sphinx-quickstart
```

4. Update configuration files:
- Modify `source/conf.py` and `source/index.rst` as needed.
- Reference files are available in the `reference` directory.
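
Typical additions to `source/conf.py` for this toolchain look like the following (the path and theme are assumptions based on the project layout and the packages installed above; prefer the reference files):

```python
# Hypothetical conf.py additions; adapt to the files in the reference directory.
import os
import sys

sys.path.insert(0, os.path.abspath("../../src"))  # make housing_value importable

extensions = [
    "sphinx.ext.autodoc",   # pull API docs from docstrings
    "sphinx.ext.napoleon",  # parse Google/NumPy style docstrings
]
html_theme = "sphinx_rtd_theme"
```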

5. Generate API documentation:

```bash
sphinx-apidoc -o ./source ../src/housing_value
```

6. Update configuration files:
- Modify `source/housing_value.rst` and `source/index.rst` as needed.
- Reference files are available in the `reference` directory.

7. Build HTML documentation:

```bash
make clean
make html
```

8. Return to the project root:

```bash
cd ..
```

**Note:** The documentation file hierarchy in the `source` directory is: `index.rst > modules.rst > housing_value.rst`.
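
Concretely, this means the `toctree` in `index.rst` references `modules`, which in turn lists the `housing_value` module documentation. A minimal sketch of the `index.rst` entry:

```rst
.. toctree::
   :maxdepth: 2

   modules
```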

## Application Packaging with MLflow

**Note:** The file hierarchy for MLflow is structured as follows: `MLproject > app.py`.
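
For orientation, an `MLproject` file consistent with the `split_size` parameter described below might look like this sketch (the entry-point command and flag name are assumptions, not the repository's exact file):

```yaml
# Hypothetical MLproject sketch; the repository's file is authoritative.
name: housing_value

conda_env: deploy/conda/linux_py312.yml

entry_points:
  main:
    parameters:
      split_size: {type: float, default: 0.2}
    command: "python app.py --split-size {split_size}"
```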

1. **Maintaining Code Quality**

```bash
chmod +x shell/app_quality.sh
./shell/app_quality.sh
```

2. **Tracking UI**: Launch the MLflow tracking server with the following command.

```bash
mlflow server --backend-store-uri mlruns/ --default-artifact-root mlruns/ --host 127.0.0.1 --port 5000
```

3. **Run Experiment**: Execute an experiment to generate a model artifact with the following command.

```bash
mlflow run . -P split_size=<split-size>
```

The optional parameter `split_size` defaults to `0.2`.
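
For example, to use a 30% test split instead of the default:

```bash
mlflow run . -P split_size=0.3
```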

4. **Python Version Management**: Install `pyenv` to manage Python versions and ensure reproducibility by pinning a specific Python version for the project.

```bash
chmod +x shell/pyenv.sh
./shell/pyenv.sh
```

5. **Activate Conda Environment**: Activate the conda environment created during the experiment execution.

6. **Dependency Installation**: Install the required dependency in the activated environment.

```bash
pip install virtualenv
```

7. **API Endpoint Generation**: Create an API endpoint to serve the model:

```bash
mlflow models serve -m mlruns/<experiment-id>/<run-id>/artifacts/model/ -h 127.0.0.1 -p 1234
```

8. **Testing API Endpoint**: Test the API endpoint from another terminal using either of the following request formats.

- **Datasplit Format**:

```bash
curl -X POST -H "Content-Type: application/json" --data '{"dataframe_split": {"columns": ["longitude", "latitude", "housing_median_age", "total_rooms", "total_bedrooms", "population", "households", "median_income", "ocean_proximity"], "data": [[-118.39, 34.12, 29.0, 6447.0, 1012.0, 2184.0, 960.0, 8.2816, "<1H OCEAN"]]}}' http://127.0.0.1:1234/invocations
```

- **Inputs/Instances Format**:

```bash
curl -X POST -H "Content-Type: application/json" --data '{"inputs": [{"longitude": -118.39, "latitude": 34.12, "housing_median_age": 29.0, "total_rooms": 6447.0, "total_bedrooms": 1012.0, "population": 2184.0, "households": 960.0, "median_income": 8.2816, "ocean_proximity": "<1H OCEAN"}]}' http://127.0.0.1:1234/invocations
```
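
Either request returns the model's prediction as JSON; with recent MLflow versions the response takes the form `{"predictions": [<predicted-value>]}`.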

### Deployment Readiness

To facilitate deployment, Docker images are created by aggregating necessary artifacts and configurations.

1. **Artifact Aggregation:**

- Copy model artifacts (`MLmodel` and `model.pkl`) from `mlruns/<experiment-id>/<run-id>/artifacts/model` to `/ml-engineering/deploy/docker/mlruns`. Ensure unnecessary metadata is cleaned from the `MLmodel` file.

- Transfer the `requirements.txt` file from `mlruns/<experiment-id>/<run-id>/artifacts/model` to `/ml-engineering/deploy/docker`.

- Move the wheel file (`housing_value-0.0.0-py3-none-any.whl`) from the dist directory to `/ml-engineering/deploy/docker`.

- Copy the `setup.cfg` from the project root to `/ml-engineering/deploy/docker`, ensuring it contains only data required for inference.

2. **Script and Configuration Creation:**

- Develop a `run.sh` script that executes the `mlflow models serve` command (see the sketch after this list).

- Create a `.dockerignore` file to exclude unneeded files from being copied into the image's WORKDIR.

- Construct a Dockerfile to package all components into a Docker image for efficient deployment and scaling.
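
After both steps, `deploy/docker` should contain roughly the following layout (inferred from the steps above; the Dockerfile variants are built in step 3):

```
deploy/docker
├── .dockerignore
├── Dockerfile.rootuser        # plus the nonrootuser and multistage variants
├── housing_value-0.0.0-py3-none-any.whl
├── mlruns
│   ├── MLmodel
│   └── model.pkl
├── requirements.txt
├── run.sh
└── setup.cfg
```

A minimal `run.sh` might look like the sketch below. Binding to `0.0.0.0` and serving on port 5000 match the `docker run -p 8080:5000` test in the Container Management section; the `--env-manager local` flag is an assumption that the image already has all dependencies installed.

```bash
#!/bin/sh
# Hypothetical run.sh sketch: serve the model copied into mlruns/ on port 5000.
mlflow models serve -m mlruns -h 0.0.0.0 -p 5000 --env-manager local
```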

3. **Image Development:**

```bash
cd deploy/docker
```

- **Build With Root User:**

```bash
docker build . -t <username>/mle:rootuser -f Dockerfile.rootuser
```

- **Build Without Root User for Security:** Enhance security by building an image that does not use the root user.

```bash
docker build . -t <username>/mle:nonrootuser -f Dockerfile.nonrootuser
```

- **Use Buildkit for Multistage Builds:** Optimize your image size and build time using Docker Buildkit for multistage builds.

```bash
DOCKER_BUILDKIT=1 docker build . -t <username>/mle:multistage -f Dockerfile.multistage
```

## Container Management

This section provides detailed instructions for containerizing your application using Docker and testing endpoints.

### Starting and Testing a Container

1. **Start the Container:** Use the following command to start a Docker container named `rootuser` and map port 8080 on your host to port 5000 in the container.

```bash
docker run -dit -p 8080:5000 --name rootuser <username>/mle:rootuser
```

2. **Test the Endpoint:** Verify that your application is running correctly by sending a POST request to the endpoint using curl.

```bash
curl -X POST -H "Content-Type: application/json" --data '{"dataframe_split": {"columns": ["longitude", "latitude", "housing_median_age", "total_rooms", "total_bedrooms", "population", "households", "median_income", "ocean_proximity"], "data": [[-118.39, 34.12, 29.0, 6447.0, 1012.0, 2184.0, 960.0, 8.2816, "<1H OCEAN"]]}}' http://127.0.0.1:8080/invocations
```

### Managing Docker Images

1. **Push Image to Docker Hub:** First, log in to Docker Hub and then push images.

```bash
docker login -u <username>
docker push <username>/mle:rootuser
docker push <username>/mle:nonrootuser
docker push <username>/mle:multistage
```

2. **List Images and Containers:** To view all Docker images and containers on the system:

- **Images:**

```bash
docker image ls
```

- **Containers:**

```bash
docker ps --all
```

3. **View Logs:** Access the logs of a running container.

```bash
docker logs <container-name>
```

4. **Delete Containers and Images:** Remove a specific container or image using these commands:

- **Containers:**

```bash
docker rm -f <container-name>
```

- **Images:**

```bash
docker rmi <image-name-or-id>
```

### Retesting in a New Environment

To test your application in a new environment:

1. **Pull Image from Docker Hub:**

```bash
docker pull <username>/mle:rootuser
```

2. **Start the Container Again:**

```bash
docker run -dit -p 8080:5000 --name rootuser <username>/mle:rootuser
```

3. **Re-test the Endpoint:** Use the same curl command as before to verify functionality.