# genre_classification
The primary objective of this project is to develop a machine learning pipeline that accurately classifies songs by genre.

## Table of contents

- [Introduction](#genre_classification)
- [Preliminary steps](#preliminary-steps)
  * [Clone repository](#clone-repository)
  * [Create environment](#create-environment)
  * [Get API key for Weights and Biases](#get-api-key-for-weights-and-biases)
  * [The configuration](#the-configuration)
  * [Running the entire pipeline or just a selection of steps](#running-the-entire-pipeline-or-just-a-selection-of-steps)
- [License](#license)

## Preliminary steps
### Clone repository

Clone the repository locally so you can start working on it:

```bash
git clone https://github.com/eljandoubi/genre_classification.git
```

and go into the repository:

```bash
cd genre_classification
```

### Create environment
Make sure to have conda installed and ready, then create a new environment using the ``environment.yml``
file provided in the root of the repository and activate it:

```bash
> conda env create -f environment.yml
> conda activate genre_classification
```

### Get API key for Weights and Biases
Let's make sure we are logged in to Weights & Biases. Get your API key from W&B by going to
[https://wandb.ai/authorize](https://wandb.ai/authorize) and clicking on the + icon (copy to clipboard),
then paste your key into this command:

```bash
> wandb login [your API key]
```

You should see a message similar to:
```
wandb: Appending key for api.wandb.ai to your netrc file: /home/[your username]/.netrc
```
### The configuration
The parameters controlling the pipeline are defined in the ``config.yaml`` file in the root of the
repository. We use Hydra to manage this configuration file.
Open this file and get familiar with its content. Remember: this file is read only by the ``main.py`` script
(i.e., the pipeline), and its content is
available within the ``go`` function in ``main.py`` as the ``config`` dictionary. For example,
the name of the project is stored in the ``project_name`` key under the ``main`` section of
the configuration file, and can be accessed from the ``go`` function as
``config["main"]["project_name"]``.

NOTE: do NOT hardcode any parameter when writing the pipeline. All the parameters should be
accessed from the configuration file.
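
As an illustration, here is a minimal sketch of how ``main.py`` can read this configuration with Hydra and access the project name. The exact decorator arguments are an assumption for the sketch, not code taken from this repository:

```python
import hydra
from omegaconf import DictConfig


# Assumed Hydra entry point; the config file name/path here are illustrative.
@hydra.main(config_name="config")
def go(config: DictConfig):
    # All parameters come from config.yaml; nothing is hardcoded.
    project_name = config["main"]["project_name"]
    print(f"Running pipeline for project: {project_name}")


if __name__ == "__main__":
    go()
```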

### Running the entire pipeline or just a selection of steps
To run the pipeline while developing, make sure you are in the root of the repository,
then execute:

```bash
> mlflow run .
```
This will run the entire pipeline.

When developing, it is useful to be able to run one step at a time. Say you want to run only
the ``download`` step. ``main.py`` is written so that the steps are defined at the top of the file, in the
``_steps`` list, and can be selected with the `steps` parameter on the command line:

```bash
> mlflow run . -P steps=download
```
If you want to run the ``download`` and the ``preprocess`` steps, you can similarly do:
```bash
> mlflow run . -P steps=download,preprocess
```
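
For context, here is a hedged sketch of how the step selection inside ``main.py`` might look. The ``main.steps`` key, the default value ``"all"``, and launching each step with ``mlflow.run`` are assumptions about the repository layout, not verbatim code:

```python
import mlflow

# Steps defined at the top of main.py, in pipeline order (illustrative subset).
_steps = ["download", "preprocess"]


def run_selected_steps(config):
    # "all" (assumed default) runs every step; a comma-separated value passed
    # via `-P steps=...` on the command line selects a subset instead.
    steps_par = config["main"]["steps"]
    active_steps = steps_par.split(",") if steps_par != "all" else _steps

    for step in active_steps:
        # Each step is assumed to live in its own MLflow project subdirectory,
        # so it can be launched as a sub-run (step-specific parameters omitted).
        mlflow.run(uri=step, entry_point="main")
```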
You can override any other parameter in the configuration file using the Hydra syntax, by
providing it as a ``hydra_options`` parameter. For example, say you want to set
``random_forest_pipeline.random_forest.n_estimators`` (the ``n_estimators`` key nested under
``random_forest`` in the ``random_forest_pipeline`` section of ``config.yaml``) to 10:

```bash
> mlflow run . \
-P hydra_options="random_forest_pipeline.random_forest.n_estimators=10"
```

To run hyperparameter optimization in parallel, using Hydra's multirun mode (``-m``) with the joblib launcher, execute the following:

```bash
> mlflow run . \
-P hydra_options="-m random_forest_pipeline.random_forest.max_depth=range(10,50,3) random_forest_pipeline.tfidf.max_features=range(50,200,50) hydra/launcher=joblib"
```
## License
Distributed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt). See ``LICENSE`` for more information.