Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.

https://github.com/eljandoubi/genre_classification
Create an ML pipeline for Genre Classification using MLflow.
- Host: GitHub
- URL: https://github.com/eljandoubi/genre_classification
- Owner: eljandoubi
- License: apache-2.0
- Created: 2023-08-29T14:07:15.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-09-03T11:34:15.000Z (over 1 year ago)
- Last Synced: 2024-04-18T15:09:08.979Z (9 months ago)
- Topics: hydra, machine-learning, mlflow, numpy, pandas, pandas-profiling, pytest, scikit-learn, scipy, wandb
- Language: Python
- Homepage:
- Size: 57.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# genre_classification
The primary objective of this project is to develop a machine learning pipeline capable of accurately classifying the genre of songs.

## Table of contents
- [Introduction](#genre_classification)
- [Preliminary steps](#preliminary-steps)
  * [Clone repository](#clone-repository)
  * [Create environment](#create-environment)
  * [Get API key for Weights and Biases](#get-api-key-for-weights-and-biases)
  * [The configuration](#the-configuration)
  * [Running the entire pipeline or just a selection of steps](#running-the-entire-pipeline-or-just-a-selection-of-steps)
- [License](#license)
## Preliminary steps
### Clone repository
Clone the repository locally so you can start working on it:
```
git clone https://github.com/eljandoubi/genre_classification.git
```
and go into the repository:
```
cd genre_classification
```

### Create environment
Make sure to have conda installed and ready, then create a new environment using the ``environment.yml``
file provided in the root of the repository and activate it:

```bash
> conda env create -f environment.yml
> conda activate genre_classification
```

### Get API key for Weights and Biases
Let's make sure we are logged in to Weights & Biases. Get your API key from W&B by going to
[https://wandb.ai/authorize](https://wandb.ai/authorize) and clicking on the + icon (copy to clipboard),
then paste your key into this command:

```bash
> wandb login [your API key]
```

You should see a message similar to:
```
wandb: Appending key for api.wandb.ai to your netrc file: /home/[your username]/.netrc
```
### The configuration
As usual, the parameters controlling the pipeline are defined in the ``config.yaml`` file located in
the root of the repository. We will use Hydra to manage this configuration file.
Open this file and get familiar with its content. Remember: this file is only read by the ``main.py`` script
(i.e., the pipeline), and its content is
available to the ``go`` function in ``main.py`` as the ``config`` dictionary. For example,
the name of the project is contained in the ``project_name`` key under the ``main`` section of
the configuration file. It can be accessed from the ``go`` function as
``config["main"]["project_name"]``.

NOTE: do NOT hardcode any parameter when writing the pipeline. All the parameters should be
accessed from the configuration file.

### Running the entire pipeline or just a selection of steps
In order to run the pipeline when you are developing, you need to be in the root of the repository,
then you can execute as usual:

```bash
> mlflow run .
```
This will run the entire pipeline.

When developing, it is useful to be able to run one step at a time. Say you want to run only
the ``download`` step. The ``main.py`` script is written so that the steps are defined at the top of the file, in the
``_steps`` list, and can be selected by using the ``steps`` parameter on the command line:

```bash
> mlflow run . -P steps=download
```
If you want to run the ``download`` and the ``preprocess`` steps, you can similarly do:
```bash
> mlflow run . -P steps=download,preprocess
```
You can override any other parameter in the configuration file using the Hydra syntax, by
providing it as a ``hydra_options`` parameter. For example, say that we want to set the parameter
``random_forest_pipeline -> random_forest -> n_estimators`` to 10:

```bash
> mlflow run . \
  -P hydra_options="random_forest_pipeline.random_forest.n_estimators=10"
```

To enable parallel hyperparameter optimization, you should execute the following:
```bash
> mlflow run . \
-P hydra_options="-m random_forest_pipeline.random_forest.max_depth=range(10,50,3) random_forest_pipeline.tfidf.max_features=range(50,200,50) hydra/launcher=joblib"
```
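The step-selection and configuration-access pattern described above can be sketched in plain Python. This is a hypothetical illustration, not the project's actual ``main.py``: the Hydra machinery is omitted (a plain dict stands in for the parsed ``config.yaml``), and the ``_steps`` list is truncated to the two steps named in this README.

```python
# Sketch of the step-selection logic described above. In the real pipeline,
# Hydra reads config.yaml and passes it to go() as the `config` mapping;
# here a plain dict stands in for it. All names are illustrative.

# Steps defined at the top of main.py (the real project defines the full list).
_steps = ["download", "preprocess"]

def go(config: dict) -> list:
    # Configuration values are accessed exactly as described above, e.g.
    # the project name under the "main" section:
    project_name = config["main"]["project_name"]

    # "all" runs every step; otherwise a comma-separated subset is run,
    # as passed with `mlflow run . -P steps=download,preprocess`.
    steps_par = config["main"].get("steps", "all")
    active_steps = _steps if steps_par == "all" else steps_par.split(",")

    # Each active step would then be launched as its own MLflow run.
    return active_steps

# Example: mimic `mlflow run . -P steps=download`
cfg = {"main": {"project_name": "genre_classification", "steps": "download"}}
print(go(cfg))  # prints ['download']
```

In the real pipeline, ``@hydra.main`` would parse ``config.yaml`` and call ``go`` with the resulting config object, and each active step would be launched with ``mlflow.run``.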
## License
Distributed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt). See ``LICENSE`` for more information.