![Python](https://img.shields.io/badge/Python-3670A0?style=plastic&logo=python&logoColor=ffdd54) ![Pandas](https://img.shields.io/badge/Pandas-%23150458.svg?style=plastic&logo=pandas&logoColor=white) ![NumPy](https://img.shields.io/badge/Numpy-777BB4.svg?style=plastic&logo=numpy&logoColor=white) ![Matplotlib](https://img.shields.io/badge/Matplotlib-%233F4F75.svg?style=plastic&logo=plotly&logoColor=white) ![Keras](https://img.shields.io/badge/Keras-%23D00000.svg?style=plastic&logo=Keras&logoColor=white) ![scikit-learn](https://img.shields.io/badge/scikit--learn-%23F7931E.svg?style=plastic&logo=scikit-learn&logoColor=white) ![Dask](https://img.shields.io/badge/Dask-%23870000.svg?style=plastic&logo=dask&logoColor=white) ![GitHub](https://img.shields.io/github/license/Ramy-Badr-Ahmed/Higgs-Dataset-Training?style=plastic)

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13133945.svg)](https://doi.org/10.5281/zenodo.13133945)

# Model Training and Evaluation for Higgs Dataset

## Overview

This repository demonstrates training and evaluating a Keras model using the Higgs dataset available from the UCI ML Repository.

> [Higgs Dataset](http://archive.ics.uci.edu/ml/datasets/HIGGS)

The dataset has been studied in this publication:

> [Searching for Exotic Particles in High-energy Physics with Deep Learning.
Baldi, P., P. Sadowski, and D. Whiteson.
Nature Communications 5, 4308 (2014)](https://www.nature.com/articles/ncomms5308)

The ML pipeline includes downloading the dataset, data preparation, model training, evaluation, feature importance analysis, and visualisation of results. Dask is used to handle this large dataset with parallel processing.

### Installation

1) Create and activate a virtual environment:
```shell
python -m venv env
source env/bin/activate # On Windows use `env\Scripts\activate`
```
2) Install the dependencies:
```shell
pip install -r requirements.txt
```

### Data

The Higgs dataset can be downloaded and prepared in separate steps using the provided scripts:

- `download_data.py` (~2.6 GB download)
- `data_extraction.py` (~7 GB extracted CSV)
- `data_preparation.py` (test dataset: ~240 MB, training dataset: ~5 GB)

Alternatively, you can run the main script, `data/src/main.py`, directly:

```shell
python data/src/main.py
```

#### Downloading Data
Downloads the dataset file from the specified URL, displaying a progress bar.

##### Script
```shell
python data/download_data.py
```

##### Example Usage
```python
zipDataUrl = 'https://archive.ics.uci.edu/static/public/280/higgs.zip' # Higgs dataset URL
zipPath = '../higgs/higgs.zip'
downloadDataset(zipDataUrl, zipPath)
cleanUp(zipPath) # Clean up downloaded zip file (~ 2.6 GB)
```
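
For illustration, such a download helper could be sketched with `requests` and `tqdm` as below. This is an assumed implementation, not necessarily what `download_data.py` actually does:

```python
import requests
from tqdm import tqdm

def downloadWithProgress(url: str, destination: str) -> None:
    """Stream a file to disk while displaying a progress bar (illustrative sketch)."""
    response = requests.get(url, stream=True)
    response.raise_for_status()
    totalBytes = int(response.headers.get('content-length', 0))
    with open(destination, 'wb') as fileHandle, tqdm(total=totalBytes, unit='B', unit_scale=True) as progressBar:
        for chunk in response.iter_content(chunk_size=1024 * 1024):  # 1 MB chunks
            fileHandle.write(chunk)
            progressBar.update(len(chunk))
```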

#### Data Extraction
Extracts the contents of the zipped dataset and decompresses the contained .gz dataset file to a specified output path.

##### Script
```shell
python data/data_extraction.py
```

##### Example Usage
```python
import os

zipDataUrl = 'https://archive.ics.uci.edu/static/public/280/higgs.zip' # Higgs dataset URL
extractTo = '../higgs'
zipPath = os.path.join(extractTo, 'higgs.zip')
gzCsvPath = os.path.join(extractTo, 'higgs.csv.gz')
finalCsvPath = os.path.join(extractTo, 'higgs.csv')

extractZippedData(zipPath, extractTo)
decompressGzFile(gzCsvPath, finalCsvPath)
cleanUp(gzCsvPath) # Clean up gzipped file (~ 2.6 GB)
```
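
Both steps map to standard-library functionality; a minimal sketch of the idea (an assumed implementation, not the repository's actual code):

```python
import gzip
import shutil
import zipfile

def extractAndDecompress(zipPath: str, extractTo: str, gzCsvPath: str, finalCsvPath: str) -> None:
    # Unpack the downloaded archive into the target directory
    with zipfile.ZipFile(zipPath, 'r') as archive:
        archive.extractall(extractTo)
    # Stream-decompress the .gz CSV so the full file never sits in memory
    with gzip.open(gzCsvPath, 'rb') as gzFile, open(finalCsvPath, 'wb') as csvFile:
        shutil.copyfileobj(gzFile, csvFile)
```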

#### Data Preparation
Sets column names and separates the test set from the training data based on the dataset description (the last 500,000 rows form the test set).

Dataset Description: The first column is the class label (1 for signal, 0 for background), followed by the 28 features (21 low-level features then 7 high-level features). The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features.

##### Script
```shell
python data/data_preparation.py
```

##### Example Usage
```python
import os

prepareFrom = '../higgs'
csvPath = os.path.join(prepareFrom, 'higgs.csv')
preparedCsvPath = os.path.join(prepareFrom, 'prepared-higgs.csv')
prepareData(csvPath, preparedCsvPath)
cleanUp(csvPath) # Clean up the decompressed CSV file (~ 7.5 GB)
```
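
Based on the dataset description above, `prepareData` plausibly assigns a label column plus 28 feature columns and splits off the last 500,000 rows as the test set. A minimal pandas sketch of that logic; the column names and output paths are assumptions inferred from the file names used elsewhere in this README:

```python
import pandas as pd

# Assumed naming: class label first, then feature_1 .. feature_28
columnNames = ['label'] + [f'feature_{i}' for i in range(1, 29)]

# For the full ~7.5 GB CSV a chunked or Dask-based read would be preferable
dataFrame = pd.read_csv('../higgs/higgs.csv', header=None, names=columnNames)

# Per the UCI description, the last 500,000 rows serve as the test set
testFrame = dataFrame.tail(500_000)
trainFrame = dataFrame.iloc[:-500_000]

trainFrame.to_csv('../higgs/prepared-higgs_train.csv', index=False)
testFrame.to_csv('../higgs/prepared-higgs_test.csv', index=False)
```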

### Loading Data

#### Using Pandas

Use the `dataLoader/data_loader.py` script to load the prepared dataset into a Pandas DataFrame.

##### Script
```shell
python data/src/data_loader.py
```

##### Example Usage
```python
filepath = '../data/higgs/prepared-higgs_train.csv' # or: prepared-higgs_test.csv
dataLoader = DataLoader(filepath)
dataFrame = dataLoader.loadData()
dataLoader.previewData(dataFrame)
```
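
Internally, `DataLoader` presumably wraps `pandas.read_csv`; a hypothetical minimal version for orientation:

```python
import pandas as pd

class DataLoader:
    """Hypothetical minimal sketch of the repository's DataLoader."""

    def __init__(self, filepath: str):
        self.filepath = filepath

    def loadData(self) -> pd.DataFrame:
        # Eagerly reads the whole CSV into memory; fine for the test split,
        # heavier for the ~5 GB training split
        return pd.read_csv(self.filepath)

    def previewData(self, dataFrame: pd.DataFrame, rows: int = 5) -> None:
        print(dataFrame.head(rows))
```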

#### Using Dask

Use the `dataLoader/data_loader_dask.py` script to load the prepared dataset into a Dask DataFrame, which is beneficial for this large dataset.

##### Script
```shell
python data/src/data_loader_dask.py
```
##### Example Usage
```python
filepath = '../data/higgs/prepared-higgs_train.csv' # or: prepared-higgs_test.csv
dataLoader = DataLoaderDask(filepath)
dataFrame = dataLoader.loadData()
dataLoader.previewData(dataFrame)
```
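
The Dask loader presumably builds on `dask.dataframe.read_csv`, which reads the CSV lazily in partitions so the ~5 GB training file never has to fit in memory at once. A minimal sketch of the idea:

```python
import dask.dataframe as dd

# Lazily partitioned read; no data is loaded until a computation is triggered
dataFrame = dd.read_csv('../data/higgs/prepared-higgs_train.csv')

print(dataFrame.npartitions)  # number of lazy partitions
print(dataFrame.head())       # head() only computes the first partition
```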

### Exploratory Data Analysis (EDA)

The `exploration/eda.py` script provides various functions for performing EDA, including visualising correlations, checking missing values, and plotting feature distributions.
The data analysis plots are saved under `eda/plots`.

##### Script
```shell
python exploration/eda.py
```

##### Example Usage
```python
filepath = '../data/higgs/prepared-higgs_train.csv' # or: prepared-higgs_test.csv

# using Dask data frame
dataLoaderDask = DataLoaderDask(filepath)
dataFrame = dataLoaderDask.loadData()

eda = EDA(dataFrame)
eda.describeData()
eda.checkMissingValues()
eda.visualiseFeatureCorrelation()

eda.visualizeTargetDistribution()
eda.visualizeFeatureDistribution('feature_1')
eda.visualizeAllFeatureDistributions()
eda.visualizeFeatureScatter('feature_1', 'feature_2')
eda.visualizeFeatureBoxplot('feature_2')
```
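
As an illustration, the correlation heatmap could be produced along these lines (an assumed implementation; `dataFrame` is the Dask DataFrame from the example above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# .corr() builds a lazy task graph; .compute() materialises it as a pandas DataFrame
correlationMatrix = dataFrame.corr().compute()

plt.figure(figsize=(12, 10))
sns.heatmap(correlationMatrix, cmap='coolwarm', center=0)
plt.title('Feature Correlation')
plt.savefig('eda/plots/feature_correlation.png')
```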

### Usage

#### Training the Model
The model is defined using Keras with the following default architecture for binary classification (sketched below):

- Input layer with 128 neurons (dense)
- Hidden layer with 64 neurons (dense)
- Output layer with 1 neuron (activation function: sigmoid)
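
A minimal Keras sketch of this default architecture. The hidden-layer activations are not stated above; ReLU is assumed here, matching the custom example further below:

```python
import keras
from keras import layers

def defaultModel(inputShape: int) -> keras.Model:
    # Assumed reconstruction of the default architecture listed above
    return keras.Sequential([
        layers.Input(shape=(inputShape,)),
        layers.Dense(128, activation='relu'),   # dense "input" layer, 128 neurons
        layers.Dense(64, activation='relu'),    # hidden dense layer, 64 neurons
        layers.Dense(1, activation='sigmoid'),  # sigmoid output for binary classification
    ])
```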

You can customise the model architecture by providing a different `modelBuilder` callable in the `ModelTrainer` class.

The trained models and training loss plots are saved under `kerasModel/trainer/trainedModels`.

##### Script
```shell
python kerasModel/trainer/model_trainer.py
```

##### Example Usage
```python
import keras
from keras import Model, layers

filePath = '../../data/higgs/prepared-higgs_train.csv'

def customModel(inputShape: int) -> Model:
    """Example of a custom model builder function for classification."""
    model = keras.Sequential([
        layers.Input(shape=(inputShape,)),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(256, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='sigmoid')  # Sigmoid for binary classification
    ])
    return model

dataLoaderDask = DataLoaderDask(filePath)
dataFrame = dataLoaderDask.loadData()

# Optional: define model/compile/training parameters as a dictionary
# and pass it to the class constructor
params = {
    "epochs": 10,
    "batchSize": 32,
    "minSampleSize": 100000,
    "learningRate": 0.001,
    "modelBuilder": customModel,  # callable
    "loss": 'binary_crossentropy',
    "metrics": ['accuracy']
}
trainer = ModelTrainer(dataFrame, params)
trainer.trainKerasModel()  # Optional: train on a sample with trainKerasModel(sample=True, frac=0.1)
trainer.plotTrainingHistory()
```

#### Evaluating the Model
The evaluation script computes metrics like:

- Accuracy
- Precision
- Recall (Sensitivity)
- F1 Score
- Classification Report

The evaluation includes visualizations such as:
- Confusion Matrix
- ROC Curve

The evaluation results are logged and saved to a file under `kerasModel/evaluator/evaluationPlots`.

##### Script
```shell
python kerasModel/evaluator/model_evaluator.py
```

##### Example Usage
```python
modelPath = '../trainer/trainedModels/keras_model_trained_dataset.keras'
filePath = '../../data/higgs/prepared-higgs_train.csv'

dataLoaderDask = DataLoaderDask(filePath)
dataFrame = dataLoaderDask.loadData()

evaluator = ModelEvaluator(modelPath, dataFrame)
evaluator.evaluate()
```
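
The listed metrics can be reproduced with scikit-learn; a minimal, self-contained sketch with toy arrays, where `yTrue` and `yProb` stand in for the real labels and model probabilities:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report, confusion_matrix,
                             roc_curve, auc)

yTrue = np.array([0, 1, 1, 0, 1])            # ground-truth labels (toy data)
yProb = np.array([0.2, 0.8, 0.6, 0.4, 0.9])  # model output probabilities (toy data)
yPred = (yProb >= 0.5).astype(int)           # threshold at 0.5

print(accuracy_score(yTrue, yPred))
print(precision_score(yTrue, yPred))
print(recall_score(yTrue, yPred))            # recall == sensitivity
print(f1_score(yTrue, yPred))
print(classification_report(yTrue, yPred))
print(confusion_matrix(yTrue, yPred))

falsePositiveRate, truePositiveRate, _ = roc_curve(yTrue, yProb)
print(auc(falsePositiveRate, truePositiveRate))  # area under the ROC curve
```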

#### Feature Importance Analysis

The feature importance is computed using permutation importance and visualised with a bar chart. It is implemented twice: once using the Pandas approach (with scikit-learn) and once using Dask for parallel processing.

The chart and the result CSV file are saved under `kerasModel/featureImportance/featureImportancePlots`.

##### Script
```shell
python kerasModel/featureImportance/feature_importance.py
```

##### Example Usage
```python
modelPath = '../trainer/trainedModels/keras_model_test_dataset.keras'
filePath = '../../data/higgs/prepared-higgs_test.csv'

dataLoaderDask = DataLoaderDask(filePath)
dataFrame = dataLoaderDask.loadData()

evaluator = FeatureImportanceEvaluator(modelPath, dataFrame)
evaluator.evaluate()

# Alternatively
evaluator = FeatureImportanceEvaluator(modelPath, dataFrame, sampleFraction=0.1, nRepeats=32) # with sampling
evaluator.evaluate(withDask=False) # with Pandas
```
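
Conceptually, permutation importance measures how much a metric (here accuracy) drops when a single feature column is shuffled, averaged over repeats. A minimal NumPy sketch of the idea, assuming a fitted Keras binary classifier and NumPy feature/label arrays; the repository's `FeatureImportanceEvaluator` remains the authoritative implementation:

```python
import numpy as np

def permutationImportance(model, features, labels, nRepeats=5, seed=42):
    """Mean drop in accuracy after shuffling each feature column (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    predictions = (model.predict(features, verbose=0).ravel() >= 0.5).astype(int)
    baseline = np.mean(predictions == labels)

    importances = []
    for column in range(features.shape[1]):
        drops = []
        for _ in range(nRepeats):
            permuted = features.copy()
            permuted[:, column] = rng.permutation(permuted[:, column])
            permutedPredictions = (model.predict(permuted, verbose=0).ravel() >= 0.5).astype(int)
            drops.append(baseline - np.mean(permutedPredictions == labels))
        importances.append(np.mean(drops))  # large drop => important feature
    return np.array(importances)
```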