Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Training Higgs Dataset with Keras - https://doi.org/10.5281/zenodo.13133945
https://github.com/ramy-badr-ahmed/higgs-dataset-training
- Host: GitHub
- URL: https://github.com/ramy-badr-ahmed/higgs-dataset-training
- Owner: Ramy-Badr-Ahmed
- License: apache-2.0
- Created: 2024-07-26T20:06:24.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-08-12T09:39:49.000Z (5 months ago)
- Last Synced: 2024-10-10T08:23:17.993Z (4 months ago)
- Topics: binary-classification, cuda-toolkit, cupy, dask, dask-dataframes, higgs-boson, keras, keras-tensorflow, matplotlib, matplotlib-python, numpy, pandas, pandas-dataframe, scikit-learn, uci-dataset, uci-machine-learning
- Language: Python
- Size: 3.46 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
![Python](https://img.shields.io/badge/Python-3670A0?style=plastic&logo=python&logoColor=ffdd54) ![Pandas](https://img.shields.io/badge/Pandas-%23150458.svg?style=plastic&logo=pandas&logoColor=white) ![NumPy](https://img.shields.io/badge/Numpy-777BB4.svg?style=plastic&logo=numpy&logoColor=white) ![Matplotlib](https://img.shields.io/badge/Matplotlib-%233F4F75.svg?style=plastic&logo=plotly&logoColor=white) ![Keras](https://img.shields.io/badge/Keras-%23D00000.svg?style=plastic&logo=Keras&logoColor=white) ![scikit-learn](https://img.shields.io/badge/scikit--learn-%23F7931E.svg?style=plastic&logo=scikit-learn&logoColor=white) ![Dask](https://img.shields.io/badge/Dask-%23870000.svg?style=plastic&logo=dask&logoColor=white) ![GitHub](https://img.shields.io/github/license/Ramy-Badr-Ahmed/Higgs-Dataset-Training?style=plastic)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13133945.svg)](https://doi.org/10.5281/zenodo.13133945)
# Model Training and Evaluation for Higgs Dataset
## Overview
This repository demonstrates training and evaluating a Keras model using the Higgs dataset available from the UCI ML Repository.
> [Higgs Dataset](http://archive.ics.uci.edu/ml/datasets/HIGGS)
The dataset has been studied in this publication:
> [Searching for Exotic Particles in High-energy Physics with Deep Learning. Baldi, P., P. Sadowski, and D. Whiteson. Nature Communications 5, 4308 (2014)](https://www.nature.com/articles/ncomms5308)

The ML pipeline includes downloading the dataset, data preparation, model training, evaluation, feature importance analysis, and visualisation of results. Dask is utilised to handle this large dataset with parallel processing.
### Installation
1) Create and activate a virtual environment:
```shell
python -m venv env
source env/bin/activate # On Windows use `env\Scripts\activate`
```
2) Install the dependencies:
```shell
pip install -r requirements.txt
```

### Data
The Higgs dataset can be downloaded and prepared in separate steps using the provided scripts:
- `download_data.py` ~ 2.6 GB
- `data_extraction.py` ~ 7 GB
- `data_preparation.py` ~ test dataset: 240 MB, training dataset: 5 GB

Alternatively, you can run the main script directly from `data/src/main.py`:
```shell
python data/src/main.py
```

#### Downloading Data
Downloads the dataset file from the specified URL, showing a progress bar.

##### Script
```shell
python data/download_data.py
```

##### Example Usage
```python
zipDataUrl = 'https://archive.ics.uci.edu/static/public/280/higgs.zip' # Higgs dataset URL
zipPath = '../higgs/higgs.zip'
downloadDataset(zipDataUrl, zipPath)
cleanUp(zipPath) # Clean up downloaded zip file (~ 2.6 GB)
```
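
For reference, the download-with-progress-bar step can be sketched with `requests` and `tqdm` (an illustrative sketch with a hypothetical helper name, not the repository's actual implementation):

```python
import os
import requests
from tqdm import tqdm

def downloadDatasetSketch(url: str, destination: str, chunkSize: int = 8192) -> None:
    """Stream a file to disk while displaying a progress bar (illustrative sketch)."""
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()
    totalBytes = int(response.headers.get('content-length', 0))
    with open(destination, 'wb') as file, tqdm(total=totalBytes, unit='B', unit_scale=True) as progress:
        for chunk in response.iter_content(chunk_size=chunkSize):
            file.write(chunk)
            progress.update(len(chunk))
```

#### Data Extraction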
Extracts the contents of the zipped dataset and decompresses the .gz dataset file to a specified output path.

##### Script
```shell
python data/data_extraction.py
```

##### Example Usage
```python
import os

zipDataUrl = 'https://archive.ics.uci.edu/static/public/280/higgs.zip'  # Higgs dataset URL
extractTo = '../higgs'
zipPath = os.path.join(extractTo, 'higgs.zip')
gzCsvPath = os.path.join(extractTo, 'higgs.csv.gz')
finalCsvPath = os.path.join(extractTo, 'higgs.csv')

extractZippedData(zipPath, extractTo)
decompressGzFile(gzCsvPath, finalCsvPath)
cleanUp(gzCsvPath) # Clean up gzipped file (~ 2.6 GB)
```
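
The extraction step itself needs only the Python standard library; a minimal sketch (hypothetical helper names, not the repository's code):

```python
import gzip
import shutil
import zipfile

def extractZippedDataSketch(zipPath: str, extractTo: str) -> None:
    """Unpack the downloaded zip archive (illustrative sketch)."""
    with zipfile.ZipFile(zipPath, 'r') as archive:
        archive.extractall(extractTo)

def decompressGzFileSketch(gzPath: str, outputPath: str) -> None:
    """Stream-decompress the .gz CSV without holding it in memory."""
    with gzip.open(gzPath, 'rb') as compressed, open(outputPath, 'wb') as output:
        shutil.copyfileobj(compressed, output)
```

#### Data Preparation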
Sets column names and separates the test set from the training data based on the dataset description (the last 500,000 rows form the test set).

Dataset description: the first column is the class label (1 for signal, 0 for background), followed by the 28 features (21 low-level features, then 7 high-level features). The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21.
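
A minimal pandas sketch of this step (the column names are hypothetical, and the repository's `prepareData` may differ, e.g. by using Dask):

```python
import pandas as pd

# Hypothetical column layout following the dataset description above
columns = (['label']
           + [f'feature_{i}' for i in range(1, 22)]     # 21 low-level kinematic features
           + [f'hl_feature_{i}' for i in range(1, 8)])  # 7 high-level derived features

dataFrame = pd.read_csv('higgs.csv', header=None, names=columns)

trainFrame = dataFrame.iloc[:-500_000]  # everything except the last 500,000 rows
testFrame = dataFrame.iloc[-500_000:]   # the last 500,000 rows form the test set

trainFrame.to_csv('prepared-higgs_train.csv', index=False)
testFrame.to_csv('prepared-higgs_test.csv', index=False)
```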
##### Script
```shell
python data/data_preparation.py
```

##### Example Usage
```python
prepareFrom = '../higgs'
csvPath = os.path.join(prepareFrom, 'higgs.csv')
preparedCsvPath = os.path.join(prepareFrom, 'prepared-higgs.csv')
prepareData(csvPath, preparedCsvPath)
cleanUp(csvPath)  # Clean up the extracted CSV file (~ 7.5 GB)
```

### Loading Data
#### Using Pandas
Use the `dataLoader/data_loader.py` script to load the prepared dataset into a Pandas DataFrame.
##### Script
```shell
python data/src/data_loader.py
```

##### Example Usage
```python
filepath = '../data/higgs/prepared-higgs_train.csv' # prepared-higgs_test.csv
dataLoader = DataLoader(filepath)
dataFrame = dataLoader.loadData()
dataLoader.previewData(dataFrame)
```

#### Using Dask
Use the `dataLoader/data_loader_dask.py` script to load the prepared dataset into a Dask DataFrame, which is beneficial for this large dataset.
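
At its core, such a loader can rely on `dask.dataframe.read_csv`, which partitions the file instead of reading it into memory at once; a minimal sketch:

```python
import dask.dataframe as dd

# Read lazily in ~64 MB partitions instead of loading the whole CSV at once
dataFrame = dd.read_csv('../data/higgs/prepared-higgs_train.csv', blocksize='64MB')

print(dataFrame.npartitions)  # partitions Dask can process in parallel
print(dataFrame.head())       # head() computes only the first partition
```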
##### Script
```shell
python data/src/data_loader_dask.py
```
##### Example Usage:
```python
filepath = '../data/higgs/prepared-higgs_train.csv' # prepared-higgs_test.csv
dataLoader = DataLoaderDask(filepath)
dataFrame = dataLoader.loadData()
dataLoader.previewData(dataFrame)
```

### Exploratory Data Analysis (EDA)
Provides various functions for performing EDA, including visualising correlations, checking missing values, and plotting feature distributions.
The data analysis plots are saved under `eda/plots`.

##### Script
```shell
python exploration/eda.py
```

##### Example Usage:
```python
filepath = '../data/higgs/prepared-higgs_train.csv'  # prepared-higgs_test.csv

# Using a Dask data frame
dataLoaderDask = DataLoaderDask(filepath)
dataFrame = dataLoaderDask.loadData()

eda = EDA(dataFrame)
eda.describeData()
eda.checkMissingValues()
eda.visualiseFeatureCorrelation()
eda.visualizeTargetDistribution()
eda.visualizeFeatureDistribution('feature_1')
eda.visualizeAllFeatureDistributions()
eda.visualizeFeatureScatter('feature_1', 'feature_2')
eda.visualizeFeatureBoxplot('feature_2')
```
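
For intuition, a feature-correlation plot like the one produced by `visualiseFeatureCorrelation` can be sketched with plain Dask and matplotlib (reusing `dataFrame` from the example above; the repository's plotting code may differ):

```python
import matplotlib.pyplot as plt

# Dask builds the correlation matrix lazily; .compute() materialises it as pandas
correlationMatrix = dataFrame.corr().compute()

figure, axes = plt.subplots(figsize=(12, 10))
image = axes.imshow(correlationMatrix.values, cmap='coolwarm', vmin=-1, vmax=1)
figure.colorbar(image, ax=axes)
axes.set_xticks(range(len(correlationMatrix.columns)))
axes.set_xticklabels(correlationMatrix.columns, rotation=90)
axes.set_yticks(range(len(correlationMatrix.columns)))
axes.set_yticklabels(correlationMatrix.columns)
figure.tight_layout()
figure.savefig('eda/plots/feature_correlation.png')  # plots directory per this README
```

### Usage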
#### Training the Model
The model is defined using Keras with the following default architecture for binary classification (sketched below):
- Input layer with 128 neurons (dense)
- Hidden layer with 64 neurons (dense)
- Output layer with 1 neuron (activation function: sigmoid)

You can customise the model architecture by providing a different `modelBuilder` callable to the `ModelTrainer` class.
The trained models and training loss plots are saved under `kerasModel/trainer/trainedModels`.
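
For reference, that default architecture corresponds to a Keras model roughly like the following (a sketch with a hypothetical helper name; the ReLU activations on the dense layers are an assumption, and the actual defaults live in `ModelTrainer`):

```python
import keras
from keras import layers

def defaultModelSketch(inputShape: int) -> keras.Model:
    """Plausible shape of the default architecture described above (illustrative)."""
    return keras.Sequential([
        layers.Input(shape=(inputShape,)),
        layers.Dense(128, activation='relu'),   # input layer with 128 neurons (dense)
        layers.Dense(64, activation='relu'),    # hidden layer with 64 neurons (dense)
        layers.Dense(1, activation='sigmoid'),  # output layer: 1 neuron, sigmoid
    ])
```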
##### Script
```shell
python kerasModel/trainer/model_trainer.py
```

##### Example Usage:
```python
import keras
from keras import layers
from keras.models import Model

filePath = '../../data/higgs/prepared-higgs_train.csv'

def customModel(inputShape: int) -> Model:
    """Example of a custom model builder function for classification"""
    model = keras.Sequential([
        layers.Input(shape=(inputShape,)),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(256, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='sigmoid')  # Sigmoid for binary classification
    ])
    return model

dataLoaderDask = DataLoaderDask(filePath)
dataFrame = dataLoaderDask.loadData()

# Optional: define model training/compiling parameters as a dictionary and pass it to the class constructor
params = {
"epochs": 10,
"batchSize": 32,
"minSampleSize": 100000,
"learningRate": 0.001,
"modelBuilder": customModel, # callable
"loss": 'binary_crossentropy',
"metrics": ['accuracy']
}
trainer = ModelTrainer(dataFrame, params)
trainer.trainKerasModel()  # Optional: train with sampling, e.g. trainKerasModel(sample=True, frac=0.1)
trainer.plotTrainingHistory()
```
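
For orientation, the parameter dictionary above plausibly maps onto standard Keras calls as follows (a sketch, not the repository's code; `xTrain`/`yTrain` are placeholders for the features and labels that `ModelTrainer` derives from the DataFrame):

```python
from keras.optimizers import Adam

model = customModel(inputShape=28)  # the Higgs dataset has 28 features
model.compile(optimizer=Adam(learning_rate=params["learningRate"]),
              loss=params["loss"],
              metrics=params["metrics"])

# xTrain / yTrain: placeholder arrays of prepared features and labels
history = model.fit(xTrain, yTrain,
                    epochs=params["epochs"],
                    batch_size=params["batchSize"],
                    validation_split=0.2)  # the validation split is an assumption
```

#### Evaluating the Model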
The evaluation script computes metrics like:
- Accuracy
- Precision
- Recall (Sensitivity)
- F1 Score
- Classification Report

The evaluation includes visualizations such as:
- Confusion Matrix
- ROC Curve

The evaluation results are logged and saved to a file under `kerasModel/evaluator/evaluationPlots`.
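
These metrics can be computed with scikit-learn along the following lines (a sketch; `yTrue` and `yProb` are placeholders for the ground-truth labels and the model's sigmoid outputs):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report, roc_curve, auc)

yPred = (yProb > 0.5).astype(int)  # threshold the sigmoid probabilities at 0.5

print(accuracy_score(yTrue, yPred))
print(precision_score(yTrue, yPred))
print(recall_score(yTrue, yPred))
print(f1_score(yTrue, yPred))
print(classification_report(yTrue, yPred))

falsePositiveRate, truePositiveRate, _ = roc_curve(yTrue, yProb)
rocAuc = auc(falsePositiveRate, truePositiveRate)  # area under the ROC curve
```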
##### Script
```shell
python kerasModel/evaluator/model_evaluator.py
```

##### Example Usage:
```python
modelPath = '../trainer/trainedModels/keras_model_trained_dataset.keras'
filePath = '../../data/higgs/prepared-higgs_train.csv'

dataLoaderDask = DataLoaderDask(filePath)
dataFrame = dataLoaderDask.loadData()

evaluator = ModelEvaluator(modelPath, dataFrame)
evaluator.evaluate()
```

#### Feature Importance Analysis
The feature importance is computed using permutation importance and visualised using a bar chart. It is implemented twice: once using the Pandas approach (with scikit-learn) and once using Dask for parallel processing.
The chart and the result CSV file are saved under `kerasModel/featureImportance/featureImportancePlots`.
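
Conceptually, permutation importance shuffles one feature column at a time, re-scores the model, and records how much the score drops; a self-contained sketch (hypothetical helper, not the repository's implementation):

```python
import numpy as np

def permutationImportanceSketch(model, features: np.ndarray, labels: np.ndarray,
                                nRepeats: int = 5) -> np.ndarray:
    """Score drop per feature when that feature's column is shuffled (illustrative)."""
    rng = np.random.default_rng(seed=0)
    baseline = model.evaluate(features, labels, verbose=0)[1]  # assumes accuracy is the first compiled metric
    importances = np.zeros(features.shape[1])
    for column in range(features.shape[1]):
        scores = []
        for _ in range(nRepeats):
            permuted = features.copy()
            rng.shuffle(permuted[:, column])  # break this feature's relation to the labels
            scores.append(model.evaluate(permuted, labels, verbose=0)[1])
        importances[column] = baseline - np.mean(scores)
    return importances
```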
##### Script
```shell
python kerasModel/featureImportance/feature_importance.py
```

##### Example Usage:
```python
modelPath = '../trainer/trainedModels/keras_model_test_dataset.keras'
filePath = '../../data/higgs/prepared-higgs_test.csv'

dataLoaderDask = DataLoaderDask(filePath)
dataFrame = dataLoaderDask.loadData()

evaluator = FeatureImportanceEvaluator(modelPath, dataFrame)
evaluator.evaluate()

# Alternatively, with sampling and more permutation repeats
evaluator = FeatureImportanceEvaluator(modelPath, dataFrame, sampleFraction=0.1, nRepeats=32)
evaluator.evaluate(withDask=False)  # with pandas
```