https://github.com/Parag000/GPU-Accelerated-ML-For-Big-Data-Processing
This project predicts the "Scariest Monster" from a dataset of 12 million entries and 106 features, using GPU-accelerated processing and a Random Forest Regressor from the NVIDIA RAPIDS API. The goal is to minimize RMSE for accurate predictions.
- Host: GitHub
- URL: https://github.com/Parag000/GPU-Accelerated-ML-For-Big-Data-Processing
- Owner: Parag000
- Created: 2024-11-05T16:37:10.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-11-05T17:50:11.000Z (7 months ago)
- Last Synced: 2025-03-29T06:22:52.871Z (2 months ago)
- Topics: cudf, cuml, python3, rapidsai, sklearn
- Language: Jupyter Notebook
- Homepage: https://colab.research.google.com/drive/1N9I19hw8kbnNC2j3QmP5nTD5GbgGF5qW?usp=sharing
- Size: 296 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 👻 Predicting the Scariest Monster - Nvidia Hackathon
## Link to Colab Notebook:
[Open this notebook in Google Colab](https://colab.research.google.com/drive/1N9I19hw8kbnNC2j3QmP5nTD5GbgGF5qW?usp=sharing)
## Our Submission Scores
## Project Overview
This project presents a solution for the **ODSC 2024 NVIDIA Hackathon**, where data scientists are challenged to predict the "Scariest Monster" using a massive dataset filled with 12 million entries, each described by 106 anonymous features. The ultimate goal is to forecast the number of votes each monster received in a global terror poll, utilizing GPU-accelerated data processing and machine learning techniques.
## Dataset
The competition dataset includes:
- **12 million monster entries**
- **106 anonymous features** (a mix of categorical and numerical)
- **Target variable 'y'**: Number of votes each monster received in the global terror poll
- **Dataset size**: Approximately **8-10GB**
## Approach
Our approach to tackling this challenge involves the following steps:
1. **Data Loading and Preprocessing**:
   - Loading the data using **cuDF** (RAPIDS NVIDIA API) for GPU-accelerated processing.
   - Performing basic **Exploratory Data Analysis (EDA)** to understand the dataset.
   - Dropping categorical columns to avoid creating sparse matrices.
   - Applying **mean imputation** for numerical columns.
   - Removing outliers and performing **robust normalization** for stability.
2. **Memory-Efficient Train-Test Split**:
   - Creating a custom train-test split method to handle memory constraints effectively.
   - Using a random shuffled column for efficient data shuffling and splitting.
3. **Model Training**:
   - Implementing a **Random Forest Regressor** using the RAPIDS **cuML** library for GPU-accelerated processing.
4. **Post-processing**:
   - Applying **inverse robust scaling** to calculate the final RMSE value.
5. **Prediction and Submission**:
   - Generating predictions on the test set.
   - Preparing the submission file in accordance with the competition guidelines.
## Technologies Used
- **NVIDIA A100 GPU** for hardware acceleration
- **Python 3.x**
- **RAPIDS cuDF** for GPU-accelerated data processing
- **RAPIDS cuML** for GPU-accelerated machine learning
- **Scikit-learn** for preprocessing and metrics
- **Google Colab Notebook** for interactive development
## Results
The model's performance is evaluated based on **Root Mean Squared Error (RMSE)**, with lower scores indicating better performance.
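As a minimal sketch of how a score could be reported in original vote units, the snippet below applies robust scaling to a toy target, undoes it on the predictions, and then computes RMSE. It assumes the target `y` was among the robust-scaled columns, which the notebook may or may not do. The helper names (`robust_scale`, `inverse_robust_scale`, `rmse`) and the toy numbers are illustrative and not taken from the project. It uses only the Python standard library, so it runs without a GPU.

```python
import math
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # guard against a zero IQR on constant data
    scaled = [(v - median) / iqr for v in values]
    return scaled, median, iqr


def inverse_robust_scale(scaled, median, iqr):
    """Map scaled values back to the original units."""
    return [s * iqr + median for s in scaled]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy target values (hypothetical vote counts, not competition data).
y_true = [10.0, 20.0, 30.0, 40.0, 50.0]
y_scaled, median, iqr = robust_scale(y_true)

# Pretend the model's predictions in scaled space are off by a constant 0.1.
pred_scaled = [s + 0.1 for s in y_scaled]

# Undo the scaling before scoring so the error is expressed in votes.
y_pred = inverse_robust_scale(pred_scaled, median, iqr)

rmse_scaled = rmse(y_scaled, pred_scaled)
rmse_original = rmse(y_true, y_pred)
print(f"RMSE (scaled units): {rmse_scaled:.3f}")
print(f"RMSE (original units): {rmse_original:.3f}")
```

Because robust scaling is an affine map, the RMSE in original units equals the scaled-space RMSE multiplied by the IQR, so scores computed before and after inverse scaling are not directly comparable.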
## Getting Started
1. **Clone this repository**:
```bash
git clone https://github.com/Parag000/Nvidia-Data-Science-Competition.git
```
## Leaderboard
