An open API service indexing awesome lists of open source software.

https://github.com/titansage02/indabax-hackacton


https://github.com/titansage02/indabax-hackacton

Last synced: 2 months ago
JSON representation

Awesome Lists containing this project

README

        

Welcome to the Cryptojacking Detection Challenge

Cryptojacking Detection Challenge

*About the problem*
> Cryptojacking is a cyber-attack utilizing malicious scripts similar to those from large cryptocurrency houses to illegally mine data without users being aware. These attacks are stealthy and difficult to detect or analyze, often leading to decreased computing speeds for users as well as crashes due to straining of computational resources.

*The objective of this challenge is*:
> classify network activity from various websites as either cryptojacking or not based on features related to both network-based and host-based data.

## 1. **Overview and Objectives:**

### **Solution Description:**
The code implements a binary classification solution to predict a target label from multiple features present in the dataset provided for the IndabaX Challenge. The solution uses a supervised learning model based on the **XGBoost** algorithm, a powerful classifier built on decision tree optimization. The main steps include data extraction, feature engineering, data preparation, model training, performance evaluation, and submission generation for the challenge.

### **Main Objectives of the Project:**
The primary objective of this project is to predict a target variable (named `Label`) from various features extracted from the dataset. The code addresses the following issues:
- **Data Preparation**: Loading data, cleaning, handling missing values, and creating new features.
- **Modeling**: Using XGBoost to train a binary classification model.
- **Evaluation**: Measuring the model's performance through standard metrics like F1-score, accuracy, and confusion matrix.

### **Expected Results:**
The expected results include a model capable of making predictions on test data and generating a submission file ready to be used for a prediction competition. The model evaluation is based on the **F1-score** metric.

---

## 2. **Architecture Diagram:**
Here is a diagram to represent the process architecture:

```plaintext
+--------------------+
| Data Loading |
| Process |
+--------------------+
|
v
+--------------------+
| Data Preprocessing |
| (Cleaning, adding |
| new features) |
+--------------------+
|
v
+--------------------+
| Data Splitting |
| (Train and Test) |
+--------------------+
|
v
+--------------------+
| XGBoost Model |
| Training |
+--------------------+
|
v
+--------------------+
| Prediction on Test |
| Data |
+--------------------+
|
v
+--------------------+
| Model Evaluation |
| (F1-score, Conf. |
| Matrix) |
+--------------------+
|
v
+--------------------+
| Submission File |
| Generation (CSV) |
+--------------------+
```

### Flow Explanation:
1. **Data Loading**: Data is extracted.
2. **Data Preprocessing**: Data is cleaned, and additional features are created.
3. **Data Splitting**: The dataset is divided into training and test sets.
4. **XGBoost Model Training**: The model is trained on the training data.
5. **Prediction on Test Data**: The model makes predictions on the test data.
6. **Model Evaluation**: The model is evaluated using metrics like F1-score and confusion matrix.
7. **Submission Generation**: The results are formatted into a CSV file for submission.

## 3. **ETL Process:**

### **Extraction:**
Data is extracted from CSV files downloaded from a Kaggle dataset via `kagglehub.dataset_download()`. These files are then loaded into Pandas DataFrames.

### **Transformation:**
Transformations are applied to the data to add new features based on ratios calculated from different columns, such as:
- `page_error_read_ratio` (the ratio of pages read to page errors).
- `bytes_ratio` (the ratio of bytes received to bytes sent).

These transformations help generate features that can improve the model's performance.

### **Loading:**
The transformed data is stored in Pandas DataFrames. The test data is prepared for use in the prediction phase, and the model predicts the values of the test data before generating a CSV file for submission.

---

## 4. **Data Modeling:**

### **Model Used:**
The primary model used for classification is **XGBoost** (`xgb.XGBClassifier`), a powerful model for binary classification tasks. The key parameters of the model are:
- **max_depth**: 6
- **learning_rate**: 0.05
- **n_estimators**: 1500
- **scale_pos_weight**: weight adjustment for imbalanced classes.
- **min_child_weight**: 0.6

The model is trained on the training data after splitting the data into training and validation sets (80% and 20%).

### **Validation:**
Evaluation is performed via a **classification_report**, **F1-score**, and the **confusion matrix**. This helps measure accuracy and identify misclassifications between classes 0 and 1.

---

## 5. **Inference:**

### **Prediction:**
Once the model is trained, it is used to predict the labels (`Label`) for the test dataset. The test data is transformed to match the same features used to train the model.

### **Services Involved:**
Inference is performed by the XGBoost model, and the results are used to generate a submission CSV file, which can then be downloaded for evaluation in a competition setting.

---

## 6. **Execution Time:**

The full execution of the entire code does not exceed 3 minutes.

---

## 7. **Performance Metrics:**

The metrics used to evaluate the model’s performance include:
- **F1-score**: A measure of precision and recall, suitable for imbalanced datasets.
- **Confusion Matrix**: To visualize true and false classifications.

Example output for evaluation:
```
F1_score: 0.964509394572025

Confusion Matrix:
[[1286 15]
[ 19 462]]
```