This project utilizes Apache Hadoop, Hive, and PySpark to process and analyze the UNSW-NB15 dataset, enabling advanced query analysis, machine learning modeling, and visualization. The project demonstrates efficient data ingestion, processing, and predictive analytics for network security insights.
- Host: GitHub
- URL: https://github.com/tashi-2004/apache-hadoop-spark-hive-cyberanalytics
- Owner: tashi-2004
- Created: 2025-02-07T13:47:55.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-02-10T09:54:07.000Z (8 months ago)
- Last Synced: 2025-04-05T18:13:41.200Z (6 months ago)
- Topics: ai, apache-hadoop, apache-hive, big-data-analytics, big-data-processing, data-analysis, data-engineering, data-science, data-security, data-visualization, hdfs, machine-learning, network-analysis, network-security, pyspark, python3, threat-detection, unsw-nb15-dataset
- Language: Jupyter Notebook
- Homepage:
- Size: 2.62 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Hadoop-Hive-PySpark-CyberAnalytics
This repository demonstrates a comprehensive big data analytics pipeline tailored for cyber threat analysis using Apache Hadoop, Apache Hive, and PySpark. It leverages the UNSW-NB15 dataset to provide deep insights into cybersecurity threats.
## Prerequisites
Ensure you have the following installed on your system:
- Apache Hadoop (3.3.6)
- Apache Hive (4.0.1)
- PySpark
- Python 3.x
- Jupyter Notebook

## Installation
Follow these steps to set up your environment:
1. **Clone the Repository**
Clone the repository and navigate to the project directory using the following commands:
```bash
git clone https://github.com/tashi-2004/apache-hadoop-spark-hive-cyberanalytics.git
cd apache-hadoop-spark-hive-cyberanalytics
```

## Understanding the Dataset: UNSW-NB15
The UNSW-NB15 dataset was created with the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS), producing a hybrid of real modern normal activities and synthetic contemporary attack behaviors. The raw network traffic was captured with the **tcpdump** tool and totals about 100 GB.
- Feature descriptions for the dataset: [Download Features](https://www.dropbox.com/s/c8qrzd99z5s9ub6/UNSW-NB15_features.csv?dl=1)
- The complete UNSW-NB15 dataset: [Download Dataset](https://www.dropbox.com/s/4xqg32ih9xoh5jq/UNSW-NB15.csv?dl=1)

### Key Features of the Dataset:
- It includes **nine types of attacks**: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms.
- Tools such as **Argus** and **Bro-IDS** were used, along with twelve algorithms, to generate **49 features**, including a class label.
- A total of **10 million records** are available in CSV format, with a combined size of approximately **600 MB**.

### Steps for Analysis:
1. **Explore the dataset** by importing it into Hadoop HDFS.
2. **Use Hive** to query and print the first 5-10 records for better understanding.
3. Proceed with big data analytics using PySpark and Hive for advanced modeling and visualization.

## Start Hadoop Services
Navigate to the Hadoop directory and start all necessary services using the `start-all.sh` script:
```bash
start-all.sh
```
### Load Data to HDFS
Put the UNSW-NB15 dataset into HDFS to make it accessible for analysis. Use the following command to load the data:
```bash
hadoop fs -put /path/to/UNSW-NB15.csv /user/in/hdfs
```

## Execute Hive Queries
After the data is loaded into HDFS, proceed to execute Hive queries to analyze the dataset:
```bash
hive -f hivequeries.hql
```
- **Hive Query 1**
- **Hive Query 2**
- **Hive Query 3**
- **Hive Query 4**
- **Hive Query 5**
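As an illustrative sketch of the kind of statements `hivequeries.hql` might run, the same exploration can be issued through PySpark's Hive integration instead of the Hive CLI. The table name `unsw_nb15` and the `attack_cat` column are assumptions, not taken from the repository:

```python
from pyspark.sql import SparkSession

# Hypothetical sketch: mirrors the sort of queries hivequeries.hql might contain.
spark = (
    SparkSession.builder
    .appName("unsw-nb15-hive-queries")
    .enableHiveSupport()          # reuse the Hive metastore the CLI queries ran against
    .getOrCreate()
)

# Preview the first 10 records, as suggested in the "Steps for Analysis" above.
spark.sql("SELECT * FROM unsw_nb15 LIMIT 10").show(truncate=False)

# Distribution of records across the nine attack categories plus normal traffic.
spark.sql("""
    SELECT attack_cat, COUNT(*) AS records
    FROM unsw_nb15
    GROUP BY attack_cat
    ORDER BY records DESC
""").show()
```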
## PySpark Analysis
Following the Hive query execution, use PySpark to perform further data analysis. Run the PySpark notebook to carry out this step:
```bash
jupyter notebook pyspark.ipynb
```
## Key Steps in the Analysis

### 1. Data Loading and Preprocessing
The UNSW-NB15 dataset is loaded and preprocessed to prepare it for analysis, and a preview of the first rows is printed.
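As a rough sketch of this step (the exact code lives in `pyspark.ipynb`), the CSV can be read straight from the HDFS location used above. The header option and the dropna/cache choices are assumptions, not a copy of the notebook:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unsw-nb15-analysis").getOrCreate()

# Read the dataset from the HDFS directory it was loaded into earlier.
# If the CSV has no header row, apply the names from UNSW-NB15_features.csv instead.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///user/in/hdfs/UNSW-NB15.csv")
)

# Basic preprocessing: drop incomplete rows and cache for the repeated passes below.
df = df.dropna().cache()
print(f"{df.count()} rows, {len(df.columns)} columns")
df.show(5)
```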
### 2. Descriptive Statistics
Summary statistics of the dataset show the count, mean, standard deviation, and range for all features.
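A minimal way to reproduce such a summary with the `df` from the loading sketch above:

```python
# count, mean, stddev, min and max for every column.
df.describe().show(truncate=False)

# summary() adds quartiles; limited to the first ten columns to keep the output readable.
df.select(df.columns[:10]).summary().show()
```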
### 3. Correlation Analysis
A correlation matrix was generated to identify relationships between the numerical features.
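One way to compute such a matrix with Spark ML, continuing from `df` above (the column selection is an assumption):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Assemble the numeric columns into one vector, then compute Pearson correlations.
numeric_cols = [c for c, t in df.dtypes if t in ("int", "bigint", "float", "double")]
assembled = VectorAssembler(inputCols=numeric_cols, outputCol="features").transform(df)

corr = Correlation.corr(assembled, "features").head()[0]   # DenseMatrix
print(corr.toArray()[:5, :5])                              # peek at the top-left corner
```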
### 4. Kernel Density Estimation
A kernel density plot was created to analyze the distribution of the `duration` feature.
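A density plot like this can be drawn by sampling the column to pandas. `dur` is the duration field in the published feature list, so adjust the name if the notebook uses a different one (matplotlib and scipy are required for the KDE):

```python
import matplotlib.pyplot as plt

# Sample a manageable slice of the duration column to the driver and plot its density.
sample = df.select("dur").sample(fraction=0.05, seed=42).toPandas()
sample["dur"].plot(kind="kde")
plt.xlabel("duration (s)")
plt.title("Kernel density of connection duration")
plt.show()
```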
### 5. Principal Component Analysis (PCA)
PCA was applied to reduce dimensionality; the first two principal components explain most of the variability in the data.
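A sketch of the dimensionality reduction with Spark ML, standardising the numeric features first (the feature set and the exclusion of the binary `label` column are assumptions):

```python
from pyspark.ml.feature import PCA, StandardScaler, VectorAssembler

# Assemble and standardise the numeric features (leaving out the target column).
feature_cols = [c for c, t in df.dtypes
                if t in ("int", "bigint", "float", "double") and c != "label"]
vec = VectorAssembler(inputCols=feature_cols, outputCol="raw_features").transform(df)
scaled = (StandardScaler(inputCol="raw_features", outputCol="scaled_features",
                         withMean=True, withStd=True)
          .fit(vec).transform(vec))

# Project onto the first two principal components and report explained variance.
pca = PCA(k=2, inputCol="scaled_features", outputCol="pca_features").fit(scaled)
print("Explained variance:", pca.explainedVariance)
pca.transform(scaled).select("pca_features").show(5, truncate=False)
```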
### 6. K-Means Clustering
K-Means clustering was performed to identify clusters within the dataset.
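Continuing from the `scaled` frame in the PCA sketch, a minimal clustering example (k=2 is an assumption that loosely mirrors normal vs. attack traffic):

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Cluster the standardised feature vectors and inspect cluster sizes.
kmeans = KMeans(featuresCol="scaled_features", k=2, seed=42).fit(scaled)
clustered = kmeans.transform(scaled)

print("Silhouette:", ClusteringEvaluator(featuresCol="scaled_features").evaluate(clustered))
clustered.groupBy("prediction").count().show()
```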
### 7. Classification with Logistic Regression
A Logistic Regression model was used for binary classification, with the following results:
- **Confusion Matrix**:
- **Classification Report**:
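The confusion matrix and classification report are produced in the notebook; below is a rough sketch of how such a model could be trained with Spark ML and the confusion matrix derived. The `label` column name (0 = normal, 1 = attack) is an assumption; some copies of the dataset capitalise it as `Label`:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Split the standardised feature frame from the PCA sketch into train and test sets.
train, test = scaled.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(featuresCol="scaled_features", labelCol="label", maxIter=50)
pred = lr.fit(train).transform(test)

print("Test AUC:", BinaryClassificationEvaluator(labelCol="label").evaluate(pred))

# A simple confusion matrix from the prediction column.
pred.groupBy("label", "prediction").count().orderBy("label", "prediction").show()
```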
### 8. Classification with Random Forest
A Random Forest classifier was trained for binary classification of normal vs. attack traffic. Below are the confusion matrix and classification report:
- **Confusion Matrix**:
- **Classification Report**:
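A comparable sketch for this step, reusing the train/test split from the logistic regression example (the tree count and other settings are assumptions):

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

rf = RandomForestClassifier(featuresCol="scaled_features", labelCol="label",
                            numTrees=100, seed=42)
rf_pred = rf.fit(train).transform(test)

print("Test F1:", MulticlassClassificationEvaluator(labelCol="label",
                                                    metricName="f1").evaluate(rf_pred))

# Confusion matrix for normal vs. attack traffic.
rf_pred.groupBy("label", "prediction").count().orderBy("label", "prediction").show()
```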
## Contact
For queries or contributions, please contact:
**Tashfeen Abbasi**
Email: abbasitashfeen7@gmail.com