https://github.com/ml4net/ssh-shell-attacks
Project for Machine Learning for Networking Exam @ Polito - SSH Shell Attacks Analysis: a project to classify attacker tactics and identify patterns in 230,000 honeypot-captured Unix shell attacks using MITRE ATT&CK framework and ML techniques.
attack clustering cybersecurity dataset honeypots language-model machine-learning mitre-attack monitoring network-analysis networking shell ssh supervised-learning unix-shell unsupervised-learning
- Host: GitHub
- URL: https://github.com/ml4net/ssh-shell-attacks
- Owner: ML4Net
- License: mit
- Created: 2024-11-18T17:16:32.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-02-01T13:44:39.000Z (4 months ago)
- Last Synced: 2025-02-01T14:33:11.809Z (4 months ago)
- Topics: attack, clustering, cybersecurity, dataset, honeypots, language-model, machine-learning, mitre-attack, monitoring, network-analysis, networking, shell, ssh, supervised-learning, unix-shell, unsupervised-learning
- Language: Jupyter Notebook
- Homepage:
- Size: 120 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# SSH Shell Attacks

## Table of Contents

- [Overview](#overview)
- [Dataset](#dataset)
- [Project Report](#project-report)
- [Project Structure](#project-structure)
- [Tools and Technologies](#tools-and-technologies)
- [How to Run the Project](#how-to-run-the-project)
- [Detailed Documentation](#detailed-documentation)
  - [Data Directory](data/README.md)
  - [Notebooks Directory](notebooks/README.md)
  - [Results Directory](results/README.md)
  - [Scripts Directory](scripts/README.md)
  - [Tests Directory](tests/README.md)
- [Authors](#authors)
- [License](#license)
- [Acknowledgments](#acknowledgments)

Last updated: January 2025

## Overview
This project is part of the [Machine Learning for Networking](https://didattica.polito.it/pls/portal30/gap.pkg_guide.viewGap?p_cod_ins=01DSMUV&p_a_acc=2025&p_header=S&p_lang=IT&multi=N) course at **Politecnico di Torino**. It focuses on analyzing SSH shell attack sessions recorded from honeypot deployments to classify attacker intents and explore underlying patterns.
- Original Project Repository: [ML4Net/SSH-Shell-Attacks](https://github.com/ML4Net/SSH-Shell-Attacks)
- Original Report Repository: [ML4Net/latex-report](https://github.com/ML4Net/latex-report)

> **Navigation Tip**: This `README` provides a general overview of the project. For detailed documentation, check the specific `README` files in each directory ([see Table of Contents above](#table-of-contents)). Each subdirectory contains in-depth information about its specific components.

> **Quick Links**:
>
> - For data structure and preprocessing: [Data Documentation](data/README.md)
> - For analysis notebooks: [Notebooks Documentation](notebooks/README.md)
> - For implementation details: [Scripts Documentation](scripts/README.md)

### Objectives
1. **Classification:** Automatically identify and assign attacker intents (e.g., `Persistence`, `Discovery`) to each SSH attack session.
2. **Clustering:** Group similar attack sessions to uncover attack patterns and fine-grained categories.
3. **Language Models:** Explore advanced NLP techniques like BERT and Doc2Vec for improved classification performance.

## Dataset
The dataset consists of approximately 230,000 Unix shell attack sessions recorded from honeypots. It includes:
- **Session Commands:** Malicious commands executed in an SSH session.
- **Timestamps:** The exact time each attack started.
- **Labels:** Pre-assigned intents based on the MITRE ATT&CK framework.

### Intents (Classes)
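As an illustration of the three fields described for the dataset (commands, timestamp, labels), the following sketch models one session record and derives two simple summaries from it. It is not the dataset's actual schema: the `Session` class, its field names, and both helper functions are hypothetical, and it assumes a session may carry more than one intent label. The command splitter is deliberately naive: it breaks on `;`, `|`, and `&` without honoring quotes.

```python
import re
from collections import Counter
from dataclasses import dataclass
from datetime import datetime

# The seven intent classes named in this README.
INTENTS = frozenset({
    "Persistence", "Discovery", "Defense Evasion",
    "Execution", "Impact", "Other", "Harmless",
})


@dataclass(frozen=True)
class Session:
    """One honeypot session (hypothetical field names, not the real columns)."""
    commands: str          # raw shell text entered during the session
    started_at: datetime   # time the attack started
    intents: frozenset     # assumed to be a subset of INTENTS

    def __post_init__(self):
        # Reject labels outside the seven documented classes.
        unknown = self.intents - INTENTS
        if unknown:
            raise ValueError(f"unknown intents: {sorted(unknown)}")


def command_names(raw):
    """Naively split on ';', '|' and '&'; return the first word of each piece."""
    pieces = re.split(r"[;|&]+", raw)
    return [piece.split()[0] for piece in pieces if piece.strip()]


def intent_counts(sessions):
    """Count how many sessions carry each intent label."""
    return Counter(label for s in sessions for label in s.intents)
```

For example, `command_names("cd /tmp; wget http://host/x.sh && sh x.sh")` returns `['cd', 'wget', 'sh']`. A real preprocessing step would need a proper shell tokenizer to cope with quoting and redirections.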
The dataset uses 7 main intent classes:
1. **Persistence**
2. **Discovery**
3. **Defense Evasion**
4. **Execution**
5. **Impact**
6. **Other** (Miscellaneous intents)
7. **Harmless** (Non-malicious commands)

## Project Report
The project report is a comprehensive document detailing the methodologies, experiments, and findings of the SSH Shell Attacks project.
- **Format:** PDF
- **Template:** ACM format single column (acmlarge)

The report is named [SSH-Shell-Attacks-report.pdf](SSH-Shell-Attacks-report.pdf) and can be found in the root directory of the repository.
An appendix with extra plots and additional information is also in the root directory: [SSH-Shell-Attacks-appendix.pdf](SSH-Shell-Attacks-appendix.pdf). It is a PDF in the same single-column ACM template.
The LaTeX source of the report is in the [latex-report](https://github.com/ML4Net/latex-report) repository.
## Project Structure
```plaintext
SSH-Shell-Attacks/
│
├── data/ # Dataset and related resources
│ ├── raw/ # Original dataset files (e.g., ssh_attacks.parquet)
│ └── processed/ # Pre-processed and feature-engineered files
│
├── notebooks/ # Jupyter notebooks
│
├── scripts/ # Python scripts for algorithms and utilities
│
├── results/ # Outputs from the models and analysis
│ ├── figures/ # Plots and visualizations
│ ├── models/ # Saved models (e.g., .pkl, .h5)
│ └── metrics/ # Evaluation metrics and reports
│
├── README.md # High-level overview of the project
├── SSH-Shell-Attacks-report.pdf # Report of the project
├── SSH-Shell-Attacks-appendix.pdf # Appendix of the report
├── requirements.txt # Python dependencies
├── .gitignore # Ignore unnecessary files for versioning
└── LICENSE # Licensing information (optional)
```

## Tools and Technologies
- **Programming Language:** Python
- **Libraries:**
  - Data Processing: `pandas`, `numpy`, `pyarrow`
  - Visualization: `matplotlib`, `seaborn`
  - Machine Learning: `scikit-learn`
  - Clustering: `scikit-learn`, `wordcloud`
  - Language Models: `scikit-learn`, `transformers`, `torch`

## How to Run the Project
1. **Clone the Repository:**
```bash
git clone https://github.com/ML4Net/SSH-Shell-Attacks.git
cd SSH-Shell-Attacks
```

2. **Install Dependencies:**
```bash
pip install -r requirements.txt
```

3. **Execute the Notebooks:**
Open the relevant notebook for each section and follow the instructions to:

- Load the dataset.
- Perform data exploration.
- Train and evaluate machine learning models.

Notebooks:
- `section0_data_preprocessing_and_cleaning.ipynb`
- `section1_data_exploration_and_preprocessing.ipynb`
- `section2_supervised_learning_classification.ipynb`
- `section3_unsupervised_learning_clustering.ipynb`
- `section4_language_model_exploration.ipynb`

4. **Explore Scripts:**
Run modular scripts in the `scripts/` directory for specific tasks like preprocessing or model training.
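
Steps 2 and 3 can be sanity-checked with a short script before opening Jupyter. The sketch below is a convenience under stated assumptions, not part of the repository: `missing_modules` and `missing_notebooks` are hypothetical helpers, the import names (for instance `sklearn` for `scikit-learn`) are the usual ones for the libraries listed under Tools and Technologies, and the notebooks are assumed to sit directly inside `notebooks/`.

```python
from importlib.util import find_spec
from pathlib import Path

# Usual top-level import names for the libraries listed in this README.
MODULES = (
    "pandas", "numpy", "pyarrow", "matplotlib", "seaborn",
    "sklearn", "wordcloud", "transformers", "torch",
)

# Notebook file names as listed in step 3, in section order.
NOTEBOOKS = (
    "section0_data_preprocessing_and_cleaning.ipynb",
    "section1_data_exploration_and_preprocessing.ipynb",
    "section2_supervised_learning_classification.ipynb",
    "section3_unsupervised_learning_clustering.ipynb",
    "section4_language_model_exploration.ipynb",
)


def missing_modules(names=MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


def missing_notebooks(notebooks_dir="notebooks", names=NOTEBOOKS):
    """Return the listed notebooks that are absent from notebooks_dir.

    Assumes the notebooks live directly in that directory (not confirmed
    by this README).
    """
    root = Path(notebooks_dir)
    return [name for name in names if not (root / name).is_file()]
```

Called from the repository root after `pip install -r requirements.txt`, both functions should return empty lists if these assumptions hold. A non-empty result names what still needs installing or locating.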
## Authors
| Name | GitHub | LinkedIn | Email |
| ----------------- | ------ | -------- | ----- |
| Andrea Botticella | [Botti01](https://github.com/Botti01) | [LinkedIn](https://www.linkedin.com/in/andrea-botticella-353169293/) | [[email protected]](mailto:[email protected]) |
| Elia Innocenti | [eliainnocenti](https://github.com/eliainnocenti) | [LinkedIn](https://www.linkedin.com/in/eliainnocenti/) | [[email protected]](mailto:[email protected]) |
| Renato Mignone | [RenatoMignone](https://github.com/RenatoMignone) | [LinkedIn](https://www.linkedin.com/in/renato-mignone/) | [[email protected]](mailto:[email protected]) |
| Simone Romano | [sroman0](https://github.com/sroman0) | [LinkedIn](https://www.linkedin.com/in/simone-romano-383277307/) | [[email protected]](mailto:[email protected]) |

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- **Luca Vassio** ([[email protected]](mailto:[email protected])): the professor supervising our work and the primary point of reference for the project.
- **Matteo Boffa** ([[email protected]](mailto:[email protected])): the creator and organizer of this project.
- **Team Members**: Andrea Botticella, Elia Innocenti, Renato Mignone, and Simone Romano.

Please cite us if this project is copied, used for inspiration, or if any material is taken from it.