https://github.com/giorgiosld/log-anomaly-detection-via-llms

Final project for the T-725-MALV course at Reykjavik University (Fall 2024), exploring Large Language Models (LLAMA, BERT) for anomaly detection in system logs through fine-tuning and benchmarking against traditional methods.

Host: GitHub
URL: https://github.com/giorgiosld/log-anomaly-detection-via-llms
Owner: giorgiosld
License: mit
Created: 2024-10-13T17:35:20.000Z
Default Branch: main
Last Pushed: 2024-11-08T00:25:42.000Z
Last Synced: 2024-11-10T19:52:23.232Z
Topics: cybersecurity, finetuning-llms, llm, log-anomaly-detection
Language: Python
Homepage:
Size: 1.96 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Log-Anomaly-Detection-via-LLMs
This repository showcases an end-to-end workflow for anomaly detection using large language models (LLMs) such as BERT and LLAMA.
The project was developed as part of my final coursework for the T-725-MALV Natural Language Processing course,
conducted in the Fall semester of 2024 at Reykjavik University. The code was run and tested on Elja, a High Performance
Computing (HPC) environment located in Iceland.

## Project Overview
The primary goal of this project is to apply NLP techniques to the field of log anomaly detection. By leveraging modern
transformer-based models, this project focuses on detecting anomalies in system logs, which is a crucial task in fields
like cybersecurity and systems reliability. The project demonstrates how LLMs can be fine-tuned to detect
irregularities in log files, providing a powerful tool for monitoring and safeguarding complex infrastructures.

The dataset consists of logs from HDFS (the Hadoop Distributed File System), a well-known distributed file system, and contains both normal and anomalous log traces.
Using BERT and LLAMA, alongside a model-agnostic modular design, this repository serves as an exploration of advanced
anomaly detection techniques that are particularly relevant in the domains of AI, ML, and cybersecurity.
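
As a rough illustration of how raw log lines can become labelled inputs for a classifier, the sketch below groups lines by block ID and attaches a binary label to each group. It is not taken from this repository's preprocessing scripts: the `blk_<number>` identifier pattern, the sample lines, and the helper names `group_by_block` and `build_examples` are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumption: block identifiers look like "blk_" followed by an optionally
# negative integer, as in commonly distributed HDFS log datasets.
BLOCK_ID = re.compile(r"blk_-?\d+")


def group_by_block(lines):
    """Group raw log lines into one ordered sequence per block ID."""
    sessions = defaultdict(list)
    for line in lines:
        # A line may mention several blocks; attach it to each one once.
        for block_id in sorted(set(BLOCK_ID.findall(line))):
            sessions[block_id].append(line.strip())
    return dict(sessions)


def build_examples(sessions, anomalous_blocks):
    """Turn each block's sequence into a (text, label) pair; 1 means anomalous."""
    return [
        (" ".join(sequence), int(block_id in anomalous_blocks))
        for block_id, sequence in sessions.items()
    ]


# Hypothetical log lines, invented for illustration only.
sample_lines = [
    "INFO dfs.DataNode: Receiving block blk_1 src: /10.0.0.1 dest: /10.0.0.2",
    "INFO dfs.FSNamesystem: allocateBlock blk_2",
    "WARN dfs.DataNode: Exception while serving blk_1",
]

sessions = group_by_block(sample_lines)
examples = build_examples(sessions, anomalous_blocks={"blk_1"})

for text, label in examples:
    print(label, text)
```

Each resulting `(text, label)` pair could then be tokenized and passed to a sequence classifier; how the repository itself builds its inputs may differ.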

## Project Structure
The repository is organized as follows:
- `dataset/`: Contains resources related to the dataset, including scripts for downloading, analyzing, and preprocessing the dataset. It also contains the raw dataset files used for training and evaluation.
- `deployment/`: Contains the script for deploying the trained model for presentation.
- `models/`: Holds model-specific code. Currently, it includes code for the BERT model only. Each model subdirectory contains code for data loading, model initialization, and training.
- `result/`: Contains the results of the experiments, including model evaluation metrics and visualizations.
- `scripts/`: Contains main scripts used to fine-tune models. These scripts serve as entry points for training models.
- `LICENSE`: Contains license information for the project.
- `requirements.txt`: Lists the dependencies required to run the code.
- `run_fine_tune_bert.sh`: A shell script to execute the fine-tuning of the BERT model in the HPC environment. It contains the necessary SBATCH configurations to run the training process efficiently on a cluster.
- `run_fine_tune_llama.sh`: A shell script to execute the fine-tuning of the LLAMA model in the HPC environment. It contains the necessary SBATCH configurations to run the training process efficiently on a cluster.

## Motivation
This project highlights my interest in the intersection of Artificial Intelligence, Cybersecurity, and Natural Language
Processing. Given the importance of anomaly detection in ensuring the safety of modern digital infrastructure, the
project focuses on the use of state-of-the-art language models to tackle the challenge of log-based anomaly detection.
This work demonstrates how advanced NLP techniques, including the use of large language models (LLMs), can be adapted
to solve practical cybersecurity challenges in large, distributed systems.

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgements
- HPC Iceland for the computational resources that made large-scale training feasible.
