Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/giorgiosld/log-anomaly-detection-via-llms

Final project for the T-725-MALV course at Reykjavik University (Fall 2024), exploring Large Language Models (LLaMA, SpanBERT) for anomaly detection in system logs through fine-tuning and benchmarking against traditional methods.

Last synced: about 1 month ago



README


# Log-Anomaly-Detection-via-LLMs
This repository showcases an end-to-end workflow for anomaly detection in system logs using large language models (LLMs) such as BERT and LLaMA.
The project was developed as my final coursework for the T-725-MALV Natural Language Processing course,
taught in the Fall semester of 2024 at Reykjavik University. The code was run and tested on Elja, a High Performance
Computing (HPC) cluster located in Iceland.

## Project Overview
The primary goal of this project is to apply NLP techniques to log anomaly detection. Leveraging modern
transformer-based models, the project focuses on detecting anomalies in system logs, a crucial task in fields
such as cybersecurity and systems reliability. It demonstrates how LLMs can be fine-tuned to detect
irregularities in log files, providing a powerful tool for monitoring and safeguarding complex infrastructures.
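
As a rough illustration of this setup, the sketch below fine-tunes BERT as a binary classifier over labeled log lines with the Hugging Face `transformers` library. The model name, hyperparameters, and toy examples are placeholders, not the exact configuration used in this project.

```python
# Minimal sketch: fine-tune BERT to classify log lines as normal (0) or anomalous (1).
# Model name, hyperparameters, and the toy data below are illustrative assumptions.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["Received block blk_123 of size 67108864 from /10.0.0.1",
         "PendingReplicationMonitor timed out block blk_456"]
labels = [0, 1]  # 0 = normal, 1 = anomalous

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encodings = tokenizer(texts, truncation=True, padding=True, max_length=128)

class LogDataset(torch.utils.data.Dataset):
    """Wraps tokenized log lines and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-log-anomaly",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=LogDataset(encodings, labels),
)
trainer.train()
```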

The dataset comes from HDFS, the Hadoop Distributed File System, and contains both normal and anomalous log traces.
Using BERT and LLaMA alongside a model-agnostic, modular design, this repository explores advanced
anomaly detection techniques that are particularly relevant to AI, ML, and cybersecurity.
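
For context, the snippet below sketches one way the raw HDFS logs could be grouped into per-block sequences and paired with labels. The file names (`HDFS.log`, `anomaly_label.csv`) and column names follow the public LogHub release of this dataset and may not match the files shipped under `dataset/`.

```python
# Rough sketch: group HDFS log lines by block ID and attach anomaly labels.
# File and column names are assumptions based on the public LogHub release.
import csv
import re
from collections import defaultdict

BLOCK_ID = re.compile(r"blk_-?\d+")

# Map each block ID to the ordered list of raw log lines mentioning it.
sequences = defaultdict(list)
with open("HDFS.log") as f:
    for line in f:
        for block in BLOCK_ID.findall(line):
            sequences[block].append(line.strip())

# anomaly_label.csv: one row per block, labeled "Normal" or "Anomaly".
labels = {}
with open("anomaly_label.csv") as f:
    for row in csv.DictReader(f):
        labels[row["BlockId"]] = 1 if row["Label"] == "Anomaly" else 0

# Join each block's lines into a single text sample for the tokenizer.
samples = [(" ".join(lines), labels.get(block, 0))
           for block, lines in sequences.items()]
```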

## Project Structure
The repository is organized as follows:
- `dataset/`: Contains resources related to the dataset, including scripts for downloading, analyzing, and preprocessing the dataset. It also contains the raw dataset files used for training and evaluation.
- `models/`: Holds model-specific code. Currently, it includes a subdirectory for the BERT model. Each subdirectory contains code for data loading, model initialization, and training.
- `scripts/`: Contains main scripts used to fine-tune models. These scripts serve as entry points for training models.
- `LICENSE`: Contains license information for the project.
- `requirements.txt`: Lists the dependencies required to run the code.
- `run_fine_tune_bert.sh`: A shell script to execute the fine-tuning of the BERT model in the HPC environment. It contains the necessary SBATCH configurations to run the training process efficiently on a cluster.

## Motivation
This project highlights my interest in the intersection of Artificial Intelligence, Cybersecurity, and Natural Language
Processing. Given the importance of anomaly detection in ensuring the safety of modern digital infrastructure, the
project focuses on the use of state-of-the-art language models to tackle the challenge of log-based anomaly detection.
This work demonstrates how advanced NLP techniques, including the use of large language models (LLMs), can be adapted
to solve practical cybersecurity challenges in large, distributed systems.

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgements
- Professor Stefán Ólafsson for academic guidance throughout the project.
- HPC Iceland for the computational resources that made large-scale training feasible.