
# Fine-Tuning BERT with LoRA for Text Classification

This project focuses on identifying phishing URLs using a fine-tuned BERT model with **Low-Rank Adaptation (LoRA)**. The model is trained on a dataset of URLs labeled as either phishing or legitimate. The project is implemented in Python using the Hugging Face `transformers` library and is designed to run in Google Colab.

## Table of Contents
- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
- [Training the Model](#training-the-model)
- [Inference](#inference)
- [Saving and Loading the Model](#saving-and-loading-the-model)
- [Results](#results)

## Overview
Phishing attacks are a significant threat to online security. This project leverages the power of **BERT**, a state-of-the-art transformer model, fine-tuned with **LoRA** for efficient adaptation to the task of phishing URL identification. The model is trained on a labeled dataset of URLs and can classify new URLs as either phishing or legitimate.

*Benefits of LoRA: LoRA reduces the number of trainable parameters, making fine-tuning faster and more memory-efficient while maintaining performance.*
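
As an illustration, a minimal LoRA setup with the `peft` library might look like the sketch below. The rank, scaling factor, and target modules shown here are assumptions for illustration, not the exact values used in this project.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

# Load the base BERT model with a binary classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    id2label={0: "Legitimate", 1: "Phishing"},
    label2id={"Legitimate": 0, "Phishing": 1},
)

# Example LoRA configuration (rank, alpha, and target modules are illustrative)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=32,                       # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["query", "value"],   # attention projections to adapt
)

# Wrap the base model so only the LoRA parameters are trainable
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```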

## Installation
To run this project, you need to install the following dependencies:

```bash
pip install transformers datasets torch peft evaluate numpy huggingface_hub
```

## Usage

### Training the Model
1. **Prepare the Dataset**: Ensure your dataset is in the correct format (e.g., a CSV file with `url` and `label` columns).
2. **Tokenize the Data**: Use the Hugging Face tokenizer to preprocess the URLs (a short tokenization sketch follows this list).
3. **Fine-Tune the Model**: Run the training script to fine-tune the BERT model with LoRA.
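
A tokenization sketch, assuming the dataset is a CSV file with `url` and `label` columns as described above (file names and `max_length` are illustrative):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the labeled URL dataset (file paths assumed; adjust to your data)
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    # URLs are short, so a modest max_length keeps training fast
    return tokenizer(examples["url"], truncation=True, max_length=128)

# Tokenize the whole dataset in batches
tokenized_data = dataset.map(tokenize_function, batched=True)
```

With the tokenized dataset in hand, fine-tuning proceeds with the Hugging Face `Trainer`: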

```python
from transformers import Trainer, TrainingArguments

# Example hyperparameters (illustrative values; tune for your setup)
lr = 2e-4
batch_size = 8
num_epochs = 10

# Define training arguments
training_args = TrainingArguments(
    output_dir="bert-phishing-classifier",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
)

# Initialize the Trainer with the LoRA-wrapped model and tokenized dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()
```
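
The `data_collator` and `compute_metrics` referenced above are not shown in the snippet. A plausible sketch using dynamic padding and the `evaluate` library follows; the choice of accuracy and ROC AUC is an assumption based on the metrics reported in the Results section.

```python
import numpy as np
import evaluate
from transformers import DataCollatorWithPadding

# Pad each batch dynamically to the longest sequence in the batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

accuracy = evaluate.load("accuracy")
roc_auc = evaluate.load("roc_auc")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Numerically stable softmax to get the probability of the phishing class
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=predictions, references=labels)["accuracy"],
        "auc": roc_auc.compute(prediction_scores=probs[:, 1], references=labels)["roc_auc"],
    }
```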

### Inference
To classify a new URL, load the fine-tuned model and tokenizer, then pass the URL through the model:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("/path/to/saved/model")
tokenizer = AutoTokenizer.from_pretrained("/path/to/saved/tokenizer")

# Example URL
url = "http://example.com"

# Tokenize the URL
inputs = tokenizer(url, return_tensors="pt", truncation=True, padding=True)

# Make prediction
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities
probs = torch.softmax(logits, dim=-1)

# Get the predicted class
predicted_class = torch.argmax(probs, dim=-1).item()

# Map the predicted class to a label
label = "Phishing" if predicted_class == 1 else "Legitimate"
print(f"The URL is classified as: {label}")
```

## Saving and Loading the Model
### Save the Model and Tokenizer
After training, save the model and tokenizer to a directory:

```python
model.save_pretrained("/path/to/save/model")
tokenizer.save_pretrained("/path/to/save/tokenizer")
```
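
If the trained model is still wrapped by `peft`, you may prefer to merge the LoRA adapter weights into the base model before saving, so it can later be loaded as a plain BERT classifier. A sketch, assuming `model` is a `PeftModel`:

```python
# Fold the LoRA adapter weights into the base model's weights, then save
merged_model = model.merge_and_unload()
merged_model.save_pretrained("/path/to/save/model")
```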

### Load the Model and Tokenizer
To load the model and tokenizer for inference:

```python
model = AutoModelForSequenceClassification.from_pretrained("/path/to/saved/model")
tokenizer = AutoTokenizer.from_pretrained("/path/to/saved/tokenizer")
```

## Results

The model achieves the following performance on the validation dataset:
- **Accuracy**: 0.884
- **AUC**: 0.946