AI Detector


Detect AI generated coding answers











## Table of Contents


1. Problem Statement
2. Solution 1: Fine-tuning Sentence Transformers
3. Model Deployment
4. Solution 2: Large Language Model
5. Build from Source
6. Conclusion and Future Work
7. Miscellaneous
8. Contact

# Problem Statement

The objective of this project is to develop a Machine Learning model that detects potential AI use by comparing candidate coding answers to responses generated by AI models (GPT-4, GPT-4 Turbo, and GPT-3.5 Turbo). The model predicts an AI-detected score, a floating-point number ranging from `0` (no AI detected) to `1` (heavy AI use detected). Since the prediction is a continuous value in a range, this is a **Regression** problem, which can be solved in multiple ways. In this project, I demonstrate **two** solutions:

- **Solution 1:** Fine-tuning multiple Sentence Transformers on the given dataset. The best performing fine-tuned model was deployed to [HuggingFace](https://huggingface.co/spaces/zzarif/AI-Detector) and integrated with a [Flask Webapp](https://ai-code-detector.vercel.app/).
- **Solution 2:** Utilizing OpenAI's `GPT-4o` LLM to predict the similarity score. Details are documented in [this](#solution-2-large-language-model) section.

Please watch the [YouTube](https://youtu.be/8d27nu480qQ) video presentation of this project, or follow this README file for detailed documentation.

# Solution 1: Fine-tuning Sentence Transformers

## Data Understanding

The original dataset contains examples of coding questions, candidate answers, AI-generated answers, and the corresponding AI-detected scores to train and test the model. The goal is for the model to predict AI-detected scores for new, unseen data. The dataset can be found in the [`data`](/data/) directory of this project. The dataset has one directory and one file as follows:

1. [`dataset-source-codes`](/data/dataset-source-codes/) directory: This directory has 63 subdirectories (`source_code_000` ... `source_code_062`). Each of these subdirectories represents a coding question and its respective answers completed by both candidate and AI. A subdirectory, say, `source_code_000` has the following 8 files:

- `source_code_000.json`: Contains a coding question in a specific programming language (Java in this case) and metadata related to that question
- `source_code_000.jav`: Contains candidate's answer code snippet written in Java
- `source_code_000_gpt-3.5-turbo_00.jav` and `...01.jav`: These two files have two samples of the respective coding answer completed by GPT-3.5-Turbo
- `source_code_000_gpt-4_00.jav` and `...01.jav`: These two files have two samples of the respective coding answer completed by GPT-4
- `source_code_000_gpt-4-turbo_00.jav` and `...01.jav`: These two files have two samples of the respective coding answer completed by GPT-4-Turbo

2. [`CodeAid Source Codes Labeling.xlsx`](/data/CodeAid%20Source%20Codes%20Labeling.xlsx): This file maps a candidate's answers to its respective AI-generated answers and assigns a plagiarism score. It has 3 columns and 378 rows as follows:




| | coding_problem_id | llm_answer_id | plagiarism_score |
|---|---|---|---|
| 1 | source_code_000 | gpt-3.5-turbo_00 | 0 |
| 2 | source_code_000 | gpt-3.5-turbo_01 | 0 |
| 3 | source_code_000 | gpt-4_00 | 0 |
| ... | ... | ... | ... |
| 378 | source_code_062 | gpt-4-turbo_01 | 0.30 |

## Data Preprocessing

We need to preprocess the data to build a consistent dataset structure for the model training and validation. Preprocessing involves the following steps:

1. Load all 63 subdirectories containing the coding questions and answers from the `dataset-source-codes` directory
2. Load `CodeAid Source Codes Labeling.xlsx` file with plagiarism scores
3. Create a new tabular dataset where each row has a coding question, the respective candidate answer, an AI-generated answer, and the associated plagiarism score. Each candidate answer is paired with 6 AI-generated answers (2 samples from each of 3 LLM variants), so this step creates 63 * 2 * 3 = 378 rows
4. Add two new columns that combine the coding question with the candidate answer and with the AI-generated answer, respectively (this is necessary for feature extraction)

The preprocessed data with 6 columns and 378 rows is as follows:




| | question | candidate_answer | ai_answer | similarity_score | candidate_combined | ai_combined |
|---|---|---|---|---|---|---|
| 1 | Write a program to find... | `fun findLargestElement...` | `public class LargestEle...` | 0.0 | Question: Write a program... | Question: Write a program... |
| 2 | Write a program to find... | `fun findLargestElement...` | `public class Main {\n ...` | 0.0 | Question: Write a program... | Question: Write a program... |
| 3 | Write a program to find... | `fun findLargestElement...` | `public class Main {\n ...` | 0.0 | Question: Write a program... | Question: Write a program... |
| ... | ... | ... | ... | ... | ... | ... |
| 378 | Create a PHP script that will... | `<?php\nfunction getTop...` | `<?php\n\n// Function...` | 0.3 | Question: Create a PHP script... | Question: Create a PHP script... |

The detailed preprocessing documentation can be found in [`preprocessing.ipynb`](/notebooks/preprocessing.ipynb) Jupyter Notebook or [`preprocessing.py`](/scripts/preprocessing.py) file. The preprocessed data is saved as [`preprocessed_data.csv`](/data/preprocessed_data.csv) file.
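
As a rough illustration of the pairing logic in step 3, the sketch below enumerates the 378 (question, AI answer) pairs and builds a combined text field. The helper names `expected_pairs` and `combine`, and the `Answer:` part of the template, are assumptions made for this example; the README only shows that the combined fields start with `Question: `, and the actual logic lives in `preprocessing.py`.

```python
from itertools import product

# Model variants and sample suffixes as they appear in the dataset file names.
MODELS = ["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"]
SAMPLES = ["00", "01"]


def expected_pairs(num_questions=63):
    """Enumerate (coding_problem_id, llm_answer_id) for every labeled pair."""
    pairs = []
    for index in range(num_questions):
        problem_id = f"source_code_{index:03d}"
        for model, sample in product(MODELS, SAMPLES):
            pairs.append((problem_id, f"{model}_{sample}"))
    return pairs


def combine(question, answer):
    """Join a question and an answer into one text (hypothetical template)."""
    return f"Question: {question}\nAnswer: {answer}"


pairs = expected_pairs()
print(len(pairs))            # 63 questions * 3 models * 2 samples
print(pairs[0], pairs[-1])   # first and last labeled pair
```

The enumeration order (model first, then sample) is chosen to match the first and last rows of the labeling sheet shown earlier.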

## Model Training

Five different Sentence Transformers were selected for fine-tuning based on their average sentence-encoding performance as reported on [sbert.net](https://www.sbert.net). Feature extraction happens during model training. It is centered around the specified `SentenceTransformer` model, which encodes textual data into dense numerical vectors (embeddings). The following is a detailed explanation of how feature extraction is done:

1. **Input Data**:
- The input to the model consists of two columns: `candidate_combined` (the coding question combined with the candidate's answer) and `ai_combined` (the coding question combined with the AI-generated answer). These represent the two pieces of text whose similarity will be compared.
- The `similarity_score` is the label, representing how similar the two pieces of text are, which the model learns to predict during training.

2. **Creating Examples for Training**:
- The line `InputExample(texts=[row['candidate_combined'], row['ai_combined']], label=float(row['similarity_score']))` creates training examples for the model.
- `texts` is a pair of texts that will be encoded into numerical vectors (embeddings) by the `SentenceTransformer` model. These embeddings represent the features extracted from the text data.
- These `InputExample`s are then passed into a `DataLoader`, which prepares batches of data for training.

3. **SentenceTransformer Model**:
- The core feature extraction is performed by the `SentenceTransformer` model. This model is pre-trained on large corpora and can convert input texts into high-dimensional vectors (embeddings).
- When the training data is passed through the model, it encodes each text (from both `candidate_combined` and `ai_combined`) into a fixed-size embedding. These embeddings are vector representations of the text that capture semantic meaning, making them suitable for downstream tasks like similarity measurement.

4. **Cosine Similarity Loss**:
- The `CosineSimilarityLoss` is used as the loss function for training. It computes the cosine similarity between the two embeddings of a pair and minimizes the mean squared error between that value and the labeled `similarity_score`. As a result, embeddings of pairs with a high score are pulled closer together, and embeddings of pairs with a low score are pushed apart.
- This process adjusts the model's weights to better encode the features that represent textual similarity.

5. **Validation and Evaluation**:
- For validation, the code prepares examples similarly, but these are used for evaluation instead of training.
- The `EmbeddingSimilarityEvaluator` computes the similarity between the embeddings of `candidate_combined` and `ai_combined` using their cosine similarity, and compares it with the actual `similarity_score`.

6. **How Features Are Encoded**:
- Each piece of text (both `candidate_combined` and `ai_combined`) is passed through the `SentenceTransformer` model.
- The model tokenizes the text, then converts it into a dense embedding vector of fixed length. These embeddings encode semantic information about the text.
- The embeddings are the "features" extracted from the text, which are then used to compute similarity.

The **features** in this code are the dense embeddings extracted by the `SentenceTransformer` model. These embeddings are used to train the model to learn similarities between pairs of text using the cosine similarity loss function.
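
To make the loss in step 4 concrete, here is a minimal, dependency-free sketch of the quantity being minimized: the mean squared error between the cosine similarity of each embedding pair and its label. The function names and the toy vectors are illustrative assumptions; the real embeddings come from the `SentenceTransformer` model and the real loss is computed by the library on tensors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cosine similarity and the target score."""
    errors = [
        (cosine_similarity(u, v) - label) ** 2
        for (u, v), label in zip(pairs, labels)
    ]
    return sum(errors) / len(errors)


# Toy "embeddings": one identical pair (label 1.0), one orthogonal pair (label 0.0).
toy_pairs = [([1.0, 2.0], [1.0, 2.0]), ([1.0, 0.0], [0.0, 1.0])]
toy_labels = [1.0, 0.0]
print(cosine_similarity_loss(toy_pairs, toy_labels))  # close to zero
```

When the labels are swapped, the loss becomes large, which is the signal that drives the weight updates during fine-tuning.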

All 5 models were trained for `5` epochs with batch sizes varying from `4` to `16`. The detailed model training documentation can be found in [`train.ipynb`](/notebooks/train.ipynb) Jupyter Notebook or [`train.py`](/scripts/train.py) file. The fine-tuned models are saved in the [`models`](/models/) directory.
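
The training steps described above could be wired together roughly as follows. This is a sketch, not the contents of `train.py`: it assumes the classic `model.fit` API of the `sentence-transformers` package, and the helper names `build_pairs` and `fine_tune`, the default arguments, and the output path are hypothetical. The heavy imports are deferred into `fine_tune` so the pairing helper can be used on its own.

```python
def build_pairs(rows):
    """Turn preprocessed rows into (candidate_text, ai_text, score) triples."""
    return [
        (row["candidate_combined"], row["ai_combined"], float(row["similarity_score"]))
        for row in rows
    ]


def fine_tune(rows, model_name="all-MiniLM-L6-v2", epochs=5, batch_size=16,
              output_path="models/example-finetuned"):
    """Fine-tune a Sentence Transformer with CosineSimilarityLoss (sketch)."""
    # Deferred imports: these packages are only needed when training runs.
    from sentence_transformers import InputExample, SentenceTransformer, losses
    from torch.utils.data import DataLoader

    examples = [
        InputExample(texts=[candidate, ai], label=score)
        for candidate, ai, score in build_pairs(rows)
    ]
    model = SentenceTransformer(model_name)
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs)
    model.save(output_path)  # hypothetical location
    return model


# Example input shaped like one row of preprocessed_data.csv.
sample_rows = [{
    "candidate_combined": "Question: ...",
    "ai_combined": "Question: ...",
    "similarity_score": "0.3",
}]
print(build_pairs(sample_rows))
```

Calling `fine_tune(rows)` downloads the base model, so it needs network access and the `sentence-transformers` and `torch` packages installed.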

## Model Evaluation

Following are the 8 different metrics employed for evaluating the 5 fine-tuned models:



| Metric | all-mpnet-base-v2 | all-distilroberta-v1 | all-MiniLM-L12-v2 | all-MiniLM-L6-v2 | multi-qa-mpnet-base-dot-v1 |
|---|---|---|---|---|---|
| Cosine Spearman | 0.9508 | 0.9519 | 0.8966 | 0.9 | 0.9672 |
| Manhattan Spearman | 0.95 | 0.9477 | 0.8925 | 0.8931 | 0.9603 |
| Euclidean Spearman | 0.9508 | 0.9519 | 0.8966 | 0.9 | 0.9551 |
| Dot Product Spearman | 0.9508 | 0.9519 | 0.8966 | 0.9 | 0.9652 |
| Mean Squared Error | 0.0063 | 0.0056 | 0.0257 | 0.0165 | 0.0086 |
| Root Mean Squared Error | 0.0794 | 0.0749 | 0.1602 | 0.1284 | 0.0925 |
| Mean Absolute Error | 0.0583 | 0.0534 | 0.0954 | 0.0880 | 0.0702 |
| R-squared Score | 0.9119 | 0.9215 | 0.6412 | 0.7696 | 0.8805 |

From the above metrics, we can derive several insights about the performance of each model across the different evaluation criteria. Let us go through the models one by one:

1. **all-mpnet-base-v2**:
- Generally performs well in all metrics, especially with Spearman correlation metrics (Cosine, Manhattan, Euclidean, Dot Product), showing consistency across different similarity measures.
- Has the second-lowest Mean Squared Error (0.0063), indicating good predictive performance.
- RMSE and MAE are low compared to most other models, and the R-squared score (0.9119) reflects a strong goodness-of-fit.

2. **all-distilroberta-v1**:
- Slightly outperforms all-mpnet-base-v2 in most Spearman correlations, showing excellent alignment with ground-truth similarities.
- Exhibits the lowest Mean Squared Error (0.0056) and RMSE (0.0749), suggesting this model makes fewer errors in prediction.
- It also has the highest R-squared score (0.9215), indicating it captures the most variance and performs very well across the board.

3. **all-MiniLM-L12-v2**:
- Performs relatively poorly in comparison to the other models, with lower Spearman correlations (around 0.89–0.90) and much higher error metrics.
- Has the highest Mean Squared Error (0.0257), RMSE (0.1602), and MAE (0.0954), showing that this model's predictions are less accurate.
- Its R-squared score is the lowest (0.6412), implying a weaker fit to the data.

4. **all-MiniLM-L6-v2**:
- Performs similarly to MiniLM-L12, but with slightly better error metrics, though still worse than most other models.
- Has moderate Spearman correlations (around 0.89–0.9), but significantly higher errors (MSE = 0.0165, RMSE = 0.1284, MAE = 0.0880) than top-performing models.
- Its R-squared score (0.7696) is better than L12 but still indicates room for improvement.

5. **multi-qa-mpnet-base-dot-v1**:
- This model shows the highest performance in Spearman correlation metrics, particularly in Cosine (0.9672), Dot Product (0.9652), and Manhattan (0.9603) similarity, suggesting it captures similarity relationships very well.
- Though it has higher errors (MSE = 0.0086, RMSE = 0.0925) than distilroberta, they are still reasonable.
- With an R-squared score of 0.8805, it demonstrates strong predictive power and is one of the top-performing models overall.

### Summary:
- **all-distilroberta-v1** consistently performs best in terms of error metrics (MSE, RMSE, MAE) and variance explained (R-squared).
- **multi-qa-mpnet-base-dot-v1** excels in similarity-based evaluations (Spearman correlations), making it highly effective for tasks requiring strong semantic understanding.
- **all-mpnet-base-v2** offers a balanced performance across both error and similarity metrics.
- **all-MiniLM-L12-v2 and L6-v2** are the weaker models in this comparison, especially in terms of error metrics and R-squared scores.

The detailed evaluation process can be found in [`evaluation.ipynb`](/notebooks/evaluation.ipynb) Jupyter Notebook or [`evaluation.py`](/scripts/evaluation.py) file.
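
For readers who want to see how the error metrics and the Spearman correlation in the table relate to raw predictions, here is a small standard-library sketch. It is not the project's `evaluation.py`; the function names and the three toy values are made up for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by y_pred."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


labels = [0.0, 0.5, 1.0]        # toy ground-truth scores
predictions = [0.1, 0.4, 0.9]   # toy predicted similarities
print(mse(labels, predictions), math.sqrt(mse(labels, predictions)))
print(mae(labels, predictions), r_squared(labels, predictions))
print(spearman(labels, predictions))
```

RMSE is simply the square root of MSE, which is why the two rows in the table rank the models identically. Spearman only looks at ordering, so a model can rank pairs very well and still have a larger absolute error.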

# Model Deployment

## Model Export and Compression

The model evaluation shows that the fine-tuned `all-distilroberta-v1` model performs best overall, so I decided to deploy this model.

For deployment, we convert this model to **ONNX (Open Neural Network Exchange)** format. ONNX models can be executed by optimized inference runtimes (such as ONNX Runtime) on a variety of hardware, including CPUs and GPUs, which makes the format well suited for real-world deployment. Exporting models to ONNX simplifies deployment and helps models run efficiently across different systems. It also enables cross-platform compatibility, allowing models trained in one framework to be deployed in another.

We also quantize the ONNX model before deploying. Quantization reduces the model size, speeds up inference, and lowers memory usage, usually with minimal loss of accuracy. This combination of ONNX and quantization is particularly valuable for deploying models on resource-constrained platforms.
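
To give an intuition for why quantization shrinks a model with little accuracy loss, the sketch below applies symmetric 8-bit quantization to a handful of made-up weights in plain Python. This only illustrates the general idea; it is not what `export.py` does, and the function names and values are assumptions.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]   # toy weights, not taken from the model
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half of the scale, which is why the accuracy impact is typically small.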

The detailed model export and compression documentation can be found in [`export.ipynb`](/notebooks/export.ipynb) Jupyter Notebook or [`export.py`](/scripts/export.py) file. The exported and quantized model is saved at [`models`](/models/) directory.

## HuggingFace Deployment

The quantized `all-distilroberta-v1` model was deployed to [HuggingFace](https://huggingface.co/spaces/zzarif/AI-Detector) Spaces with Gradio. The following is a snapshot of the HuggingFace app.

![HF deployment](/deployment/huggingface/hf_deployment.png)

The HuggingFace deployment source code can be found in the [`huggingface`](/deployment/huggingface/) directory of this repository.

## Flask Web Deployment

A custom [Web Application](https://ai-code-detector.vercel.app/) was developed with Flask to demonstrate the AI detection model's capability. It uses the HuggingFace Spaces API in the backend. Following are some snapshots of the Flask webapp:

![Flask Web App](/deployment/flask_app/ai_detector_home.png)

![Flask Web App](/deployment/flask_app/ai_detector_result.png)

The Flask web app deployment source code can be found in the [`flask_app`](/deployment/flask_app/) directory of this repository.

# Solution 2: Large Language Model

This method uses the Natural Language Processing (NLP) capabilities of OpenAI's `GPT-4o` to understand the context of the coding question, the candidate's answer, and the AI-generated answer. It performs a semantic comparison to predict how closely the two answers match based on the meaning and structure of the code. Here's how it works:

## Input Handling

The code loads three inputs from text files:

- [`question.txt`](/llm/question.txt) (the coding question).
- [`candidate_answer.txt`](/llm/candidate_answer.txt) (the human-written answer).
- [`ai_answer.txt`](/llm/ai_answer.txt) (the AI-generated answer).

These files contain the necessary test text inputs for the evaluation. Modify the contents of these files for testing on your own data.

## System and Human Prompts

- The system prompt is used to instruct `GPT-4o` on its role as a "code similarity evaluator." It defines the task as comparing two answers and returning a similarity score as a floating point number between 0 and 1.
- The human prompt consists of the actual coding question, candidate's answer, and AI-generated answer. `GPT-4o` uses these inputs to assess the similarity.

## `GPT-4o` Prediction

- The text inputs are processed using **LangChain**'s `RunnableSequence`, combining the system prompt and user inputs.
- `GPT-4o` then evaluates how similar the candidate’s answer is to the AI-generated answer and produces a floating-point similarity score between 0 (completely different) and 1 (exact match).

## Model Output

- The `GPT-4o` model's output, which is an `AIMessage` object, contains the similarity score.
- The script extracts the score from the response and prints it as the final result.
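
The prompt assembly and score extraction described above can be sketched without calling any API. The prompt wording, the function names, and the clamping step below are assumptions for illustration; the actual prompts and LangChain chain are in the `llm` directory.

```python
import re

# Hypothetical system prompt; the project's real wording may differ.
SYSTEM_PROMPT = (
    "You are a code similarity evaluator. Compare the candidate's answer with "
    "the AI-generated answer and reply with a single floating point number "
    "between 0 and 1."
)


def build_messages(question, candidate_answer, ai_answer):
    """Return (role, content) pairs in the style LangChain chat prompts accept."""
    human_prompt = (
        f"Question:\n{question}\n\n"
        f"Candidate answer:\n{candidate_answer}\n\n"
        f"AI-generated answer:\n{ai_answer}"
    )
    return [("system", SYSTEM_PROMPT), ("human", human_prompt)]


def parse_score(response_text):
    """Extract the first number in the reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", response_text)
    if match is None:
        raise ValueError(f"No score found in: {response_text!r}")
    return min(1.0, max(0.0, float(match.group())))


messages = build_messages("Reverse a string.", "s[::-1]", "''.join(reversed(s))")
print(messages[0][0], messages[1][0])
print(parse_score("0.85"))
```

Parsing defensively is worthwhile because an LLM may wrap the number in extra words even when instructed to return only a score.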

The detailed source code can be found in the [`llm`](/llm/) directory of this repository.

# Build from Source

1. Clone the repo

```bash
git clone https://github.com/zzarif/AI-Detector.git
cd AI-Detector/
```

2. Initialize and activate virtual environment

```bash
virtualenv venv
# On Windows (Git Bash). On Linux/macOS, use: source venv/bin/activate
source venv/Scripts/activate
```

3. Install dependencies

```bash
pip install -r requirements.txt
```

_Note: Select virtual environment interpreter from_ `Ctrl`+`Shift`+`P`

4. Preprocess the Data

Run all the cells in [`preprocessing.ipynb`](/notebooks/preprocessing.ipynb) Jupyter Notebook or run the following script:

```bash
python scripts/preprocessing.py
```

5. Train the model

Run all the cells in [`train.ipynb`](/notebooks/train.ipynb) Jupyter Notebook or run the following script (with specified model and hyperparameters):

```bash
python scripts/train.py --model all-MiniLM-L6-v2 --epochs 5 --batch_size 16
```

6. Evaluate the model

Run all the cells in [`evaluation.ipynb`](/notebooks/evaluation.ipynb) Jupyter Notebook or run the following script (with the specified fine-tuned model):

```bash
python scripts/evaluation.py --ft_model all-MiniLM-L6-v2
```

7. Perform model inference

Run all the cells in [`inference.ipynb`](/notebooks/inference.ipynb) Jupyter Notebook or run the following script (with the specified fine-tuned model):

```bash
python scripts/inference.py --ft_model all-MiniLM-L6-v2
```

8. Export model for deployment

Run all the cells in [`export.ipynb`](/notebooks/export.ipynb) Jupyter Notebook or run the following script (with the specified fine-tuned model):

```bash
python scripts/export.py --ft_model all-MiniLM-L6-v2
```

9. Predict similarity with LLM

Create a `.env` file in the [`llm`](/llm/) directory and insert the following line:

```bash
OPENAI_API_KEY=
```

_Note: Add your own API key from the [OpenAI API Keys](https://platform.openai.com/settings/profile?tab=api-keys) page after the `=` sign._

Then run the following script:

```bash
python llm/main.py
```

_Note: Modify the contents of the `.txt` files to test the model's capability on your own data._

# Conclusion and Future Work

In this project, I tackled a similarity-prediction **Regression** task: developing a Machine Learning model that detects potential AI use by comparing candidate coding answers to responses generated by AI models (GPT-4, GPT-4 Turbo, and GPT-3.5 Turbo). The model predicts an AI-detected score, a floating-point number ranging from `0` (no AI detected) to `1` (heavy AI use detected).

I presented **two** different solutions for the task. The [first solution](#solution-1-fine-tuning-sentence-transformers) was fine-tuning multiple Sentence Transformers on the provided dataset. The [model evaluation](#model-evaluation) showed that all of the models performed reasonably well on the validation dataset, with Cosine Spearman correlations of about 0.90 or higher. Since the fine-tuned `all-distilroberta-v1` had the best overall performance, it was exported and quantized for deployment. The final quantized model was deployed to [HuggingFace](#huggingface-deployment) and integrated with a custom [Flask Web App](#flask-web-deployment).

In the [second solution](#solution-2-large-language-model), I used the Natural Language Processing (NLP) capabilities of OpenAI's `GPT-4o` to understand the context of the coding question, the candidate's answer, and the AI-generated answer. It performs a semantic comparison to predict how closely the answers match based on the meaning and structure of the code.

Due to time constraints, I couldn't experiment further. For instance, I deployed the fine-tuned `all-distilroberta-v1` since it had the best overall performance, but `all-mpnet-base-v2` and `multi-qa-mpnet-base-dot-v1` were fairly good candidates for deployment as well. Exporting, quantizing, and deploying these models is left as future work. More advanced techniques could also be applied. For instance, a separate `SimilarityPredictor` could be built on top of the SBERT model to learn the relationship between a candidate's coding answer and its similarity score. This could allow the `SimilarityPredictor` to predict the similarity score directly from the candidate's coding answer, without needing an equivalent AI-generated answer as a reference. This is also left as future work.

Finally, I used a commercial LLM (OpenAI's `GPT-4o`) for the second solution. Due to time constraints, I couldn't experiment with open-source options like `Llama3`, `Llama2`, `Mistral3`, etc.

# Miscellaneous

The [`utils`](/utils/) directory contains some helper scripts. For example, the [`file_extension_resolver.py`](/utils/file_extension_resolver.py) script parses the original dataset and creates the programming-language-to-extension mapping dictionary used in the [`preprocessing.py`](/scripts/preprocessing.py) file.

# Contact

[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?logo=linkedin&logoColor=white)](https://www.linkedin.com/in/zibran-zarif-amio-b82717263/) [![Mail](https://img.shields.io/badge/Gmail-EA4335?logo=gmail&logoColor=fff)](mailto:zibran.zarif.amio@gmail.com)

Thank you so much for your interest. Would love your valuable feedback!