https://github.com/robcyberlab/ngram-similarity-engine

🤖Ngram Similarity Engine📚

code-analysis data-filtering data-science database-management feature-extraction jaccard-similarity machine-learning ngrams plagiarism-detection similarity-analysis sqlite

Host: GitHub
URL: https://github.com/robcyberlab/ngram-similarity-engine
Owner: RobCyberLab
Created: 2024-11-16T10:36:31.000Z (8 months ago)
Default Branch: main
Last Pushed: 2024-11-16T20:36:45.000Z (8 months ago)
Last Synced: 2025-01-24T16:12:10.544Z (5 months ago)
Topics: code-analysis, data-filtering, data-science, database-management, feature-extraction, jaccard-similarity, machine-learning, ngrams, plagiarism-detection, similarity-analysis, sqlite
Language: Python
Homepage:
Size: 3.62 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# 🤖Ngram-Similarity-Engine📚

This project uses n-grams extracted from a collection of programs to build a feature database and to measure how similar those programs are.

Note: Due to privacy policies, I am not allowed to post the dataset publicly.

---

## Table of Contents📑
1. [Introduction](#introduction)
2. [Building the SQLite Database](#building-the-sqlite-database)
3. [Filtering Frequent N-Grams](#filtering-frequent-n-grams)
4. [Implemented Features](#implemented-features)
5. [Similarity Calculation](#similarity-calculation)
6. [Analysis and Conclusions](#analysis-and-conclusions)

---

## Introduction📘

This project creates and analyzes SQLite databases that store n-grams extracted from student files. The goal is to store, filter, and compare these n-grams in order to detect shared patterns and relationships between programs.

---

## Building the SQLite Database🛠️

1. **SQLite Database `raw.db`**
   Contains a `Homeworks` table with the following structure:
   - `Hash` - the file hash (MD5, SHA-1, or SHA-256)
   - `Assign` - the assignment number
   - `Student` - the student's identifier
   - `Ngrams` - a blob containing a sorted list of extracted n-grams. Each n-gram is represented as an unsigned 32-bit integer.
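The table layout above could be created and filled along these lines. This is a minimal sketch, not the project's actual code: the column types, the choice of SHA-256, the little-endian byte order, the deduplication of n-grams, and the helper names `pack_ngrams`, `unpack_ngrams`, `create_raw_db`, and `add_submission` are all assumptions made for illustration.

```python
import hashlib
import sqlite3
import struct


def pack_ngrams(ngrams):
    """Pack n-grams into a blob of sorted, unique uint32 values.

    Assumption: little-endian byte order and duplicate removal.
    """
    values = sorted(set(ngrams))
    return struct.pack(f"<{len(values)}I", *values)


def unpack_ngrams(blob):
    """Inverse of pack_ngrams: decode a blob back into a list of ints."""
    return list(struct.unpack(f"<{len(blob) // 4}I", blob))


def create_raw_db(path):
    """Create the Homeworks table if it is missing and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS Homeworks (
               Hash    TEXT PRIMARY KEY,
               Assign  INTEGER NOT NULL,
               Student TEXT NOT NULL,
               Ngrams  BLOB NOT NULL
           )"""
    )
    return conn


def add_submission(conn, content, assign, student, ngrams):
    """Insert one file; the hash is the SHA-256 of the raw file bytes (assumed)."""
    file_hash = hashlib.sha256(content).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO Homeworks VALUES (?, ?, ?, ?)",
        (file_hash, assign, student, pack_ngrams(ngrams)),
    )
    conn.commit()
    return file_hash
```

Storing the values in sorted order keeps the blob deterministic for a given set of n-grams, which makes rows easy to compare and diff.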

---

## Filtering Frequent N-Grams🗂️

2. **SQLite Database `features.db`**
   Built from `raw.db` with the same structure, but with every n-gram that appears in more than `T` files removed (`T = 30` is the suggested threshold).

---

## Implemented Features🧩

3. **Functions**:
   - `sim1(db, h1, h2)`
     Calculates the Jaccard similarity (size of the intersection of the two n-gram sets divided by the size of their union) of the files identified by hashes `h1` and `h2`.
   - `sim2(db, assign, s1, s2)`
     Calculates the Jaccard similarity of the submissions by students `s1` and `s2` for assignment `assign`. Returns `0` if either student does not exist in the database.
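A minimal sketch of the two functions follows. Several details are assumptions rather than facts from the project: `db` is taken to be an open `sqlite3.Connection`, blobs are decoded as little-endian uint32 values, `sim1` returns `0` for an unknown hash, two empty n-gram sets score `0`, and if a student has several files for one assignment the one with the smallest hash is used. The helpers `_ngram_set` and `jaccard` are hypothetical names.

```python
import sqlite3
import struct


def _ngram_set(blob):
    """Decode a blob of little-endian uint32 values into a set (assumed encoding)."""
    return set(struct.unpack(f"<{len(blob) // 4}I", blob))


def jaccard(a, b):
    """|a & b| / |a | b|, defined here as 0.0 when both sets are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sim1(db, h1, h2):
    """Jaccard similarity of the files with hashes h1 and h2."""
    sets = []
    for file_hash in (h1, h2):
        row = db.execute(
            "SELECT Ngrams FROM Homeworks WHERE Hash = ?", (file_hash,)
        ).fetchone()
        if row is None:  # unknown hash: treated as similarity 0 (assumption)
            return 0.0
        sets.append(_ngram_set(row[0]))
    return jaccard(*sets)


def sim2(db, assign, s1, s2):
    """Jaccard similarity of two students' submissions for one assignment."""
    hashes = []
    for student in (s1, s2):
        row = db.execute(
            "SELECT Hash FROM Homeworks "
            "WHERE Assign = ? AND Student = ? ORDER BY Hash LIMIT 1",
            (assign, student),
        ).fetchone()
        if row is None:  # student missing for this assignment
            return 0.0
        hashes.append(row[0])
    return sim1(db, *hashes)
```

For example, n-gram sets `{1, 2, 3}` and `{2, 3, 4}` share two of four distinct values, so their similarity is 2/4 = 0.5.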

---

## Similarity Calculation📊

4. **For each assignment**:
   - Calculate the similarity between all pairs of submissions.
   - Create a top-500 list of the most similar pairs for each of the two databases (`raw.db` and `features.db`).

5. **Analyze source code**:
   - Select 10 pairs of code from each top list for further analysis.
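The pairwise comparison and ranking could be sketched as below. This reads the description as one top-500 list per database, pooling pairs from all assignments while only comparing files within the same assignment; that reading, the little-endian uint32 blob encoding, and the function name `top_pairs` are assumptions.

```python
import heapq
import sqlite3
import struct
from collections import defaultdict
from itertools import combinations


def top_pairs(db, k=500):
    """Return the k most similar pairs as (similarity, assign, hash_a, hash_b)."""
    # Group decoded n-gram sets by assignment; ORDER BY keeps runs deterministic.
    by_assign = defaultdict(list)
    query = "SELECT Hash, Assign, Ngrams FROM Homeworks ORDER BY Hash"
    for file_hash, assign, blob in db.execute(query):
        grams = set(struct.unpack(f"<{len(blob) // 4}I", blob))
        by_assign[assign].append((file_hash, grams))

    def scored_pairs():
        # Compare every pair of files that belong to the same assignment.
        for assign, subs in by_assign.items():
            for (ha, a), (hb, b) in combinations(subs, 2):
                union = len(a | b)
                sim = len(a & b) / union if union else 0.0
                yield (sim, assign, ha, hb)

    # nlargest keeps at most k items in memory while streaming all pairs.
    return heapq.nlargest(k, scored_pairs())
```

Running `top_pairs` once on a connection to `raw.db` and once on a connection to `features.db` would give the two lists to sample from. The number of comparisons grows quadratically with the number of submissions per assignment, so decoding each blob once up front, as done here, avoids repeated work.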

---

## Analysis and Conclusions🔍

- Analyzing the most similar pairs gives insight into common patterns and potential plagiarism among students.
- Comparing n-grams on the filtered database (`features.db`) reduces noise from n-grams that appear in many files.

---

### Instructions for Use💾
1. Build the `raw.db` database using the initial collection of files.
2. Apply filtering to create the `features.db` database.
3. Implement the `sim1` and `sim2` functions.
4. Calculate and analyze similarities according to the requirements.
