https://github.com/zer0contextlost/sentinalai


Host: GitHub
URL: https://github.com/zer0contextlost/sentinalai
Owner: zer0contextlost
Created: 2026-05-06T17:03:10.000Z (about 2 months ago)
Default Branch: master
Last Pushed: 2026-05-06T18:51:36.000Z (about 2 months ago)
Last Synced: 2026-05-06T19:39:07.995Z (about 2 months ago)
Language: Python
Size: 87.9 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# SentinalAI

Research codebase for AI-generated code detection via style fingerprinting.

## What This Is

An empirical study of whether stylometric features can reliably detect AI-generated
code. Includes a 65-feature extractor, corpus collection pipelines, and classifier
evaluation across two datasets — SemEval-2026 Task 13 (500K samples, 34 AI models)
and a novel paired Codeforces corpus (same problem, human vs. AI solution).

## Project Structure

```
features/    65-feature extractor (lexical + Python AST)
collector/   corpus collection: SemEval pull, Codeforces AI generation, GitHub human scrape
scripts/     feature matrix build, perplexity validation, classifier training/eval
models/      baseline and paired-corpus classifier training
api/         inference endpoint (stub)
```
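
To make the "lexical + Python AST" split concrete, here is a minimal sketch of what a stylometric extractor of this kind could look like. It is not the code in `features/`: the function names and the handful of features shown are illustrative assumptions, and the real extractor computes 65 features.

```python
import ast
from collections import Counter


def lexical_features(source: str) -> dict:
    """Surface-level style signals computed from raw text (hypothetical feature set)."""
    lines = source.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    comments = [ln for ln in non_blank if ln.lstrip().startswith("#")]
    return {
        "n_lines": len(lines),
        "blank_ratio": 1 - len(non_blank) / len(lines) if lines else 0.0,
        "comment_ratio": len(comments) / len(non_blank) if non_blank else 0.0,
        "mean_line_len": sum(map(len, non_blank)) / len(non_blank) if non_blank else 0.0,
    }


def ast_features(source: str) -> dict:
    """Structural signals from the Python AST; zeros if the sample does not parse."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return {"parse_ok": 0.0, "n_functions": 0, "n_loops": 0, "mean_name_len": 0.0}
    nodes = list(ast.walk(tree))
    counts = Counter(type(n).__name__ for n in nodes)
    names = [n.id for n in nodes if isinstance(n, ast.Name)]
    return {
        "parse_ok": 1.0,
        "n_functions": counts["FunctionDef"],
        "n_loops": counts["For"] + counts["While"],
        "mean_name_len": sum(map(len, names)) / len(names) if names else 0.0,
    }


def extract(source: str) -> dict:
    """Merge both feature groups into one flat feature dict."""
    return {**lexical_features(source), **ast_features(source)}
```

Calling `extract(code_string)` returns one flat dict per sample, which stacks naturally into a feature matrix. Returning zeros on a `SyntaxError` keeps non-parsing samples in the matrix rather than dropping them; whether the repo does the same is not stated in this README.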

## Datasets

**SemEval-2026 Task 13** (`DaniilOr/SemEval-2026-Task13` on Hugging Face):
500K samples, 34 AI generators, Python/C++/Java.

**Paired Codeforces corpus** (built by this repo):
96 competitive programming problems, each solved by a human (pre-2022 GitHub) and by an AI (deepseek-coder:6.7b).
Collected via `collector/scrape_codeforces.py` and `collector/fetch_github_human_solutions.py`.

## Setup

```bash
pip install -r requirements.txt
# Requires Ollama running at localhost:11434 for AI generation and perplexity scoring.
# If the model is not available locally yet, pull it first:
# ollama pull deepseek-coder:6.7b
```
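
For orientation, the sketch below shows one way a script might request a completion from a local Ollama server through its `/api/generate` endpoint. The helper names, the prompt and the timeout are assumptions for illustration; the collectors in this repo may call Ollama differently.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "deepseek-coder:6.7b") -> urllib.request.Request:
    """Build a non-streaming generate request (hypothetical helper, not from the repo)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "deepseek-coder:6.7b", timeout: float = 120.0) -> str:
    """Send the request and return the generated text; needs a running Ollama server."""
    req = build_generate_request(prompt, model)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False, Ollama returns one JSON object whose "response" field holds the text.
    return body["response"]
```

With the server running and the model pulled, `generate("Write a Python solution that reads two integers and prints their sum.")` would return the model's answer as a string.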

## Running

```bash
# Pull SemEval corpus
python collector/pull_semeval_dataset.py

# Build feature matrix
python scripts/build_feature_matrix.py

# Train baseline on SemEval
python models/train_baseline.py

# Build paired corpus features
python scripts/build_paired_features.py

# Train on paired corpus (leave-one-problem-out CV)
python scripts/train_paired_classifier.py

# Cross-dataset generalization tests
python scripts/test_generalization.py
python scripts/test_generalization_reverse.py
```
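
Leave-one-problem-out CV holds out every solution to one problem at a time, so the classifier is never tested on a problem it saw during training. The pure-Python sketch below illustrates the split logic only; the function name and the sample problem IDs are assumptions, and `scripts/train_paired_classifier.py` may implement it differently (scikit-learn's `LeaveOneGroupOut` gives equivalent splits).

```python
from collections import defaultdict
from typing import Hashable, Iterator, List, Sequence, Tuple


def leave_one_problem_out(
    problem_ids: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_problem, train_indices, test_indices), one fold per problem."""
    by_problem = defaultdict(list)
    for idx, pid in enumerate(problem_ids):
        by_problem[pid].append(idx)
    for held_out, test_idx in by_problem.items():
        # Train on every sample whose problem differs from the held-out one.
        train_idx = [
            i for pid, idxs in by_problem.items() if pid != held_out for i in idxs
        ]
        yield held_out, train_idx, test_idx


if __name__ == "__main__":
    # Hypothetical layout: one human and one AI solution per problem.
    ids = ["1A", "1A", "4A", "4A", "71A", "71A"]
    for problem, train, test in leave_one_problem_out(ids):
        print(problem, train, test)
```

Grouping by problem matters for a paired corpus: a random split could place the human solution to a problem in training and its AI counterpart in testing, letting the model key on problem-specific tokens rather than style.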
