https://github.com/scthornton/securecode

Unified security training dataset (2,185 examples) covering OWASP Top 10 2021 and OWASP LLM Top 10 2025

ai-security huggingface owasp secure-coding security-dataset training-data web-security

Host: GitHub
URL: https://github.com/scthornton/securecode
Owner: scthornton
License: other
Created: 2026-02-10T03:23:28.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-03-25T14:36:41.000Z (3 months ago)
Last Synced: 2026-06-04T23:23:00.134Z (22 days ago)
Topics: ai-security, huggingface, owasp, secure-coding, security-dataset, training-data, web-security
Language: Python
Size: 5.86 KB
Stars: 3
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: SECURITY.md

README

# SecureCode

**Comprehensive security training dataset for AI coding assistants — 2,185 examples covering both traditional web security and AI/ML security.**

Built by [perfecXion.ai](https://perfecxion.ai).

## Dataset Family

| Dataset | Examples | Focus | HuggingFace | GitHub |
|---------|----------|-------|-------------|--------|
| **SecureCode** | 2,185 | Unified (web + AI/ML) | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) | This repo |
| SecureCode v2 | 1,435 | Web security (OWASP Top 10 2021) | [scthornton/securecode-v2](https://huggingface.co/datasets/scthornton/securecode-v2) | [securecode-v2](https://github.com/scthornton/securecode-v2) |
| SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | [scthornton/securecode-aiml](https://huggingface.co/datasets/scthornton/securecode-aiml) | [securecode-aiml](https://github.com/scthornton/securecode-aiml) |

## Quick Start

```python
from datasets import load_dataset

# Load everything (2,185 examples)
dataset = load_dataset("scthornton/securecode")

# Load only web security (1,435 examples)
web = load_dataset("scthornton/securecode", "web")

# Load only AI/ML security (750 examples)
aiml = load_dataset("scthornton/securecode", "aiml")
```

## What's In It

Every example is a 4-turn conversation between a developer and an AI coding assistant. The developer asks how to build something, and the assistant provides a vulnerable implementation, explains why it's dangerous, shows a secure alternative with 5+ defense layers, and then covers testing, monitoring, and common mistakes.

**Web Security (1,435 examples):** SQL injection, XSS, authentication bypass, SSRF, cryptographic failures, and more across 12 programming languages and 9 web frameworks (Express.js, Django, Spring Boot, Flask, Rails, Laravel, ASP.NET Core, FastAPI, NestJS).

**AI/ML Security (750 examples):** Prompt injection, model poisoning, embedding manipulation, system prompt leakage, excessive agent autonomy, and more across 30+ AI/ML frameworks (LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, vLLM, CrewAI, AutoGen, etc.).

## Unified Schema

All conversations use a normalized `{role, content}` format:

```json
{
  "id": "example-id",
  "metadata": { "category": "...", "severity": "CRITICAL", "cwe": "CWE-79", "lang": "python" },
  "context": { "description": "...", "impact": "..." },
  "conversations": [
    {"role": "human", "content": "How do I build secure JWT auth?"},
    {"role": "assistant", "content": "Here's the vulnerable version... here's the secure version..."},
    {"role": "human", "content": "How do I test this?"},
    {"role": "assistant", "content": "Here's how to test, monitor, and avoid common mistakes..."}
  ],
  "quality_score": null,
  "security_assertions": [],
  "references": []
}
```
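
As an illustration of working with this schema, the sketch below checks that a record has the four alternating turns described above and filters records by `metadata.severity`. The helper names `has_expected_turns` and `by_severity` are hypothetical and not part of this repo; the sample record mirrors the JSON shown above.

```python
from typing import Any

# Turn order implied by the 4-turn conversation format.
EXPECTED_ROLES = ["human", "assistant", "human", "assistant"]


def has_expected_turns(record: dict[str, Any]) -> bool:
    """Return True if the record holds four alternating human/assistant turns."""
    turns = record.get("conversations", [])
    roles = [turn.get("role") for turn in turns]
    has_text = all(isinstance(turn.get("content"), str) for turn in turns)
    return roles == EXPECTED_ROLES and has_text


def by_severity(records: list[dict[str, Any]], severity: str) -> list[dict[str, Any]]:
    """Keep records whose metadata.severity matches, ignoring case."""
    wanted = severity.upper()
    return [
        r for r in records
        if str(r.get("metadata", {}).get("severity", "")).upper() == wanted
    ]


# Hypothetical sample record, copied from the schema example.
sample = {
    "id": "example-id",
    "metadata": {"category": "...", "severity": "CRITICAL", "cwe": "CWE-79", "lang": "python"},
    "conversations": [
        {"role": "human", "content": "How do I build secure JWT auth?"},
        {"role": "assistant", "content": "Here's the vulnerable version... here's the secure version..."},
        {"role": "human", "content": "How do I test this?"},
        {"role": "assistant", "content": "Here's how to test, monitor, and avoid common mistakes..."},
    ],
}

print(has_expected_turns(sample))              # True
print(len(by_severity([sample], "critical")))  # 1
```

Both helpers operate on plain dictionaries, so they should work on rows from a loaded dataset as well as on JSON files read from disk.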

## Building the Unified Dataset

The unified dataset is built from the two source datasets using a normalization script that converts v2.x conversations from `{turn, from, value}` to `{role, content}` format.

```bash
python3 scripts/build_unified_dataset.py
```

This generates `unified-data/data/web/` (1,435 files) and `unified-data/data/aiml/` (750 files), ready to push to HuggingFace.
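
The conversion from `{turn, from, value}` to `{role, content}` could look roughly like the sketch below. It is not the code in `scripts/build_unified_dataset.py`; the function name `normalize_turns` is hypothetical, and it assumes that `turn` is a sortable position, `from` names the speaker, and `value` holds the message text. The real script may also rename roles or handle other fields.

```python
from typing import Any


def normalize_turns(turns: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Convert v2.x-style turns to the unified {role, content} form.

    Assumption: `turn` gives the position in the conversation, `from` names
    the speaker, and `value` holds the message text.
    """
    ordered = sorted(turns, key=lambda t: t["turn"])
    return [{"role": t["from"], "content": t["value"]} for t in ordered]


# Hypothetical legacy conversation, deliberately out of order.
legacy = [
    {"turn": 2, "from": "assistant", "value": "Use parameterized queries."},
    {"turn": 1, "from": "human", "value": "How do I query the database safely?"},
]

print(normalize_turns(legacy))
```

Sorting by `turn` before dropping it keeps the conversation order intact, since the unified format relies on list position alone.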

## Configs

| Config | Examples | OWASP Standard |
|--------|----------|----------------|
| `default` | 2,185 | Both |
| `web` | 1,435 | OWASP Top 10 2021 |
| `aiml` | 750 | OWASP LLM Top 10 2025 |
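
A quick sanity check on these numbers, sketched under the assumption that you have already counted the examples in each config (for instance by summing `len(split)` over the splits that `load_dataset` returns). The helper `check_counts` is hypothetical and not part of this repo.

```python
# Example counts per config, taken from the table above.
EXPECTED = {"default": 2185, "web": 1435, "aiml": 750}


def check_counts(counts: dict[str, int]) -> list[str]:
    """Return a list of human-readable problems; empty means all counts match."""
    problems = []
    for name, expected in EXPECTED.items():
        actual = counts.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected}, got {actual}")
    # The unified config should be exactly the two subsets combined.
    if counts.get("web", 0) + counts.get("aiml", 0) != counts.get("default"):
        problems.append("web + aiml does not equal default")
    return problems


print(check_counts(EXPECTED))  # []
```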

## Citation

```bibtex
@misc{thornton2025securecode,
  title={SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models},
  author={Thornton, Scott},
  year={2025},
  publisher={perfecXion.ai},
  url={https://huggingface.co/datasets/scthornton/securecode-v2},
  note={arXiv:2512.18542}
}

@dataset{thornton2026securecodeaiml,
  title={SecureCode AI/ML: AI/ML Security Training Dataset for the OWASP LLM Top 10 2025},
  author={Thornton, Scott},
  year={2026},
  publisher={perfecXion.ai},
  url={https://huggingface.co/datasets/scthornton/securecode-aiml}
}
```

## License

- **Web examples:** CC BY-NC-SA 4.0

- **AI/ML examples:** MIT

- **Unified dataset:** CC BY-NC-SA 4.0 (the more restrictive of the two)

---

## Contact

**Scott Thornton** — AI Security Researcher

- Website: [perfecxion.ai](https://perfecxion.ai/)

- Email: [scott@perfecxion.ai](mailto:scott@perfecxion.ai)

- LinkedIn: [linkedin.com/in/scthornton](https://www.linkedin.com/in/scthornton)

- ORCID: [0009-0008-0491-0032](https://orcid.org/0009-0008-0491-0032)

- GitHub: [@scthornton](https://github.com/scthornton)

**Security Issues**: Please report via [SECURITY.md](SECURITY.md)
