https://github.com/theelderemo/pentesting-explanations

A high-quality supervised fine-tuning dataset for penetration testing expertise, red team tradecraft, and - as the dataset matures - novel vulnerability research and zero-day reasoning. The dataset is structured to teach models how to think like offensive security practitioners, not merely recall labels or technique names.

cybersecurity dataset hacking hacking-tool huggingface machine-learning mitre-attack offensive-security penetration-testing red-team supervised-learning


# Pentesting Explanations - Adversarial Reasoning & Vulnerability Research

**A supervised fine-tuning dataset for adversarial reasoning, penetration testing expertise, and vulnerability research.**

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![HuggingFace Dataset](https://img.shields.io/badge/🤗%20Dataset-theelderemo%2Fpentesting--explanations-yellow)](https://huggingface.co/datasets/theelderemo/pentesting-explanations)
[![Rows](https://img.shields.io/badge/rows-5%2C906-green)]()
[![Tasks](https://img.shields.io/badge/tasks-SFT%20%7C%20GRPO%20%7C%20PRM-orange)]()

Most cybersecurity datasets teach models to *recognize* known things: given a technique name, output its description; given a scenario, classify the attack vector. This is label memorization, and it produces models that fail the moment they encounter an unfamiliar codebase, a novel vulnerability class, or a non-textbook attack chain.

This dataset is built around a different objective: teaching models to reason through offensive security problems the way an expert practitioner does. Every row is designed to produce genuine deliberation. The `think` column is a live reasoning trace, option-by-option and hypothesis-by-hypothesis, written from the attacker's perspective, with dead ends included.

The long-term goal is to train models capable of genuine adversarial reasoning: hypothesis formation from unfamiliar code, data-flow tracing, variant hunting across patch history, and exploit primitive construction.

> **Dataset hosted on Hugging Face:** [theelderemo/pentesting-explanations](https://huggingface.co/datasets/theelderemo/pentesting-explanations)

**This dataset is not:**
- A defensive or blue team dataset. Every question and reasoning trace is written from the attacker's perspective.
- A detection or mitigation dataset. Questions never ask how to detect, alert on, or remediate techniques.
- A label memorization dataset. The goal is never "what is this called"; it is always "how does an operator think through this decision."

## Quick Start

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Full dataset
ds = load_dataset("theelderemo/pentesting-explanations")

# HackTricks only (a config is selected with the `name` argument)
hacktricks = load_dataset("theelderemo/pentesting-explanations", name="hacktricks")

# MITRE ATT&CK only
mitre = load_dataset("theelderemo/pentesting-explanations", name="mitre_attack")

# Isolated CoT for process reward / GRPO
think_only = ds["train"]["think"]

# SFT with apply_chat_template
tokenizer = AutoTokenizer.from_pretrained("your-model")
for row in ds["train"]:
    formatted = tokenizer.apply_chat_template(row["messages"], tokenize=False)
```

## Schema

All shards share this schema. MCQ columns will be `null` for future shards using open-ended formats.

| Column | Type | Description |
|---|---|---|
| `question` | string | Multiple-choice question, framed from attacker perspective |
| `choices` | list[str] | Four answer options (A–D). Distractors use real tools/commands in incorrect contexts |
| `answer_idx` | int | Zero-based index of correct answer (0–3) |
| `correct_letter` | string | Letter of correct answer (A, B, C, or D) |
| `correct_choice` | string | Full text of the correct answer option |
| `explanation` | string | Expert explanation: correct answer justification + per-option debunking, attacker perspective |
| `prompt` | string | Full formatted prompt (system context + question + options) |
| `response` | string | Bolded answer header + full explanation |
| `think` | string | Isolated CoT deliberation. Option-by-option reasoning. Minimum 150 words. No answer restatement. |
| `messages` | list[dict] | SFT-ready `[{"role": "user", ...}, {"role": "assistant", "content": "......"}]` |

The `think` field is deliberately separated from `response` so process reward models can supervise the reasoning trace independently of the final answer.
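The relationships between the MCQ columns in the table above can be checked mechanically. The following is a minimal sketch, not the repository's validator: it assumes `correct_choice` stores exactly the same text as `choices[answer_idx]`, and the example row is invented for illustration.

```python
from string import ascii_uppercase


def check_mcq_row(row: dict) -> list[str]:
    """Return a list of consistency problems found in one MCQ row (empty if none)."""
    problems = []
    choices = row.get("choices")
    idx = row.get("answer_idx")

    # Open-ended rows are expected to carry null MCQ columns; nothing to check.
    if choices is None or idx is None:
        return problems

    if not 0 <= idx < len(choices):
        problems.append(f"answer_idx {idx} is outside the choices list")
        return problems

    # answer_idx is zero-based, so 0 -> "A", 1 -> "B", and so on.
    if row.get("correct_letter") != ascii_uppercase[idx]:
        problems.append("correct_letter does not match answer_idx")
    # Assumption: correct_choice is stored verbatim, with no letter prefix.
    if row.get("correct_choice") != choices[idx]:
        problems.append("correct_choice does not match choices[answer_idx]")
    return problems


# Hypothetical row, invented for illustration only.
example_row = {
    "choices": ["option one", "option two", "option three", "option four"],
    "answer_idx": 2,
    "correct_letter": "C",
    "correct_choice": "option three",
}
print(check_mcq_row(example_row))  # []
```

If the stored `correct_choice` turns out to include a letter prefix or different whitespace, the last comparison would need to be relaxed accordingly.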

## Dataset Structure

The dataset uses a **one-source-per-parquet-file** design. Each data source lives in its own numbered shard so you can load exactly what you need without filter logic.

```python
from datasets import load_dataset

# Select specific shards
ds = load_dataset("theelderemo/pentesting-explanations", data_files={
    "train": ["data/train-00000.parquet", "data/train-00001.parquet"]
})
```
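Because shard files follow a zero-padded numbering pattern, the `data_files` list can be built from shard numbers rather than typed by hand. This is a small sketch; `shard_paths` is a hypothetical helper, and the `data/train-NNNNN.parquet` layout is taken from the example paths above.

```python
def shard_paths(indices, prefix="data/train-", width=5):
    """Build parquet paths such as data/train-00001.parquet from shard numbers."""
    return [f"{prefix}{i:0{width}d}.parquet" for i in indices]


data_files = {"train": shard_paths([0, 1])}
print(data_files)
# {'train': ['data/train-00000.parquet', 'data/train-00001.parquet']}

# The result can then be passed on, for example:
# load_dataset("theelderemo/pentesting-explanations", data_files=data_files)
```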

### Current Sources

#### `train-00000` - HackTricks + Base Eval (3,228 rows)

Built on [preemware/pentesting-eval](https://huggingface.co/datasets/preemware/pentesting-eval) and augmented with [HackTricks Wiki](https://github.com/HackTricks-wiki/hacktricks), processed into 5,404 cleaned Markdown chunks across 126 technical domains:

- Active Directory attacks (Kerberoasting, AS-REP Roasting, Pass-the-Hash, DCSync, ADCS abuse, ACL/delegation attacks)
- Web application exploitation (SQLi, XSS, SSRF, XXE, IDOR, deserialization, JWT attacks, OAuth abuse)
- Linux privilege escalation (SUID/SGID, capabilities, cron, container escapes)
- Windows privilege escalation (token impersonation, service misconfigurations, AlwaysInstallElevated)
- Network attacks (LLMNR/NBT-NS poisoning, SMB relay, Kerberos attacks)
- Cloud misconfigurations and exploitation paths (AWS, Azure, GCP)
- Malware analysis (static/dynamic, sandbox evasion, unpacking)
- Mobile security (Android, iOS)
- Network services (FTP, SSH, SMTP, SNMP, RDP, WinRM)
- Cryptographic attacks

Questions are generated with misconception-based distractors: wrong options that use real tools, real commands, and real techniques, but are incorrect for the specific context being tested.

#### `train-00001` - MITRE ATT&CK Enterprise + Mobile + ICS (2,678 rows)

Built from [mitre/cti](https://github.com/mitre/cti) STIX bundles (ATT&CK version sourced at generation time). All revoked and deprecated techniques are excluded.

| Domain | Techniques |
|---|---|
| Enterprise | 691 |
| Mobile | 124 |
| ICS / OT | 79 |
| **Total** | **894** |

Three question angles per technique:

| Angle | What it tests |
|---|---|
| `offensive-mechanics / how-it-works` | What the technique does and how an attacker executes it |
| `operator-tradecraft / tool-and-command` | Specific tooling, commands, flags, and payloads |
| `privilege-and-platform / preconditions` | Required access level, target OS, environment preconditions |

Every question is grounded in real-world procedure examples pulled from ATT&CK STIX bundles via `mitreattack-python`, so distractors reference actual threat actor tooling (Mimikatz, Cobalt Strike, Impacket, CrackMapExec, BloodHound, etc.) in plausible-but-incorrect contexts.

## Planned Shards

The dataset has a clear trajectory toward training models capable of novel vulnerability discovery. Each planned shard targets a reasoning primitive currently absent from public security training data.

### Vulnerability Research & Exploit Development

| Shard | Content |
|---|---|
| `train-00002` | CVE patch diff analysis - root cause + variant hunting reasoning |
| `train-00003` | OSS-Fuzz source code audit traces (input boundary → data flow → primitive identification) |
| `train-00004` | Exploit primitive → weaponization reasoning (UAF/OOB/type confusion → heap grooming → ROP chains) |
| `train-00005` | Browser & renderer exploit chains (JIT bugs, V8/SpiderMonkey, sandbox escapes) |
| `train-00006` | Kernel exploitation reasoning - LPE primitives, race conditions, KASLR/SMEP/SMAP bypass logic |

### CTF & Competition Reasoning

| Shard | Content |
|---|---|
| `train-00007` | CTF pwn reasoning chains |
| `train-00008` | CTF web exploitation reasoning chains |
| `train-00009` | CTF reversing & binary analysis chains |

### Threat Intelligence & Adversary Simulation

| Shard | Content |
|---|---|
| `train-00010` | APT campaign tradecraft - actor-specific toolchain decisions, campaign sequencing, OPSEC reasoning |
| `train-00011` | Ransomware operator playbooks & affiliate tradecraft |
| `train-00012` | State-sponsored implant & C2 framework analysis |

### Active Directory & Enterprise Network

| Shard | Content |
|---|---|
| `train-00013` | Active Directory attack chains - Kerberoasting, AS-REP, ADCS ESC1–ESC13, ACL abuse, delegation |
| `train-00014` | LOLBAS / LOLDrivers / GTFOBins operational reasoning |
| `train-00015` | Cloud attack paths - AWS, Azure, GCP IAM privesc and cross-service pivots |

### Web Application & API Exploitation

| Shard | Content |
|---|---|
| `train-00016` | PayloadsAllTheThings structured exploitation reasoning |
| `train-00017` | Bug bounty root cause reasoning - HackerOne disclosed reports |
| `train-00018` | Web cache poisoning, HTTP desync & request smuggling |
| `train-00019` | OAuth, OIDC & SSO attack reasoning |

### Malware Analysis, ICS/OT, Embedded, and More

| Shard | Content |
|---|---|
| `train-00020` | Malware analysis reasoning - dynamic + static |
| `train-00021` | Obfuscation & packer analysis chains |
| `train-00022` | ICS/SCADA attack reasoning (PLC logic abuse, HMI pivots, Industroyer/TRITON analyses) |
| `train-00023` | Firmware analysis & embedded exploitation |
| `train-00024` | ired.team operator notes - process injection, AV evasion, OPSEC tradecraft |
| `train-00025` | Proving Grounds / HTB retired machine reasoning chains |

As the dataset matures, the MCQ scaffolding will gradually be replaced by open-ended vulnerability research tasks with no labeled answer choices, only a reasoning process and a conclusion, mirroring the actual cognitive structure of zero-day discovery.

## Intended Use

**Primary use cases:**
- Supervised fine-tuning for penetration testing and red team LLMs
- Training adversarial reasoning and systematic distractor elimination
- Process reward model training using the isolated `think` column (GRPO, DPO, RLHF)
- Building autonomous vulnerability research agents
- Security certification preparation (OSCP, OSED, GREM, GPEN, GXPN)
- Threat emulation and adversary simulation training

**Responsible use:** This dataset is intended for legitimate security research, penetration testing education, and the development of defensive AI tools. All techniques are documented in public sources (MITRE ATT&CK, HackTricks, academic research). Users are responsible for ensuring their use complies with applicable laws and ethical guidelines.
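For the reward-based training use cases listed above, one simple outcome-level signal is whether a generated completion names the correct letter. The sketch below is an assumption-heavy illustration: the schema only says that `response` begins with a bolded answer header, so the `**Answer: X**` pattern and the `answer_reward` function are hypothetical and should be adapted to the actual header format.

```python
import re

# Assumed header shape such as "**Answer: B**"; the real wording may differ.
ANSWER_RE = re.compile(r"\*\*Answer:\s*([A-D])\b")


def answer_reward(completion: str, correct_letter: str) -> float:
    """Return 1.0 if the first matching answer header names the correct letter."""
    match = ANSWER_RE.search(completion)
    if match is None:
        # No recognizable header: treat as an incorrect answer.
        return 0.0
    return 1.0 if match.group(1) == correct_letter.strip().upper() else 0.0


print(answer_reward("**Answer: B**\n\nExplanation text", "B"))  # 1.0
print(answer_reward("**Answer: C**\n\nExplanation text", "B"))  # 0.0
```

A signal like this scores only the final answer; supervising the reasoning itself would rely on the separate `think` column.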

## Tools & Validation

The repository includes a suite of tools for dataset validation and format conversion:

```bash
# Install dependencies
make install

# Run all validation checks
make validate

# Individual validation tools
make validate-schema # Check parquet schema compliance
make validate-think # Validate reasoning trace quality
make duplicates # Detect near-duplicate questions
make metrics # Generate quality metrics report

# Format conversion
make convert # Convert parquet to JSON
```

### Available Scripts

Located in `tools/scripts/`:

| Script | Purpose |
|---|---|
| `validate_schema.py` | Verify parquet files conform to dataset schema |
| `validate_think_column.py` | Score reasoning traces on quality metrics (word count, option analysis, LLM artifacts, perspective) |
| `duplicate_detector.py` | Find similar/duplicate questions across shards using fuzzy matching |
| `quality_metrics.py` | Generate comprehensive quality reports (row counts, text lengths, completeness %) |
| `parquet_to_json.py` | Convert parquet shards to JSON for analysis |
| `csv_to_dataset_row.py` | Validate and convert CSV rows to dataset format |
| `validate_and_convert.py` | Combined validation + conversion tool |
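To give a sense of what heuristic checks on a reasoning trace might look like, here is a minimal sketch. It is not the logic of `validate_think_column.py`, which this README does not show: the 150-word minimum comes from the schema table, while the artifact phrases and the "Option X" matching rule are assumptions made for illustration.

```python
import re
from string import ascii_uppercase

MIN_WORDS = 150  # minimum stated for the `think` column in the schema table

# Hypothetical artifact phrases; the real script's list is not documented here.
ARTIFACT_PHRASES = ("as an ai", "i cannot", "language model")


def check_think(think: str, n_options: int = 4) -> dict:
    """Run simple heuristic checks on one reasoning trace."""
    words = think.split()
    lowered = think.lower()
    # Look for explicit references such as "Option A" or "option b".
    mentioned = {
        letter
        for letter in ascii_uppercase[:n_options]
        if re.search(rf"\boption\s+{letter}\b", think, flags=re.IGNORECASE)
    }
    return {
        "word_count": len(words),
        "long_enough": len(words) >= MIN_WORDS,
        "options_mentioned": sorted(mentioned),
        "covers_all_options": len(mentioned) == n_options,
        "artifact_free": not any(p in lowered for p in ARTIFACT_PHRASES),
    }
```

The option check is deliberately crude: a trace that discusses each choice without writing "Option A" would be flagged, so thresholds and patterns like these would need tuning against real rows.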

### Usage Examples

```bash
# Validate all shards
python tools/scripts/validate_schema.py data/

# Check think column quality (threshold 0.7)
python tools/scripts/validate_think_column.py data/ --threshold 0.7

# Find duplicates with verbose output
python tools/scripts/duplicate_detector.py data/ --verbose

# Generate metrics report and export to JSON
python tools/scripts/quality_metrics.py data/ --export metrics.json

# Convert parquet to JSON
python tools/scripts/parquet_to_json.py data/

# Validate and convert CSV in one step
python tools/scripts/validate_and_convert.py --csv new_rows.csv --json output.json

# Validate existing parquet file
python tools/scripts/validate_and_convert.py --validate data/train-00000.parquet
```
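Fuzzy duplicate detection of the kind described above can be approximated with the standard library alone. This is a sketch under stated assumptions, not the implementation of `duplicate_detector.py`: it uses `difflib.SequenceMatcher` as the similarity measure, and the 0.9 threshold and example questions are arbitrary.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def near_duplicates(questions, threshold=0.9):
    """Return (i, j, ratio) for every pair of questions at or above the threshold."""
    cleaned = [normalize(q) for q in questions]
    pairs = []
    # Pairwise comparison is quadratic in the number of questions.
    for i, j in combinations(range(len(cleaned)), 2):
        ratio = SequenceMatcher(None, cleaned[i], cleaned[j]).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 3)))
    return pairs


# Invented questions, for illustration only.
questions = [
    "Which flag enables X?",
    "which  flag enables X?",
    "Totally different question about Kerberos",
]
print(near_duplicates(questions))  # [(0, 1, 1.0)]
```

At a few thousand rows the pairwise loop is still workable, but larger shards would likely call for blocking or hashing before comparison.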

## Contributing

Contributions are welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## Citation & Acknowledgments

```bibtex
@dataset{theelderemo_pentesting_explanations_2026,
  author    = { Christopher Dickinson },
  title     = { pentesting-explanations },
  year      = 2026,
  url       = { https://huggingface.co/datasets/theelderemo/pentesting-explanations },
  doi       = { 10.57967/hf/8471 },
  publisher = { Hugging Face }
}
```

**HackTricks** - Special thanks to Carlos Polop and the entire HackTricks community for building and maintaining one of the most comprehensive open-source cybersecurity knowledge bases available. The HackTricks Wiki is the backbone of `train-00000`. [github.com/HackTricks-wiki/hacktricks](https://github.com/HackTricks-wiki/hacktricks)

**MITRE ATT&CK** - `train-00001` is built on MITRE ATT&CK STIX data from the [mitre/cti](https://github.com/mitre/cti) repository, licensed under Apache 2.0. ATT&CK is a globally accessible knowledge base of adversary tactics and techniques based on real-world observations. [attack.mitre.org](https://attack.mitre.org)

**mitreattack-python** - Procedure example and sub-technique extraction powered by the [mitreattack-python](https://github.com/mitre-attack/mitreattack-python) library.

**Base benchmark** - The original evaluation set that seeded `train-00000` is courtesy of [preemware/pentesting-eval](https://huggingface.co/datasets/preemware/pentesting-eval).