https://github.com/theelderemo/pentesting-explanations

A high-quality supervised fine-tuning dataset for penetration testing expertise, red team tradecraft, and - as the dataset matures - novel vulnerability research and zero-day reasoning. The dataset is structured to teach models how to think like offensive security practitioners, not merely recall labels or technique names.
Host: GitHub
URL: https://github.com/theelderemo/pentesting-explanations
Owner: theelderemo
License: apache-2.0
Created: 2026-04-22T17:02:38.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-04-23T08:56:15.000Z (2 months ago)
Last Synced: 2026-04-23T10:33:38.378Z (2 months ago)
Topics: cybersecurity, dataset, hacking, hacking-tool, huggingface, machine-learning, mitre-attack, offensive-security, penetration-testing, red-team, supervised-learning
Language: Python
Homepage: https://huggingface.co/datasets/theelderemo/pentesting-explanations
Size: 14.8 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

# Pentesting Explanations - Adversarial Reasoning & Vulnerability Research

**A supervised fine-tuning dataset for adversarial reasoning, penetration testing expertise, and vulnerability research.**

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

[![HuggingFace Dataset](https://img.shields.io/badge/🤗%20Dataset-theelderemo%2Fpentesting--explanations-yellow)](https://huggingface.co/datasets/theelderemo/pentesting-explanations)

[![Rows](https://img.shields.io/badge/rows-5%2C906-green)]()

[![Tasks](https://img.shields.io/badge/tasks-SFT%20%7C%20GRPO%20%7C%20PRM-orange)]()

Most cybersecurity datasets teach models to *recognize* known things: given a technique name, output its description; given a scenario, classify the attack vector. This is label memorization, and it produces models that fail the moment they encounter an unfamiliar codebase, a novel vulnerability class, or a non-textbook attack chain.

This dataset is built around a different objective: teaching models to reason through offensive security problems the way an expert practitioner does. Every row is designed to produce genuine deliberation. The `think` column is a live reasoning trace, option-by-option and hypothesis-by-hypothesis, written from the attacker's perspective, with dead ends included.

 

The long-term goal is to train models capable of genuine adversarial reasoning: hypothesis formation from unfamiliar code, data-flow tracing, variant hunting across patch history, and exploit primitive construction.

> **Dataset hosted on Hugging Face:** [theelderemo/pentesting-explanations](https://huggingface.co/datasets/theelderemo/pentesting-explanations)

**This dataset is not:**  

- A defensive or blue team dataset. Every question and reasoning trace is written from the attacker's perspective.

- A detection or mitigation dataset. Questions never ask how to detect, alert on, or remediate techniques.

- A label memorization dataset. The goal is never "what is this called"; it is always "how does an operator think through this decision."

## Quick Start

 

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Full dataset
ds = load_dataset("theelderemo/pentesting-explanations")

# HackTricks only (load_dataset selects a config with `name`)
ds = load_dataset("theelderemo/pentesting-explanations", name="hacktricks")

# MITRE ATT&CK only
ds = load_dataset("theelderemo/pentesting-explanations", name="mitre_attack")

# Isolated CoT for process reward / GRPO
think_only = ds["train"]["think"]

# SFT with apply_chat_template: render each conversation to a single string
tokenizer = AutoTokenizer.from_pretrained("your-model")
formatted = [
    tokenizer.apply_chat_template(row["messages"], tokenize=False)
    for row in ds["train"]
]
```

## Schema

 

All shards share this schema. MCQ columns will be `null` for future shards using open-ended formats.

 

| Column | Type | Description |
|---|---|---|
| `question` | string | Multiple-choice question, framed from attacker perspective |
| `choices` | list[str] | Four answer options (A–D). Distractors use real tools/commands in incorrect contexts |
| `answer_idx` | int | Zero-based index of correct answer (0–3) |
| `correct_letter` | string | Letter of correct answer (A, B, C, or D) |
| `correct_choice` | string | Full text of the correct answer option |
| `explanation` | string | Expert explanation: correct answer justification + per-option debunking, attacker perspective |
| `prompt` | string | Full formatted prompt (system context + question + options) |
| `response` | string | Bolded answer header + full explanation |
| `think` | string | Isolated CoT deliberation. Option-by-option reasoning. Minimum 150 words. No answer restatement. |
| `messages` | list[dict] | SFT-ready `[{"role": "user", ...}, {"role": "assistant", "content": "......"}]` |

 

The `think` field is deliberately separated from `response` so process reward models can supervise the reasoning trace independently of the final answer.
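The relationships between the MCQ columns can be checked mechanically. The following is a minimal sketch, not the repository's validator: `check_row` is a hypothetical helper, and it assumes that `correct_choice` is an exact copy of `choices[answer_idx]` and that the 150-word minimum for `think` is counted on whitespace-separated tokens.

```python
LETTERS = "ABCD"
MIN_THINK_WORDS = 150  # minimum stated in the schema table


def check_row(row: dict) -> list[str]:
    """Return a list of problems found in one MCQ row (empty if none)."""
    idx = row["answer_idx"]
    choices = row["choices"]

    # Without a valid index the remaining checks cannot be evaluated.
    if not 0 <= idx < min(len(choices), len(LETTERS)):
        return [f"answer_idx {idx} is out of range"]

    problems = []
    if row["correct_letter"] != LETTERS[idx]:
        problems.append("correct_letter does not match answer_idx")
    # Assumption: correct_choice mirrors the selected option verbatim.
    if row["correct_choice"] != choices[idx]:
        problems.append("correct_choice does not match choices[answer_idx]")
    # Assumption: words are whitespace-separated tokens.
    if len(row["think"].split()) < MIN_THINK_WORDS:
        problems.append("think is shorter than the documented minimum")
    return problems
```

Applied to a loaded split, something like `[check_row(r) for r in ds["train"]]` would give one problem list per row. Rows from future open-ended shards, where the MCQ columns are `null`, would need to be skipped first.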

## Dataset Structure

 

The dataset uses a **one-source-per-parquet-file** design. Each data source lives in its own numbered shard so you can load exactly what you need without filter logic.

 

```python
# Select specific shards
ds = load_dataset("theelderemo/pentesting-explanations", data_files={
    "train": ["data/train-00000.parquet", "data/train-00001.parquet"]
})
```
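Because shards follow a numbered naming pattern, the `data_files` list can be generated rather than typed out. This is a small sketch; `shard_files` is a hypothetical helper, and the `data/train-NNNNN.parquet` layout is inferred from the two paths shown above.

```python
def shard_files(indices, split="train", data_dir="data"):
    """Build parquet paths for the given shard numbers.

    Assumes the five-digit, zero-padded naming seen in the example paths.
    """
    return [f"{data_dir}/{split}-{i:05d}.parquet" for i in indices]


print(shard_files([0, 1]))
```

The result can be passed as `data_files={"train": shard_files([0, 1])}` to `load_dataset`.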

### Current Sources

 


#### `train-00000` - HackTricks + Base Eval (3,228 rows)

 

Built on [preemware/pentesting-eval](https://huggingface.co/datasets/preemware/pentesting-eval) and augmented with [HackTricks Wiki](https://github.com/HackTricks-wiki/hacktricks), processed into 5,404 cleaned Markdown chunks across 126 technical domains:

 

- Active Directory attacks (Kerberoasting, AS-REP Roasting, Pass-the-Hash, DCSync, ADCS abuse, ACL/delegation attacks)

- Web application exploitation (SQLi, XSS, SSRF, XXE, IDOR, deserialization, JWT attacks, OAuth abuse)

- Linux privilege escalation (SUID/SGID, capabilities, cron, container escapes)

- Windows privilege escalation (token impersonation, service misconfigurations, AlwaysInstallElevated)

- Network attacks (LLMNR/NBT-NS poisoning, SMB relay, Kerberos attacks)

- Cloud misconfigurations and exploitation paths (AWS, Azure, GCP)

- Malware analysis (static/dynamic, sandbox evasion, unpacking)

- Mobile security (Android, iOS)

- Network services (FTP, SSH, SMTP, SNMP, RDP, WinRM)

- Cryptographic attacks

Questions are generated with misconception-based distractors: wrong options that use real tools, real commands, and real techniques, but are incorrect for the specific context being tested.

 


#### `train-00001` - MITRE ATT&CK Enterprise + Mobile + ICS (2,678 rows)

 

Built from [mitre/cti](https://github.com/mitre/cti) STIX bundles (ATT&CK version sourced at generation time). All revoked and deprecated techniques are excluded.

 

| Domain | Techniques |
|---|---|
| Enterprise | 691 |
| Mobile | 124 |
| ICS / OT | 79 |
| **Total** | **894** |
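The exclusion of revoked and deprecated techniques can be illustrated on a raw STIX bundle. This is a minimal sketch, not the dataset's generation code: `active_techniques` is a hypothetical helper that works on the parsed JSON directly. It relies on the ATT&CK STIX convention that techniques are `attack-pattern` objects carrying optional `revoked` and `x_mitre_deprecated` flags. It does not distinguish techniques from sub-techniques, so its counts need not match the table above.

```python
def active_techniques(bundle: dict) -> list[dict]:
    """Return attack-pattern objects that are neither revoked nor deprecated."""
    return [
        obj
        for obj in bundle.get("objects", [])
        if obj.get("type") == "attack-pattern"
        and not obj.get("revoked", False)
        and not obj.get("x_mitre_deprecated", False)
    ]


# Toy bundle with made-up entries, for illustration only.
toy_bundle = {
    "type": "bundle",
    "objects": [
        {"type": "attack-pattern", "name": "Kept technique"},
        {"type": "attack-pattern", "name": "Revoked one", "revoked": True},
        {"type": "attack-pattern", "name": "Deprecated one", "x_mitre_deprecated": True},
        {"type": "malware", "name": "Not a technique"},
    ],
}
print([t["name"] for t in active_techniques(toy_bundle)])
```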

 

Three question angles per technique:

 

| Angle | What it tests |
|---|---|
| `offensive-mechanics / how-it-works` | What the technique does and how an attacker executes it |
| `operator-tradecraft / tool-and-command` | Specific tooling, commands, flags, and payloads |
| `privilege-and-platform / preconditions` | Required access level, target OS, environment preconditions |

 

Every question is grounded in real-world procedure examples pulled from ATT&CK STIX bundles via `mitreattack-python`, so distractors reference actual threat actor tooling (Mimikatz, Cobalt Strike, Impacket, CrackMapExec, BloodHound, etc.) in plausible-but-incorrect contexts.

 

## Planned Shards

 

The dataset has a clear trajectory toward training models capable of novel vulnerability discovery. Each planned shard targets a reasoning primitive currently absent from public security training data.

 

Vulnerability Research & Exploit Development

| Shard | Content |
|---|---|
| `train-00002` | CVE patch diff analysis - root cause + variant hunting reasoning |
| `train-00003` | OSS-Fuzz source code audit traces (input boundary → data flow → primitive identification) |
| `train-00004` | Exploit primitive → weaponization reasoning (UAF/OOB/type confusion → heap grooming → ROP chains) |
| `train-00005` | Browser & renderer exploit chains (JIT bugs, V8/SpiderMonkey, sandbox escapes) |
| `train-00006` | Kernel exploitation reasoning - LPE primitives, race conditions, KASLR/SMEP/SMAP bypass logic |

CTF & Competition Reasoning

| Shard | Content |
|---|---|
| `train-00007` | CTF pwn reasoning chains |
| `train-00008` | CTF web exploitation reasoning chains |
| `train-00009` | CTF reversing & binary analysis chains |

Threat Intelligence & Adversary Simulation

| Shard | Content |
|---|---|
| `train-00010` | APT campaign tradecraft - actor-specific toolchain decisions, campaign sequencing, OPSEC reasoning |
| `train-00011` | Ransomware operator playbooks & affiliate tradecraft |
| `train-00012` | State-sponsored implant & C2 framework analysis |

Active Directory & Enterprise Network

| Shard | Content |
|---|---|
| `train-00013` | Active Directory attack chains - Kerberoasting, AS-REP, ADCS ESC1–ESC13, ACL abuse, delegation |
| `train-00014` | LOLBAS / LOLDrivers / GTFOBins operational reasoning |
| `train-00015` | Cloud attack paths - AWS, Azure, GCP IAM privesc and cross-service pivots |

Web Application & API Exploitation

| Shard | Content |
|---|---|
| `train-00016` | PayloadsAllTheThings structured exploitation reasoning |
| `train-00017` | Bug bounty root cause reasoning - HackerOne disclosed reports |
| `train-00018` | Web cache poisoning, HTTP desync & request smuggling |
| `train-00019` | OAuth, OIDC & SSO attack reasoning |

Malware Analysis, ICS/OT, Embedded, and More

| Shard | Content |
|---|---|
| `train-00020` | Malware analysis reasoning - dynamic + static |
| `train-00021` | Obfuscation & packer analysis chains |
| `train-00022` | ICS/SCADA attack reasoning (PLC logic abuse, HMI pivots, Industroyer/TRITON analyses) |
| `train-00023` | Firmware analysis & embedded exploitation |
| `train-00024` | ired.team operator notes - process injection, AV evasion, OPSEC tradecraft |
| `train-00025` | Proving Grounds / HTB retired machine reasoning chains |

As the dataset matures, the MCQ scaffolding will gradually be replaced by open-ended vulnerability research tasks with no labeled answer choices, only a reasoning process and a conclusion, mirroring the actual cognitive structure of zero-day discovery.

 

## Intended Use

 

**Primary use cases:**

- Supervised fine-tuning for penetration testing and red team LLMs

- Training adversarial reasoning and systematic distractor elimination

- Process reward model training using the isolated `think` column (GRPO, DPO, RLHF)

- Building autonomous vulnerability research agents

- Security certification preparation (OSCP, OSED, GREM, GPEN, GXPN)

- Threat emulation and adversary simulation training
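For GRPO-style training on the MCQ shards, the `correct_letter` column can serve as a simple outcome signal. The sketch below is only an illustration. It is an outcome-only reward, not a process reward model, and it assumes the policy is prompted to end its completion with a line such as `Answer: B`, a format the dataset itself does not prescribe. `letter_reward` is a hypothetical name.

```python
import re

# Assumed completion format: the final answer appears as "Answer: <letter>".
_ANSWER = re.compile(r"[Aa]nswer:\s*\(?([A-D])\b")


def letter_reward(completion: str, correct_letter: str) -> float:
    """Return 1.0 if the last stated answer letter is correct, else 0.0."""
    matches = _ANSWER.findall(completion)
    if not matches:
        return 0.0  # no parsable answer counts as incorrect
    return 1.0 if matches[-1] == correct_letter else 0.0


print(letter_reward("Option A fails because ... Answer: B", "B"))
```

Taking the last match lets a completion revise an earlier guess during its reasoning without being penalized for it.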

**Responsible use:** This dataset is intended for legitimate security research, penetration testing education, and the development of defensive AI tools. All techniques are documented in public sources (MITRE ATT&CK, HackTricks, academic research). Users are responsible for ensuring their use complies with applicable laws and ethical guidelines.

## Tools & Validation

The repository includes a suite of tools for dataset validation and format conversion:


```bash
# Install dependencies
make install

# Run all validation checks
make validate

# Individual validation tools
make validate-schema      # Check parquet schema compliance
make validate-think       # Validate reasoning trace quality
make duplicates           # Detect near-duplicate questions
make metrics              # Generate quality metrics report

# Format conversion
make convert              # Convert parquet to JSON
```

### Available Scripts

Located in `tools/scripts/`:

| Script | Purpose |
|---|---|
| `validate_schema.py` | Verify parquet files conform to dataset schema |
| `validate_think_column.py` | Score reasoning traces on quality metrics (word count, option analysis, LLM artifacts, perspective) |
| `duplicate_detector.py` | Find similar/duplicate questions across shards using fuzzy matching |
| `quality_metrics.py` | Generate comprehensive quality reports (row counts, text lengths, completeness %) |
| `parquet_to_json.py` | Convert parquet shards to JSON for analysis |
| `csv_to_dataset_row.py` | Validate and convert CSV rows to dataset format |
| `validate_and_convert.py` | Combined validation + conversion tool |

### Usage Examples

```bash
# Validate all shards
python tools/scripts/validate_schema.py data/

# Check think column quality (threshold 0.7)
python tools/scripts/validate_think_column.py data/ --threshold 0.7

# Find duplicates with verbose output
python tools/scripts/duplicate_detector.py data/ --verbose

# Generate metrics report and export to JSON
python tools/scripts/quality_metrics.py data/ --export metrics.json

# Convert parquet to JSON
python tools/scripts/parquet_to_json.py data/

# Validate and convert CSV in one step
python tools/scripts/validate_and_convert.py --csv new_rows.csv --json output.json

# Validate existing parquet file
python tools/scripts/validate_and_convert.py --validate data/train-00000.parquet
```
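To give a feel for what fuzzy duplicate detection over questions involves, here is a minimal sketch using only the standard library. It is not the implementation in `duplicate_detector.py`: the `near_duplicates` function, the normalization step, and the 0.9 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def near_duplicates(questions, threshold=0.9):
    """Return (i, j, ratio) for every pair whose similarity meets the threshold."""
    normed = [normalize(q) for q in questions]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(normed), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((i, j, ratio))
    return pairs


sample = [
    "Which tool dumps LSASS memory?",
    "which tool   dumps LSASS memory?",
    "How does Kerberoasting obtain service tickets?",
]
print(near_duplicates(sample))
```

The pairwise comparison is quadratic in the number of questions, which is acceptable for a few thousand rows. A larger corpus would likely need blocking or hashing before comparison.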

## Contributing

 

Contributions are welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## Citation \& Acknowledgments

```bibtex
@dataset{theelderemo_pentesting_explanations_2026,
    author       = { Christopher Dickinson },
    title        = { pentesting-explanations },
    year         = 2026,
    url          = { https://huggingface.co/datasets/theelderemo/pentesting-explanations },
    doi          = { 10.57967/hf/8471 },
    publisher    = { Hugging Face }
}
```

**HackTricks** - Special thanks to Carlos Polop and the entire HackTricks community for building and maintaining one of the most comprehensive open-source cybersecurity knowledge bases available. The HackTricks Wiki is the backbone of `train-00000`. [github.com/HackTricks-wiki/hacktricks](https://github.com/HackTricks-wiki/hacktricks)

**MITRE ATT\&CK** - `train-00001` is built on MITRE ATT\&CK STIX data from the [mitre/cti](https://github.com/mitre/cti) repository, licensed under Apache 2.0. ATT\&CK is a globally accessible knowledge base of adversary tactics and techniques based on real-world observations. [attack.mitre.org](https://attack.mitre.org)

**mitreattack-python** - Procedure example and sub-technique extraction powered by the [mitreattack-python](https://github.com/mitre-attack/mitreattack-python) library.

**Base benchmark** - The original evaluation set that seeded `train-00000` is courtesy of [preemware/pentesting-eval](https://huggingface.co/datasets/preemware/pentesting-eval).