https://github.com/theelderemo/pentesting-explanations
A high-quality supervised fine-tuning dataset for penetration testing expertise, red team tradecraft, and - as the dataset matures - novel vulnerability research and zero-day reasoning. The dataset is structured to teach models how to think like offensive security practitioners, not merely recall labels or technique names.
cybersecurity dataset hacking hacking-tool huggingface machine-learning mitre-attack offensive-security penetration-testing red-team supervised-learning
Last synced: 28 days ago
- Host: GitHub
- URL: https://github.com/theelderemo/pentesting-explanations
- Owner: theelderemo
- License: apache-2.0
- Created: 2026-04-22T17:02:38.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-23T08:56:15.000Z (2 months ago)
- Last Synced: 2026-04-23T10:33:38.378Z (2 months ago)
- Topics: cybersecurity, dataset, hacking, hacking-tool, huggingface, machine-learning, mitre-attack, offensive-security, penetration-testing, red-team, supervised-learning
- Language: Python
- Homepage: https://huggingface.co/datasets/theelderemo/pentesting-explanations
- Size: 14.8 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# Pentesting Explanations - Adversarial Reasoning & Vulnerability Research
**A supervised fine-tuning dataset for adversarial reasoning, penetration testing expertise, and vulnerability research.**
Most cybersecurity datasets teach models to *recognize* known things: given a technique name, output its description; given a scenario, classify the attack vector. This is label memorization, and it produces models that fail the moment they encounter an unfamiliar codebase, a novel vulnerability class, or a non-textbook attack chain.
This dataset is built around a different objective: teaching models to reason through offensive security problems the way an expert practitioner does. Every row is designed to produce genuine deliberation. The `think` column is a live reasoning trace, option-by-option and hypothesis-by-hypothesis, written from the attacker's perspective, with dead ends included.
The long-term goal is to train models capable of genuine adversarial reasoning: hypothesis formation from unfamiliar code, data-flow tracing, variant hunting across patch history, and exploit primitive construction.
> **Dataset hosted on Hugging Face:** [theelderemo/pentesting-explanations](https://huggingface.co/datasets/theelderemo/pentesting-explanations)
**This dataset is not:**
- A defensive or blue team dataset. Every question and reasoning trace is written from the attacker's perspective.
- A detection or mitigation dataset. Questions never ask how to detect, alert on, or remediate techniques.
- A label memorization dataset. The goal is never "what is this called"; it is always "how does an operator think through this decision."
## Quick Start
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Full dataset
ds = load_dataset("theelderemo/pentesting-explanations")

# HackTricks only (load_dataset selects a config with the `name` argument)
hacktricks = load_dataset("theelderemo/pentesting-explanations", name="hacktricks")

# MITRE ATT&CK only
mitre = load_dataset("theelderemo/pentesting-explanations", name="mitre_attack")

# Isolated CoT for process reward / GRPO
think_only = ds["train"]["think"]

# SFT with apply_chat_template
tokenizer = AutoTokenizer.from_pretrained("your-model")
for row in ds["train"]:
    formatted = tokenizer.apply_chat_template(row["messages"], tokenize=False)
```
## Schema
All shards share this schema. MCQ columns will be `null` for future shards using open-ended formats.
| Column | Type | Description |
|---|---|---|
| `question` | string | Multiple-choice question, framed from attacker perspective |
| `choices` | list[str] | Four answer options (A–D). Distractors use real tools/commands in incorrect contexts |
| `answer_idx` | int | Zero-based index of correct answer (0–3) |
| `correct_letter` | string | Letter of correct answer (A, B, C, or D) |
| `correct_choice` | string | Full text of the correct answer option |
| `explanation` | string | Expert explanation: correct answer justification + per-option debunking, attacker perspective |
| `prompt` | string | Full formatted prompt (system context + question + options) |
| `response` | string | Bolded answer header + full explanation |
| `think` | string | Isolated CoT deliberation. Option-by-option reasoning. Minimum 150 words. No answer restatement. |
| `messages` | list[dict] | SFT-ready `[{"role": "user", ...}, {"role": "assistant", "content": "......"}]` |
The `think` field is deliberately separated from `response` so process reward models can supervise the reasoning trace independently of the final answer.
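The relationships between the MCQ columns can be checked mechanically. The following is a minimal sketch, not the repository's `validate_schema.py`: the function name `check_mcq_row` and the sample row are hypothetical, and it assumes that `correct_choice` equals `choices[answer_idx]` verbatim, as the column descriptions suggest.

```python
LETTERS = "ABCD"


def check_mcq_row(row: dict) -> list:
    """Return a list of consistency problems for one MCQ row (empty if none)."""
    problems = []
    choices = row["choices"]
    idx = row["answer_idx"]

    if len(choices) != 4:
        problems.append(f"expected 4 choices, got {len(choices)}")
    if not 0 <= idx < len(choices):
        problems.append(f"answer_idx {idx} out of range")
        return problems  # the remaining checks need a valid index

    if idx >= len(LETTERS) or row["correct_letter"] != LETTERS[idx]:
        problems.append("correct_letter does not match answer_idx")
    # Assumption: correct_choice stores the option text without a letter prefix.
    if row["correct_choice"] != choices[idx]:
        problems.append("correct_choice does not match choices[answer_idx]")
    if len(row["think"].split()) < 150:
        problems.append("think is shorter than 150 words")
    return problems


# Hypothetical row for illustration only (not taken from the dataset).
sample = {
    "choices": ["option a", "option b", "option c", "option d"],
    "answer_idx": 2,
    "correct_letter": "C",
    "correct_choice": "option c",
    "think": "word " * 150,
}
print(check_mcq_row(sample))  # []
```

Rows from future open-ended shards, where the MCQ columns are `null`, would need to be skipped before calling a check like this.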
## Dataset Structure
The dataset uses a **one-source-per-parquet-file** design. Each data source lives in its own numbered shard so you can load exactly what you need without filter logic.
```python
from datasets import load_dataset

# Select specific shards
ds = load_dataset(
    "theelderemo/pentesting-explanations",
    data_files={"train": ["data/train-00000.parquet", "data/train-00001.parquet"]},
)
```
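Because shards follow a fixed naming pattern, the `data_files` mapping can be generated from shard numbers. This is a small sketch; the helper name `shard_files` is hypothetical, and it assumes every shard lives at `data/train-NNNNN.parquet` as in the example above.

```python
def shard_files(shard_ids, split="train"):
    """Build a data_files mapping for the given shard numbers."""
    unique_ids = sorted(set(shard_ids))
    return {split: [f"data/{split}-{i:05d}.parquet" for i in unique_ids]}


files = shard_files([1, 0])
print(files)
# {'train': ['data/train-00000.parquet', 'data/train-00001.parquet']}
```

The resulting dictionary can be passed as the `data_files` argument of `load_dataset`.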
### Current Sources
#### `train-00000` - HackTricks + Base Eval (3,228 rows)
Built on [preemware/pentesting-eval](https://huggingface.co/datasets/preemware/pentesting-eval) and augmented with [HackTricks Wiki](https://github.com/HackTricks-wiki/hacktricks), processed into 5,404 cleaned Markdown chunks across 126 technical domains:
- Active Directory attacks (Kerberoasting, AS-REP Roasting, Pass-the-Hash, DCSync, ADCS abuse, ACL/delegation attacks)
- Web application exploitation (SQLi, XSS, SSRF, XXE, IDOR, deserialization, JWT attacks, OAuth abuse)
- Linux privilege escalation (SUID/SGID, capabilities, cron, container escapes)
- Windows privilege escalation (token impersonation, service misconfigurations, AlwaysInstallElevated)
- Network attacks (LLMNR/NBT-NS poisoning, SMB relay, Kerberos attacks)
- Cloud misconfigurations and exploitation paths (AWS, Azure, GCP)
- Malware analysis (static/dynamic, sandbox evasion, unpacking)
- Mobile security (Android, iOS)
- Network services (FTP, SSH, SMTP, SNMP, RDP, WinRM)
- Cryptographic attacks
Questions are generated with misconception-based distractors: wrong options that use real tools, real commands, and real techniques, but are incorrect for the specific context being tested.
#### `train-00001` - MITRE ATT&CK Enterprise + Mobile + ICS (2,678 rows)
Built from [mitre/cti](https://github.com/mitre/cti) STIX bundles (ATT&CK version sourced at generation time). All revoked and deprecated techniques excluded.
| Domain | Techniques |
|---|---|
| Enterprise | 691 |
| Mobile | 124 |
| ICS / OT | 79 |
| **Total** | **894** |
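Excluding revoked and deprecated techniques from a STIX bundle could look like the sketch below. It is an illustration rather than the generation pipeline used for this shard: the function name `active_techniques` and the hand-written bundle are hypothetical, and it relies only on the `revoked` and `x_mitre_deprecated` properties that ATT&CK STIX objects carry.

```python
def active_techniques(bundle: dict) -> list:
    """Return attack-pattern objects that are neither revoked nor deprecated."""
    return [
        obj
        for obj in bundle.get("objects", [])
        if obj.get("type") == "attack-pattern"
        and not obj.get("revoked", False)
        and not obj.get("x_mitre_deprecated", False)
    ]


# Tiny hand-written bundle; the names are placeholders, not real ATT&CK entries.
bundle = {
    "type": "bundle",
    "objects": [
        {"type": "attack-pattern", "name": "technique-1"},
        {"type": "attack-pattern", "name": "technique-2", "revoked": True},
        {"type": "attack-pattern", "name": "technique-3", "x_mitre_deprecated": True},
        {"type": "malware", "name": "not-a-technique"},
    ],
}
print([t["name"] for t in active_techniques(bundle)])  # ['technique-1']

# The per-domain counts in the table add up to the stated total.
assert 691 + 124 + 79 == 894
```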
Three question angles per technique:
| Angle | What it tests |
|---|---|
| `offensive-mechanics / how-it-works` | What the technique does and how an attacker executes it |
| `operator-tradecraft / tool-and-command` | Specific tooling, commands, flags, and payloads |
| `privilege-and-platform / preconditions` | Required access level, target OS, environment preconditions |
Every question is grounded in real-world procedure examples pulled from ATT&CK STIX bundles via `mitreattack-python`, so distractors reference actual threat actor tooling (Mimikatz, Cobalt Strike, Impacket, CrackMapExec, BloodHound, etc.) in plausible-but-incorrect contexts.
## Planned Shards
The dataset has a clear trajectory toward training models capable of novel vulnerability discovery. Each planned shard targets a reasoning primitive currently absent from public security training data.
### Vulnerability Research & Exploit Development
| Shard | Content |
|---|---|
| `train-00002` | CVE patch diff analysis - root cause + variant hunting reasoning |
| `train-00003` | OSS-Fuzz source code audit traces (input boundary → data flow → primitive identification) |
| `train-00004` | Exploit primitive → weaponization reasoning (UAF/OOB/type confusion → heap grooming → ROP chains) |
| `train-00005` | Browser & renderer exploit chains (JIT bugs, V8/SpiderMonkey, sandbox escapes) |
| `train-00006` | Kernel exploitation reasoning - LPE primitives, race conditions, KASLR/SMEP/SMAP bypass logic |
### CTF & Competition Reasoning
| Shard | Content |
|---|---|
| `train-00007` | CTF pwn reasoning chains |
| `train-00008` | CTF web exploitation reasoning chains |
| `train-00009` | CTF reversing & binary analysis chains |
### Threat Intelligence & Adversary Simulation
| Shard | Content |
|---|---|
| `train-00010` | APT campaign tradecraft - actor-specific toolchain decisions, campaign sequencing, OPSEC reasoning |
| `train-00011` | Ransomware operator playbooks & affiliate tradecraft |
| `train-00012` | State-sponsored implant & C2 framework analysis |
### Active Directory & Enterprise Network
| Shard | Content |
|---|---|
| `train-00013` | Active Directory attack chains - Kerberoasting, AS-REP, ADCS ESC1–ESC13, ACL abuse, delegation |
| `train-00014` | LOLBAS / LOLDrivers / GTFOBins operational reasoning |
| `train-00015` | Cloud attack paths - AWS, Azure, GCP IAM privesc and cross-service pivots |
### Web Application & API Exploitation
| Shard | Content |
|---|---|
| `train-00016` | PayloadsAllTheThings structured exploitation reasoning |
| `train-00017` | Bug bounty root cause reasoning - HackerOne disclosed reports |
| `train-00018` | Web cache poisoning, HTTP desync & request smuggling |
| `train-00019` | OAuth, OIDC & SSO attack reasoning |
### Malware Analysis, ICS/OT, Embedded, and More
| Shard | Content |
|---|---|
| `train-00020` | Malware analysis reasoning - dynamic + static |
| `train-00021` | Obfuscation & packer analysis chains |
| `train-00022` | ICS/SCADA attack reasoning (PLC logic abuse, HMI pivots, Industroyer/TRITON analyses) |
| `train-00023` | Firmware analysis & embedded exploitation |
| `train-00024` | ired.team operator notes - process injection, AV evasion, OPSEC tradecraft |
| `train-00025` | Proving Grounds / HTB retired machine reasoning chains |
As the dataset matures, the MCQ scaffolding will gradually be replaced by open-ended vulnerability research tasks with no labeled answer choices - only a reasoning process and a conclusion, mirroring the actual cognitive structure of zero-day discovery.
## Intended Use
**Primary use cases:**
- Supervised fine-tuning for penetration testing and red team LLMs
- Training adversarial reasoning and systematic distractor elimination
- Process reward model training using the isolated `think` column (GRPO, DPO, RLHF)
- Building autonomous vulnerability research agents
- Security certification preparation (OSCP, OSED, GREM, GPEN, GXPN)
- Threat emulation and adversary simulation training
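For reinforcement-style training such as GRPO, the MCQ columns can supply a simple outcome signal alongside the `think` trace. The sketch below assumes model completions state their choice as `Answer: C` or similar. That output format, and the names `extract_letter` and `outcome_reward`, are assumptions for illustration and are not part of the dataset or its tooling.

```python
import re
from typing import Optional

# Matches "Answer: C", "answer - (B)", "**Answer: D**", etc. The letter itself
# is case-sensitive so that ordinary words after "answer" are not picked up.
_ANSWER_RE = re.compile(r"(?i:answer)\s*[:\-]?\s*\(?([A-D])\)?\b")


def extract_letter(completion: str) -> Optional[str]:
    """Return the last answer letter stated in a completion, or None."""
    matches = _ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


def outcome_reward(completion: str, correct_letter: Optional[str]) -> float:
    """1.0 if the stated letter equals the row's correct_letter, else 0.0."""
    predicted = extract_letter(completion)
    if predicted is None or correct_letter is None:
        return 0.0  # no stated answer, or an open-ended row without MCQ labels
    return 1.0 if predicted == correct_letter else 0.0


print(outcome_reward("Option A fails here, so... Answer: C", "C"))  # 1.0
print(outcome_reward("I am not sure which option applies.", "C"))   # 0.0
```

A signal like this only scores the final choice; supervising the reasoning itself still requires comparing against, or training a reward model on, the `think` column.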
**Responsible use:** This dataset is intended for legitimate security research, penetration testing education, and the development of defensive AI tools. All techniques are documented in public sources (MITRE ATT&CK, HackTricks, academic research). Users are responsible for ensuring their use complies with applicable laws and ethical guidelines.
## Tools & Validation
The repository includes a suite of tools for dataset validation and format conversion:
```bash
# Install dependencies
make install
# Run all validation checks
make validate
# Individual validation tools
make validate-schema # Check parquet schema compliance
make validate-think # Validate reasoning trace quality
make duplicates # Detect near-duplicate questions
make metrics # Generate quality metrics report
# Format conversion
make convert # Convert parquet to JSON
```
### Available Scripts
Located in `tools/scripts/`:
| Script | Purpose |
|---|---|
| `validate_schema.py` | Verify parquet files conform to dataset schema |
| `validate_think_column.py` | Score reasoning traces on quality metrics (word count, option analysis, LLM artifacts, perspective) |
| `duplicate_detector.py` | Find similar/duplicate questions across shards using fuzzy matching |
| `quality_metrics.py` | Generate comprehensive quality reports (row counts, text lengths, completeness %) |
| `parquet_to_json.py` | Convert parquet shards to JSON for analysis |
| `csv_to_dataset_row.py` | Validate and convert CSV rows to dataset format |
| `validate_and_convert.py` | Combined validation + conversion tool |
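To illustrate what fuzzy duplicate detection involves, here is a minimal standard-library sketch. It is not the implementation in `duplicate_detector.py`: the function names, the 0.9 threshold, and the sample questions are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def near_duplicates(questions, threshold=0.9):
    """Return (i, j, ratio) for every pair whose similarity meets the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(questions), 2):
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 3)))
    return pairs


# Hypothetical questions for illustration only.
questions = [
    "Which tool requests service tickets for Kerberoasting?",
    "Which tool  requests service tickets for kerberoasting?",
    "What precondition does AS-REP Roasting require?",
]
print(near_duplicates(questions))  # [(0, 1, 1.0)]
```

Pairwise comparison is quadratic in the number of questions, so a real run across several thousand rows would likely need blocking or hashing to stay fast.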
### Usage Examples
```bash
# Validate all shards
python tools/scripts/validate_schema.py data/
# Check think column quality (threshold 0.7)
python tools/scripts/validate_think_column.py data/ --threshold 0.7
# Find duplicates with verbose output
python tools/scripts/duplicate_detector.py data/ --verbose
# Generate metrics report and export to JSON
python tools/scripts/quality_metrics.py data/ --export metrics.json
# Convert parquet to JSON
python tools/scripts/parquet_to_json.py data/
# Validate and convert CSV in one step
python tools/scripts/validate_and_convert.py --csv new_rows.csv --json output.json
# Validate existing parquet file
python tools/scripts/validate_and_convert.py --validate data/train-00000.parquet
```
## Contributing
Contributions are welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
## Citation & Acknowledgments
```bibtex
@dataset{theelderemo_pentesting_explanations_2026,
author = { Christopher Dickinson },
title = { pentesting-explanations },
year = 2026,
url = { https://huggingface.co/datasets/theelderemo/pentesting-explanations },
doi = { 10.57967/hf/8471 },
publisher = { Hugging Face }
}
```
**HackTricks** - Special thanks to Carlos Polop and the entire HackTricks community for building and maintaining one of the most comprehensive open-source cybersecurity knowledge bases available. The HackTricks Wiki is the backbone of `train-00000`. [github.com/HackTricks-wiki/hacktricks](https://github.com/HackTricks-wiki/hacktricks)
**MITRE ATT&CK** - `train-00001` is built on MITRE ATT&CK STIX data from the [mitre/cti](https://github.com/mitre/cti) repository, licensed under Apache 2.0. ATT&CK is a globally accessible knowledge base of adversary tactics and techniques based on real-world observations. [attack.mitre.org](https://attack.mitre.org)
**mitreattack-python** - Procedure example and sub-technique extraction powered by the [mitreattack-python](https://github.com/mitre-attack/mitreattack-python) library.
**Base benchmark** - The original evaluation set that seeded `train-00000` is courtesy of [preemware/pentesting-eval](https://huggingface.co/datasets/preemware/pentesting-eval).