{"id":50313772,"url":"https://github.com/theelderemo/pentesting-explanations","last_synced_at":"2026-05-28T22:32:06.276Z","repository":{"id":353297903,"uuid":"1218230627","full_name":"theelderemo/pentesting-explanations","owner":"theelderemo","description":"A high-quality supervised fine-tuning dataset for penetration testing expertise, red team tradecraft, and - as the dataset matures - novel vulnerability research and zero-day reasoning. The dataset is structured to teach models how to think like offensive security practitioners, not merely recall labels or technique names.","archived":false,"fork":false,"pushed_at":"2026-04-23T08:56:15.000Z","size":15542,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-23T10:33:38.378Z","etag":null,"topics":["cybersecurity","dataset","hacking","hacking-tool","huggingface","machine-learning","mitre-attack","offensive-security","penetration-testing","red-team","supervised-learning"],"latest_commit_sha":null,"homepage":"https://huggingface.co/datasets/theelderemo/pentesting-explanations","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/theelderemo.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-22T17:02:38.000Z","updated_at":"2026-04-23T08:56:18.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/theelderemo/pentesting-expla
nations","commit_stats":null,"previous_names":["theelderemo/pentesting-explanations"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/theelderemo/pentesting-explanations","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theelderemo%2Fpentesting-explanations","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theelderemo%2Fpentesting-explanations/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theelderemo%2Fpentesting-explanations/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theelderemo%2Fpentesting-explanations/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/theelderemo","download_url":"https://codeload.github.com/theelderemo/pentesting-explanations/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theelderemo%2Fpentesting-explanations/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33629560,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cybersecurity","dataset","hacking","hacking-tool","huggingface","machine-learning","mitre-attack","offensive-security","penetration-testing","red-team","supervised-le
arning"],"created_at":"2026-05-28T22:32:03.138Z","updated_at":"2026-05-28T22:32:06.270Z","avatar_url":"https://github.com/theelderemo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pentesting Explanations - Adversarial Reasoning \u0026 Vulnerability Research\n\n**A supervised fine-tuning dataset for adversarial reasoning, penetration testing expertise, and vulnerability research.**\n\n[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![HuggingFace Dataset](https://img.shields.io/badge/🤗%20Dataset-theelderemo%2Fpentesting--explanations-yellow)](https://huggingface.co/datasets/theelderemo/pentesting-explanations)\n[![Rows](https://img.shields.io/badge/rows-5%2C906-green)]()\n[![Tasks](https://img.shields.io/badge/tasks-SFT%20%7C%20GRPO%20%7C%20PRM-orange)]()\n\nMost cybersecurity datasets teach models to *recognize* known things: given a technique name, output its description; given a scenario, classify the attack vector. This is label memorization, and it produces models that fail the moment they encounter an unfamiliar codebase, a novel vulnerability class, or a non-textbook attack chain.\n\nThis dataset is built around a different objective: teaching models to reason through offensive security problems the way an expert practitioner does. Every row is designed to produce genuine deliberation. 
The `think` column is a live reasoning trace, option-by-option and hypothesis-by-hypothesis, written from the attacker's perspective, with dead ends included.\n \nThe long-term goal is to train models capable of genuine adversarial reasoning: hypothesis formation from unfamiliar code, data-flow tracing, variant hunting across patch history, and exploit primitive construction.\n\n\u003e **Dataset hosted on Hugging Face:** [theelderemo/pentesting-explanations](https://huggingface.co/datasets/theelderemo/pentesting-explanations)\n\n**This dataset is not:**  \n- A defensive or blue team dataset. Every question and reasoning trace is written from the attacker's perspective.\n- A detection or mitigation dataset. Questions never ask how to detect, alert on, or remediate techniques.\n- A label memorization dataset. The goal is never \"what is this called\"; it is always \"how does an operator think through this decision.\"\n\n## Quick Start\n \n```python\nfrom datasets import load_dataset\n \n# Full dataset\nds = load_dataset(\"theelderemo/pentesting-explanations\")\n \n# HackTricks only (the config is selected with the `name` argument)\nds = load_dataset(\"theelderemo/pentesting-explanations\", name=\"hacktricks\")\n \n# MITRE ATT\u0026CK only\nds = load_dataset(\"theelderemo/pentesting-explanations\", name=\"mitre_attack\")\n \n# Isolated CoT for process reward / GRPO\nthink_only = ds[\"train\"][\"think\"]\n \n# SFT with apply_chat_template\nfrom transformers import AutoTokenizer\ntokenizer = AutoTokenizer.from_pretrained(\"your-model\")\nfor row in ds[\"train\"]:\n    formatted = tokenizer.apply_chat_template(row[\"messages\"], tokenize=False)\n```\n\n## Schema\n \nAll shards share this schema. MCQ columns will be `null` for future shards using open-ended formats.\n \n| Column | Type | Description |\n|---|---|---|\n| `question` | string | Multiple-choice question, framed from attacker perspective |\n| `choices` | list[str] | Four answer options (A–D). 
Distractors use real tools/commands in incorrect contexts |\n| `answer_idx` | int | Zero-based index of correct answer (0–3) |\n| `correct_letter` | string | Letter of correct answer (A, B, C, or D) |\n| `correct_choice` | string | Full text of the correct answer option |\n| `explanation` | string | Expert explanation: correct answer justification + per-option debunking, attacker perspective |\n| `prompt` | string | Full formatted prompt (system context + question + options) |\n| `response` | string | Bolded answer header + full explanation |\n| `think` | string | Isolated CoT deliberation. Option-by-option reasoning. Minimum 150 words. No answer restatement. |\n| `messages` | list[dict] | SFT-ready `[{\"role\": \"user\", ...}, {\"role\": \"assistant\", \"content\": \"\u003cthink\u003e...\u003c/think\u003e...\"}]` |\n \nThe `think` field is deliberately separated from `response` so process reward models can supervise the reasoning trace independently of the final answer.\n\n## Dataset Structure\n \nThe dataset uses a **one-source-per-parquet-file** design. 
Each data source lives in its own numbered shard so you can load exactly what you need without filter logic.\n \n```python\nfrom datasets import load_dataset\n\n# Select specific shards\nds = load_dataset(\"theelderemo/pentesting-explanations\", data_files={\n    \"train\": [\"data/train-00000.parquet\", \"data/train-00001.parquet\"]\n})\n```\n\n### Current Sources\n \n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eClick to expand - train-00000\u003c/strong\u003e\u003c/summary\u003e\n\n#### `train-00000` - HackTricks + Base Eval (3,228 rows)\n \nBuilt on [preemware/pentesting-eval](https://huggingface.co/datasets/preemware/pentesting-eval) and augmented with [HackTricks Wiki](https://github.com/HackTricks-wiki/hacktricks), processed into 5,404 cleaned Markdown chunks across 126 technical domains:\n \n- Active Directory attacks (Kerberoasting, AS-REP Roasting, Pass-the-Hash, DCSync, ADCS abuse, ACL/delegation attacks)\n- Web application exploitation (SQLi, XSS, SSRF, XXE, IDOR, deserialization, JWT attacks, OAuth abuse)\n- Linux privilege escalation (SUID/SGID, capabilities, cron, container escapes)\n- Windows privilege escalation (token impersonation, service misconfigurations, AlwaysInstallElevated)\n- Network attacks (LLMNR/NBT-NS poisoning, SMB relay, Kerberos attacks)\n- Cloud misconfigurations and exploitation paths (AWS, Azure, GCP)\n- Malware analysis (static/dynamic, sandbox evasion, unpacking)\n- Mobile security (Android, iOS)\n- Network services (FTP, SSH, SMTP, SNMP, RDP, WinRM)\n- Cryptographic attacks\n\nQuestions are generated with misconception-based distractors: wrong options that use real tools, commands, and techniques that are simply incorrect for the specific context being tested.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eClick to expand - train-00001\u003c/strong\u003e\u003c/summary\u003e\n\n#### `train-00001` - MITRE ATT\u0026CK Enterprise + Mobile + ICS (2,678 rows)\n \nBuilt from [mitre/cti](https://github.com/mitre/cti) STIX 
bundles (ATT\u0026CK version sourced at generation time). All revoked and deprecated techniques excluded.\n \n| Domain | Techniques |\n|---|---|\n| Enterprise | 691 |\n| Mobile | 124 |\n| ICS / OT | 79 |\n| **Total** | **894** |\n \nThree question angles per technique:\n \n| Angle | What it tests |\n|---|---|\n| `offensive-mechanics / how-it-works` | What the technique does and how an attacker executes it |\n| `operator-tradecraft / tool-and-command` | Specific tooling, commands, flags, and payloads |\n| `privilege-and-platform / preconditions` | Required access level, target OS, environment preconditions |\n \nEvery question is grounded in real-world procedure examples pulled from ATT\u0026CK STIX bundles via `mitreattack-python`, so distractors reference actual threat actor tooling (Mimikatz, Cobalt Strike, Impacket, CrackMapExec, BloodHound, etc.) in plausible-but-incorrect contexts.\n \u003c/details\u003e\n\n## Planned Shards\n \nThe dataset has a clear trajectory toward training models capable of novel vulnerability discovery. 
Each planned shard targets a reasoning primitive currently absent from public security training data.\n \n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eVulnerability Research \u0026 Exploit Development\u003c/strong\u003e\u003c/summary\u003e\n\n| Shard | Content |\n|---|---|\n| `train-00002` | CVE patch diff analysis - root cause + variant hunting reasoning |\n| `train-00003` | OSS-Fuzz source code audit traces (input boundary → data flow → primitive identification) |\n| `train-00004` | Exploit primitive → weaponization reasoning (UAF/OOB/type confusion → heap grooming → ROP chains) |\n| `train-00005` | Browser \u0026 renderer exploit chains (JIT bugs, V8/SpiderMonkey, sandbox escapes) |\n| `train-00006` | Kernel exploitation reasoning - LPE primitives, race conditions, KASLR/SMEP/SMAP bypass logic |\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eCTF \u0026 Competition Reasoning\u003c/strong\u003e\u003c/summary\u003e\n\n| Shard | Content |\n|---|---|\n| `train-00007` | CTF pwn reasoning chains |\n| `train-00008` | CTF web exploitation reasoning chains |\n| `train-00009` | CTF reversing \u0026 binary analysis chains |\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eThreat Intelligence \u0026 Adversary Simulation\u003c/strong\u003e\u003c/summary\u003e\n\n| Shard | Content |\n|---|---|\n| `train-00010` | APT campaign tradecraft - actor-specific toolchain decisions, campaign sequencing, OPSEC reasoning |\n| `train-00011` | Ransomware operator playbooks \u0026 affiliate tradecraft |\n| `train-00012` | State-sponsored implant \u0026 C2 framework analysis |\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eActive Directory \u0026 Enterprise Network\u003c/strong\u003e\u003c/summary\u003e\n\n| Shard | Content |\n|---|---|\n| `train-00013` | Active Directory attack chains - Kerberoasting, AS-REP, ADCS ESC1–ESC13, ACL abuse, delegation |\n| `train-00014` | LOLBAS 
/ LOLDrivers / GTFOBins operational reasoning |\n| `train-00015` | Cloud attack paths - AWS, Azure, GCP IAM privesc and cross-service pivots |\n \u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eWeb Application \u0026 API Exploitation\u003c/strong\u003e\u003c/summary\u003e\n\n| Shard | Content |\n|---|---|\n| `train-00016` | PayloadsAllTheThings structured exploitation reasoning |\n| `train-00017` | Bug bounty root cause reasoning - HackerOne disclosed reports |\n| `train-00018` | Web cache poisoning, HTTP desync \u0026 request smuggling |\n| `train-00019` | OAuth, OIDC \u0026 SSO attack reasoning |\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eMalware Analysis, ICS/OT, Embedded, and More\u003c/strong\u003e\u003c/summary\u003e\n\n| Shard | Content |\n|---|---|\n| `train-00020` | Malware analysis reasoning - dynamic + static |\n| `train-00021` | Obfuscation \u0026 packer analysis chains |\n| `train-00022` | ICS/SCADA attack reasoning (PLC logic abuse, HMI pivots, Industroyer/TRITON analyses) |\n| `train-00023` | Firmware analysis \u0026 embedded exploitation |\n| `train-00024` | ired.team operator notes - process injection, AV evasion, OPSEC tradecraft |\n| `train-00025` | Proving Grounds / HTB retired machine reasoning chains | \n\u003c/details\u003e\n\nAs the dataset matures, the MCQ scaffolding is gradually replaced by open-ended vulnerability research tasks with no labeled answer choices - only a reasoning process and a conclusion, mirroring the actual cognitive structure of zero-day discovery.\n \n## Intended Use\n \n**Primary use cases:**\n- Supervised fine-tuning for penetration testing and red team LLMs\n- Training adversarial reasoning and systematic distractor elimination\n- Process reward model training using the isolated `think` column (GRPO, DPO, RLHF)\n- Building autonomous vulnerability research agents\n- Security certification preparation (OSCP, OSED, GREM, GPEN, GXPN)\n- Threat 
emulation and adversary simulation training\n\n**Responsible use:** This dataset is intended for legitimate security research, penetration testing education, and the development of defensive AI tools. All techniques are documented in public sources (MITRE ATT\u0026CK, HackTricks, academic research). Users are responsible for ensuring their use complies with applicable laws and ethical guidelines.\n\n## Tools \u0026 Validation\n\nThe repository includes a suite of tools for dataset validation and format conversion:\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eClick to expand\u003c/strong\u003e\u003c/summary\u003e\n\n```bash\n# Install dependencies\nmake install\n\n# Run all validation checks\nmake validate\n\n# Individual validation tools\nmake validate-schema      # Check parquet schema compliance\nmake validate-think       # Validate reasoning trace quality\nmake duplicates           # Detect near-duplicate questions\nmake metrics              # Generate quality metrics report\n\n# Format conversion\nmake convert              # Convert parquet to JSON\n```\n\n### Available Scripts\n\nLocated in `tools/scripts/`:\n\n| Script | Purpose |\n|---|---|\n| `validate_schema.py` | Verify parquet files conform to dataset schema |\n| `validate_think_column.py` | Score reasoning traces on quality metrics (word count, option analysis, LLM artifacts, perspective) |\n| `duplicate_detector.py` | Find similar/duplicate questions across shards using fuzzy matching |\n| `quality_metrics.py` | Generate comprehensive quality reports (row counts, text lengths, completeness %) |\n| `parquet_to_json.py` | Convert parquet shards to JSON for analysis |\n| `csv_to_dataset_row.py` | Validate and convert CSV rows to dataset format |\n| `validate_and_convert.py` | Combined validation + conversion tool |\n\n### Usage Examples\n\n```bash\n# Validate all shards\npython tools/scripts/validate_schema.py data/\n\n# Check think column quality (threshold 0.7)\npython 
tools/scripts/validate_think_column.py data/ --threshold 0.7\n\n# Find duplicates with verbose output\npython tools/scripts/duplicate_detector.py data/ --verbose\n\n# Generate metrics report and export to JSON\npython tools/scripts/quality_metrics.py data/ --export metrics.json\n\n# Convert parquet to JSON\npython tools/scripts/parquet_to_json.py data/\n\n# Validate and convert CSV in one step\npython tools/scripts/validate_and_convert.py --csv new_rows.csv --json output.json\n\n# Validate existing parquet file\npython tools/scripts/validate_and_convert.py --validate data/train-00000.parquet\n```\n\u003c/details\u003e\n\n## Contributing\n \nContributions are welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n## Citation \\\u0026 Acknowledgments\n\n```bibtex\n@dataset{theelderemo_pentesting_explanations_2026,\n    author       = { Christopher Dickinson },\n    title        = { pentesting-explanations },\n    year         = 2026,\n    url          = { https://huggingface.co/datasets/theelderemo/pentesting-explanations },\n    doi          = { 10.57967/hf/8471 },\n    publisher    = { Hugging Face }\n}\n```\n\n**HackTricks** - Special thanks to Carlos Polop and the entire HackTricks community for building and maintaining one of the most comprehensive open-source cybersecurity knowledge bases available. The HackTricks Wiki is the backbone of `train-00000`. [github.com/HackTricks-wiki/hacktricks](https://github.com/HackTricks-wiki/hacktricks)\n\n**MITRE ATT\\\u0026CK** - `train-00001` is built on MITRE ATT\\\u0026CK STIX data from the [mitre/cti](https://github.com/mitre/cti) repository, licensed under Apache 2.0. ATT\\\u0026CK is a globally accessible knowledge base of adversary tactics and techniques based on real-world observations. 
[attack.mitre.org](https://attack.mitre.org)\n\n**mitreattack-python** - Procedure example and sub-technique extraction powered by the [mitreattack-python](https://github.com/mitre-attack/mitreattack-python) library.\n\n**Base benchmark** - The original evaluation set that seeded `train-00000` is courtesy of [preemware/pentesting-eval](https://huggingface.co/datasets/preemware/pentesting-eval).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftheelderemo%2Fpentesting-explanations","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftheelderemo%2Fpentesting-explanations","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftheelderemo%2Fpentesting-explanations/lists"}