https://github.com/exfil0/pdfdisarm
Advanced PDF Analysis & Disarm Tool is a robust Python-based utility designed to scan, analyze, and neutralize potentially malicious elements in PDF files.
analysis cybersecurity malware pdf pdfdisarm pyhton3 security threatdetection
- Host: GitHub
- URL: https://github.com/exfil0/pdfdisarm
- Owner: exfil0
- Created: 2025-02-12T00:06:11.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-02-12T00:09:14.000Z (4 months ago)
- Last Synced: 2025-02-12T01:23:27.051Z (4 months ago)
- Topics: analysis, cybersecurity, malware, pdf, pdfdisarm, pyhton3, security, threatdetection
- Language: Python
- Size: 19.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Advanced PDF Analysis & Disarm Tool
## Overview
This tool scans, analyzes, and optionally “disarms” PDF files. It provides:
- **PDF Structure Analysis**: Detects keywords, calculates entropy, and identifies malicious indicators such as embedded JavaScript and launch actions.
- **Concurrency**: Uses Python’s ThreadPoolExecutor to process multiple files in parallel.
- **Disarm Mode**: Generates a `.disarmed.pdf` that strips or obfuscates dangerous elements like `/JS`, `/JavaScript`, `/Launch`, etc.
- **Directory Recursion**: Gathers files from a given directory or directories, optionally recursing.
- **Plugin Architecture**: Supports loading custom plugins for scoring or additional checks.
- **Selection Expressions**: Allows filtering results (e.g., show only PDFs with certain suspicious counts).
- **Multiple Output Formats**:
  - Human-readable console output
  - CSV format (one line per file)
  - JSON export function (`PDFiD2JSON`) available for custom usage

Use it at your own risk.
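To make the structure-analysis bullet above more concrete, here is a minimal sketch of keyword counting and Shannon entropy over raw PDF bytes. It is not taken from `pdfscan.py`: the function names `shannon_entropy` and `count_keywords` and the keyword tuple are assumptions for illustration, and the sketch ignores hex-escaped names (such as `/J#53`), which a real scanner would need to handle.

```python
import math
import re
from collections import Counter

# Keywords named in this README; the tool's real list may be longer (assumption).
SUSPICIOUS_KEYWORDS = (b"/JS", b"/JavaScript", b"/Launch")


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def count_keywords(data: bytes) -> dict:
    """Count each keyword, skipping longer names that merely start with it."""
    result = {}
    for keyword in SUSPICIOUS_KEYWORDS:
        # The lookahead keeps /JS from matching inside a name such as /JSON.
        pattern = re.compile(re.escape(keyword) + rb"(?![A-Za-z0-9])")
        result[keyword.decode("ascii")] = len(pattern.findall(data))
    return result
```

High entropy (close to 8 bits per byte) usually points to compressed or encrypted streams, which is why it is reported next to the keyword counts rather than treated as a verdict on its own.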
---
## Installation
1. **Requirements**:
   - Python 3.7+ (recommended)
   - Optional library: `pyzipper` for AES-encrypted ZIP support
   - Standard libraries: `argparse`, `concurrent.futures`, `urllib.request`, etc. (included in most Python installs)

2. **Clone or Download** this script:
```bash
git clone https://github.com/exfil0/PDFdisarm.git
```
*(If this is just an example, use your preferred distribution method.)*

3. **Make It Executable (Linux/Mac)**:
```bash
chmod +x pdfscan.py
```

4. **(Optional) Install pyzipper**:
```bash
pip install pyzipper
```

---
## Usage
### Basic Command
```bash
./pdfscan.py ...
```
- Analyzes each file and prints detailed results to the console.

### Wildcards and Directory Recursion
```bash
./pdfscan.py /path/to/pdfs -r
```
- Recursively scans all files under `/path/to/pdfs`.

### Disarm Mode
```bash
./pdfscan.py malicious.pdf --disarm
```
- Creates `malicious.disarmed.pdf` with potentially malicious elements neutralized.

### CSV Output
```bash
./pdfscan.py /path/to/pdfs -r --csv -o results.csv
```
- Outputs a single CSV with all scan results, one row per file.
- If `-o` is not specified, CSV goes to `stdout`.

### Selecting Files by Condition
```bash
./pdfscan.py *.pdf --select="pdf.js.count > 0"
```
- Only shows results for files where the JavaScript (`/JS`) count is greater than zero.

### Plugin Usage
```bash
./pdfscan.py suspicious.pdf --plugins=MyPlugin.py --csv
```
- Loads a custom plugin (`MyPlugin.py`) which can provide additional scoring or checks.

### Threading
```bash
./pdfscan.py /path/to/pdfs --threads 8
```
- Uses 8 worker threads to speed up scanning across many files.

---
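For readers curious how the `--threads` option could map onto `ThreadPoolExecutor`, the following is a rough sketch only. `scan_file` and `scan_all` are hypothetical names, and the per-file work here is just a header check standing in for the real analysis.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scan_file(path):
    """Hypothetical per-file scan: only checks for a PDF header."""
    with open(path, "rb") as handle:
        head = handle.read(1024)
    return {"path": path, "has_pdf_header": b"%PDF-" in head}


def scan_all(paths, threads=4):
    """Scan every path with a pool of worker threads, collecting results by path."""
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(scan_file, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except OSError as exc:
                # One unreadable file should not abort the whole batch.
                results[path] = {"path": path, "error": str(exc)}
    return results
```

Threads suit this kind of workload because most of the time is spent waiting on file reads; results arrive in completion order, so they are keyed by path instead of relying on input order.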
## Command-Line Options
- **`files`** (positional):
  - One or more file paths, directory paths, or wildcard patterns.
- **`-r, --recursedir`**: Recurse into subdirectories when a directory is provided.
- **`-o, --output`**: Specify output file (CSV only).
- **`--all`**: Show all recognized PDF keywords (even non-standard ones).
- **`--extra`**: Collect extra data such as dates and entropy.
- **`--force`**: Force scanning even if the PDF header is missing.
- **`--disarm`**: Write a disarmed copy of each PDF as `.disarmed.pdf`.
- **`--select`**: Python expression to filter results, e.g. `pdf.js.count>0`.
- **`--nozero`**: Suppress printing zero counts in console output.
- **`--threads`**: Number of parallel worker threads (default=4).
- **`--scan`**: Legacy option, similar to scanning a directory.
- **`--plugins`**: Comma-separated list of plugin `.py` files to load.
- **`--pluginoptions`**: Additional string to pass to plugins.
- **`--csv`**: Output results to CSV (to file if `-o` is specified, else stdout).
- **`--minimumscore`**: Only show files or plugin results that meet or exceed this numeric score.
- **`--verbose`**: Print detailed tracebacks on errors.

---
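The `--select` option listed above takes a Python expression such as `pdf.js.count>0`. The README does not say how the script evaluates it, so the sketch below is only one plausible approach: a hypothetical `matches` helper that evaluates the expression against an object exposing `js.count`, built here from `SimpleNamespace` stand-ins.

```python
from types import SimpleNamespace


def matches(expression, pdf):
    """Evaluate a selection expression with `pdf` as the only visible name.

    eval() runs arbitrary code, so only pass expressions you wrote yourself.
    """
    return bool(eval(expression, {"__builtins__": {}}, {"pdf": pdf}))


# Stand-in result objects; the real result type in pdfscan.py is an assumption.
with_js = SimpleNamespace(js=SimpleNamespace(count=2))
clean = SimpleNamespace(js=SimpleNamespace(count=0))

selected = [r for r in (with_js, clean) if matches("pdf.js.count > 0", r)]
```

Because the expression is ordinary Python, comparisons can be combined with `and` / `or`, which is presumably what makes the option more flexible than a fixed set of filter flags.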
## Example Workflows
1. **Single File Quick Scan**
```bash
./pdfscan.py mydocument.pdf
```
Displays a detailed report (keywords, potential malicious actions) in the console.

2. **Multiple PDFs, CSV Output**
```bash
./pdfscan.py /opt/pdfs/*.pdf --csv -o results.csv
```
Gathers results in `results.csv`, easy to import into Excel.

3. **Full Directory Disarm**
```bash
./pdfscan.py /opt/malware-pdfs -r --disarm
```
Recursively generates `*.disarmed.pdf` copies.

---
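As an illustration of what the disarm workflow might do to each file, the sketch below neutralizes dangerous names by swapping their letter case, so a viewer no longer recognizes them while every byte offset in the file stays the same. This is an assumed technique, not confirmed to be what `pdfscan.py` does; `disarm_bytes`, `disarmed_path`, and the name list are hypothetical.

```python
import re
from pathlib import Path

# Names this README mentions; the tool's actual list is likely longer (assumption).
DANGEROUS_NAMES = (b"/JS", b"/JavaScript", b"/Launch")


def disarm_bytes(data: bytes) -> bytes:
    """Swap the case of dangerous names, keeping the overall length unchanged."""
    for name in DANGEROUS_NAMES:
        # The lookahead avoids touching longer names that merely share a prefix.
        pattern = re.compile(re.escape(name) + rb"(?![A-Za-z0-9])")
        data = pattern.sub(name.swapcase(), data)
    return data


def disarmed_path(path) -> Path:
    """Map e.g. malicious.pdf to malicious.disarmed.pdf in the same directory."""
    source = Path(path)
    return source.with_name(source.stem + ".disarmed" + source.suffix)


def disarm_file(path) -> Path:
    """Write a disarmed copy next to the original and return its path."""
    target = disarmed_path(path)
    target.write_bytes(disarm_bytes(Path(path).read_bytes()))
    return target
```

Keeping the replacement the same length as the original name matters because PDF cross-reference tables store byte offsets; removing bytes outright would invalidate them. Names hidden inside compressed streams are not reached by a byte-level pass like this one.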
## Plugin Notes
- **Plugin Classes** must subclass `cPluginParent`.
- The script automatically discovers plugin classes from the loaded files.
- Each plugin typically implements a `Score()` method returning a numeric score.

---
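The plugin notes above can be pictured with a small sketch. The real `cPluginParent` lives in the script and its constructor signature is not documented here, so this example defines a stand-in base class and assumes each plugin receives a dictionary of keyword counts. The plugin class name and the scoring weights are likewise made up for illustration.

```python
class cPluginParent:
    """Stand-in for the base class provided by the script (assumed shape)."""

    name = "unnamed plugin"

    def __init__(self, keyword_counts, options=""):
        # keyword_counts: mapping such as {"/JS": 1}; an assumption of this sketch.
        self.keyword_counts = keyword_counts
        self.options = options

    def Score(self):
        return 0.0


class cJavaScriptLaunchPlugin(cPluginParent):
    """Hypothetical plugin: scores PDFs that combine scripting and launch actions."""

    name = "JavaScript and launch-action scorer"

    def Score(self):
        counts = self.keyword_counts
        has_js = counts.get("/JS", 0) + counts.get("/JavaScript", 0) > 0
        has_launch = counts.get("/Launch", 0) > 0
        # Arbitrary illustrative weights: 0.5 for each indicator present.
        return 0.5 * has_js + 0.5 * has_launch
```

A score produced this way could then be compared against `--minimumscore` to decide whether a file is reported.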
## Disclaimer
Authored and maintained by **Exfil0**.
**No warranties** are provided. Use at your own risk.

Feel free to adapt and redistribute **with attribution** to Exfil0.