{"id":17623327,"url":"https://github.com/archivebox/abx-dl","last_synced_at":"2026-03-15T16:21:53.956Z","repository":{"id":259002785,"uuid":"876078115","full_name":"ArchiveBox/abx-dl","owner":"ArchiveBox","description":"⬇️ A simple all-in-one CLI tool to download EVERYTHING from a URL (like youtube-dl/yt-dlp, forum-dl, gallery-dl, simpler ArchiveBox). 🎭 Uses headless Chrome to get HTML, JS, CSS, images/video/audio/subtitles, PDFs, screenshots, article text, git repos, and more...","archived":false,"fork":false,"pushed_at":"2024-12-26T07:05:49.000Z","size":181,"stargazers_count":66,"open_issues_count":2,"forks_count":4,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-17T15:11:55.062Z","etag":null,"topics":["ai-scraping","archivebox","chrome","cli","cli-tool","crawling","curl","downloader","gallery-dl","headless","http-client","internet-archiving","playwright","puppeteer","scraping","wget","youtube-dl","yt-dlp"],"latest_commit_sha":null,"homepage":"https://docs.sweeting.me/s/archivebox-plugin-ecosystem-announcement#%F0%9F%94%A2-For-the-minimalists-who-just-want-something-simple","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArchiveBox.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-21T11:11:54.000Z","updated_at":"2025-03-14T19:04:22.000Z","dependencies_parsed_at":"2024-11-26T00:17:24.080Z","dependency_job_id":null,"html_url":"https://github.com/ArchiveBox/abx-dl","commit_stats":{"total_commits":76,"total_committers":1,"mean_commits":76.0,"dds":0.0,"last_synced_commit":"8a5d29606834ba6d48891e3b924becd1717589e6"},"previous_names":["pirate/abx-dl"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Fabx-dl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Fabx-dl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Fabx-dl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveBox%2Fabx-dl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArchiveBox","download_url":"https://codeload.github.com/ArchiveBox/abx-dl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244056425,"owners_count":20390719,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-scraping","archivebox","chrome","cli","cli-tool","crawling","curl","downloader","gallery-dl","headless","http-client","internet-archiving","playwright","puppeteer","scraping","wget","youtube-dl","yt-dlp"],"created_at":"2024-10-22T21:07:54.404Z","updated_at":"2026-03-15T16:21:53.949Z","avatar_url":"https://github.com/ArchiveBox.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ⬇️ `abx-dl`\n\n\u003e A simple all-in-one CLI tool to auto-detect and download *everything* available from a URL.\n\n```bash\nuvx --from abx-dl abx-dl 'https://example.com'\n```\n---\n\n✨ *Ever wish you could `yt-dlp`, `gallery-dl`, `wget`, `curl`, `puppeteer`, etc. all in one command?*\n\n`abx-dl` is an all-in-one CLI tool for downloading URLs \"by any means necessary\".\n\nIt's useful for scraping, downloading, OSINT, digital preservation, and more.\n`abx-dl` provides a simpler one-shot CLI interface to the [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox) plugin ecosystem.\n\n\u003cimg width=\"1000\" height=\"1082\" alt=\"Screenshot 2026-03-11 at 6 53 03 AM\" src=\"https://github.com/user-attachments/assets/4e19d985-1a93-4f65-9970-2565be16b718\" /\u003e\n\n\n---\n\n\u003cbr/\u003e\n\n#### 🍜 What does it save?\n\n```bash\nabx-dl dl --plugins=wget,title,screenshot,pdf,readability,git 'https://example.com'\n```\n\n`abx-dl` runs all plugins by default, or you can specify `--plugins=...` for specific methods:\n- HTML, JS, CSS, images, etc. rendered with a headless browser\n- title, favicon, headers, outlinks, and other metadata\n- audio, video, subtitles, playlists, comments\n- snapshot of the page as a PDF, screenshot, and [Singlefile](https://github.com/gildas-lormeau/single-file-cli) HTML\n- article text, `git` source code\n- [and much more](https://github.com/ArchiveBox/abx-dl#All-Outputs)...\n\n\u003cbr/\u003e\n\n#### 🧩 How does it work?\n\n`abx-dl` uses the **[ABX Plugin Library](https://docs.sweeting.me/s/archivebox-plugin-ecosystem-announcement)** (shared with [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox)) to run a collection of downloading and scraping tools.\n\nPlugins are loaded from the installed `abx-plugins` package (or from `ABX_PLUGINS_DIR` if you override it) and execute hooks in order:\n1. **Crawl hooks** run first (setup/install dependencies like Chrome)\n2. **Snapshot hooks** run per-URL to extract content\n\nEach plugin can output:\n- Files to its output directory\n- JSONL records for status reporting\n- Config updates that propagate to subsequent plugins\n\n\u003cbr/\u003e\n\n#### ⚙️ Configuration\n\nConfiguration is handled via environment variables or persistent config file (`~/.config/abx/config.env`):\n\n```bash\nabx-dl config                        # show all config (global + per-plugin)\nabx-dl config --get WGET_TIMEOUT     # get a specific value\nabx-dl config --set TIMEOUT=120      # set persistently (resolves aliases)\n```\n\nOutput is grouped by section:\n```bash\n# GLOBAL\nTIMEOUT=60\nUSER_AGENT=\"Mozilla/5.0 ...\"\n...\n\n# plugins/wget\nWGET_BINARY=\"wget\"\nWGET_TIMEOUT=60\n...\n\n# plugins/chrome\nCHROME_BINARY=\"chromium\"\n...\n```\n\nCommon options:\n- `TIMEOUT=60` - default timeout for hooks\n- `USER_AGENT` - default user agent string\n- `{PLUGIN}_BINARY` - path to plugin's binary (e.g. `WGET_BINARY`, `CHROME_BINARY`)\n- `{PLUGIN}_ENABLED=true/false` - enable/disable specific plugins\n- `{PLUGIN}_TIMEOUT=120` - per-plugin timeout overrides\n\nAliases are automatically resolved (e.g. `--set USE_WGET=false` saves as `WGET_ENABLED=false`).\n\nOne-off tuning is often easiest via env vars or CLI args:\n\n```bash\nTIMEOUT=120 USER_AGENT='Mozilla/5.0 (abx-dl smoke test)' abx-dl 'https://example.com'\nCHROME_BINARY=/usr/bin/chromium LIB_DIR=./.abx/lib abx-dl --plugins=screenshot,pdf 'https://example.com'\nabx-dl --output=./runs/example --plugins=wget,title --timeout=90 'https://example.com'\n```\n\n\u003cbr/\u003e\n\n---\n\n\u003cbr/\u003e\n\n### 📦 Install\n\n```bash\n# From this repo\nuv sync\nuv run abx-dl 'https://example.com'\n\n# Or run the published CLI without installing it globally\nuvx --from abx-dl abx-dl 'https://example.com'\n\n# Pre-install dependency hooks if you want a deterministic first run\nuv run abx-dl install wget\n```\n\n\u003cbr/\u003e\n\n### 🔠 Usage\n\n```bash\n# Default command - a bare URL archives with all enabled plugins:\nabx-dl 'https://example.com'\n\n# Limit work to a subset of plugins:\nabx-dl --plugins=wget,title,screenshot,pdf 'https://example.com'\n\n# Skip auto-installing missing dependencies (emit warnings instead):\nabx-dl --no-install 'https://example.com'\n\n# Specify output directory:\nabx-dl --output=./downloads 'https://example.com'\n\n# Set timeout:\nabx-dl --timeout=120 'https://example.com'\n```\n\n#### Commands\n\n```bash\nabx-dl \u003curl\u003e                              # Download URL (default shorthand)\nabx-dl plugins                            # Check + show info for all plugins\nabx-dl plugins wget ytdlp git             # Check + show info for specific plugins\nabx-dl install wget ytdlp git             # Pre-install plugin dependencies\nabx-dl config                             # Show all config values\nabx-dl config --get TIMEOUT               # Get a specific config value\nabx-dl config --set TIMEOUT=120           # Set a config value persistently\n```\n\n#### Installing Dependencies\n\nMany plugins require external binaries (e.g., `wget`, `chrome`, `yt-dlp`, `single-file`).\n\nBy default, `abx-dl` lazily auto-installs missing dependencies as needed when you download a URL.\nUse `--no-install` to skip plugins with missing dependencies instead. `install` runs crawl/install hooks ahead of time without downloading a snapshot:\n\n```bash\n# Auto-installs missing deps on-the-fly (default behavior)\nabx-dl 'https://example.com'\n\n# Skip plugins with missing deps, emit warnings instead\nabx-dl --no-install 'https://example.com'\n\n# Install dependencies for specific plugins only\nabx-dl install wget singlefile ytdlp\n\n# Check which dependencies are available/missing\nabx-dl plugins\n```\n\nDependencies are installed to `~/.config/abx/lib/{arch}/` using the appropriate package manager:\n- **pip packages** → `~/.config/abx/lib/{arch}/pip/venv/`\n- **npm packages** → `~/.config/abx/lib/{arch}/npm/`\n- **brew/apt packages** → system locations\n\nYou can override the install location with `LIB_DIR=/path/to/lib abx-dl install wget`.\n\n\u003cbr/\u003e\n\n---\n\n\u003cbr/\u003e\n\n### Output Structure\n\nBy default, `abx-dl` writes results into the current working directory. Each run creates an `index.jsonl` manifest plus one subdirectory per plugin that produced output. If you want to keep runs isolated, `cd` into a scratch directory first or pass `--output=/path/to/run`.\n\n```bash\nmkdir -p /tmp/abx-run \u0026\u0026 cd /tmp/abx-run\nuvx --from abx-dl abx-dl --plugins=title,wget 'https://example.com'\n```\n\n```\n./\n├── index.jsonl             # Snapshot metadata and results (JSONL format)\n├── title/\n│   └── title.txt\n├── favicon/\n│   └── favicon.ico\n├── screenshot/\n│   └── screenshot.png\n├── pdf/\n│   └── output.pdf\n├── dom/\n│   └── output.html\n├── wget/\n│   └── example.com/\n│       └── index.html\n├── singlefile/\n│   └── output.html\n└── ...\n```\n\n\u003cbr/\u003e\n\n### All Outputs\n\n- `index.jsonl` - snapshot metadata and plugin results (JSONL format, ArchiveBox-compatible)\n- `title/title.txt` - page title\n- `favicon/favicon.ico` - site favicon\n- `screenshot/screenshot.png` - full page screenshot (Chrome)\n- `pdf/output.pdf` - page as PDF (Chrome)\n- `dom/output.html` - rendered DOM (Chrome)\n- `wget/example.com/...` - mirrored site files\n- `singlefile/output.html` - single-file HTML snapshot\n- ... and more via plugin library ...\n\n---\n\n### Available Plugins\n\nGenerated from the `abx-plugins` marketplace docs. Each line lists the plugin and the kinds of outputs it can produce.\n\n#### Snapshot / Extraction Plugins\n\n- `ytdlp` - downloads media plus sidecars: audio, video, images/thumbnails, subtitles (`.srt`, `.vtt`), JSON metadata, and text descriptions.\n- `gallerydl` - downloads gallery/media sets as images, videos, JSON sidecars, text sidecars, and ZIP archives.\n- `forumdl` - exports forum/thread archives as JSONL, WARC, and mailbox-style message archives.\n- `git` - clones repository contents including text, binaries, images, audio, video, fonts, and other tracked files.\n- `wget` - mirrors pages and requisites as HTML, WARC, images, CSS, JavaScript, fonts, audio, and video.\n- `archivedotorg` - saves a Wayback Machine archive link as plain text.\n- `favicon` - saves site favicons and touch icons as image files.\n- `modalcloser` - setup helper only; no direct archive files.\n- `consolelog` - saves browser console events as JSONL.\n- `dns` - saves observed DNS activity as JSONL.\n- `ssl` - saves TLS certificate/connection metadata as JSONL.\n- `responses` - saves HTTP response metadata as JSONL and can record referenced text, images, audio, video, apps, and fonts.\n- `redirects` - saves redirect chains as JSONL.\n- `staticfile` - saves non-HTML direct file responses such as PDF, EPUB, images, audio, video, JSON, XML, CSV, ZIP, and generic binary files.\n- `headers` - saves main-document HTTP headers as JSON.\n- `chrome` - manages shared browser state and emits plain-text and JSON runtime metadata.\n- `seo` - saves SEO metadata such as meta tags and Open Graph fields as JSON.\n- `accessibility` - saves the browser accessibility tree as JSON.\n- `infiniscroll` - page-expansion helper only; no direct archive files.\n- `claudechrome` - saves Claude-computer-use interaction results as JSON plus PNG screenshots.\n- `singlefile` - saves a full self-contained page snapshot as HTML.\n- `screenshot` - saves rendered page screenshots as PNG.\n- `pdf` - saves rendered pages as PDF.\n- `dom` - saves fully rendered DOM output as HTML.\n- `title` - saves the final page title as plain text.\n- `readability` - extracts article HTML, plain text, and JSON metadata.\n- `defuddle` - extracts cleaned article HTML, plain text, and JSON metadata.\n- `mercury` - extracts article HTML, plain text, and JSON metadata.\n- `claudecodeextract` - generates cleaned Markdown from other extractor outputs.\n- `htmltotext` - converts archived HTML into plain text.\n- `trafilatura` - extracts article content as plain text, Markdown, HTML, CSV, JSON, and XML/TEI.\n- `papersdl` - downloads academic papers as PDF.\n- `parse_html_urls` - emits discovered links from HTML as JSONL records.\n- `parse_txt_urls` - emits discovered links from text files as JSONL records.\n- `parse_rss_urls` - emits discovered feed entry URLs from RSS/Atom as JSONL records.\n- `parse_netscape_urls` - emits discovered bookmark URLs from Netscape bookmark exports as JSONL records.\n- `parse_jsonl_urls` - emits discovered bookmark URLs from JSONL exports as JSONL records.\n- `parse_dom_outlinks` - emits crawlable rendered-DOM outlinks as JSONL records.\n- `search_backend_sqlite` - writes a searchable SQLite FTS index database.\n- `search_backend_sonic` - pushes content into Sonic search; no local archive files declared.\n- `claudecodecleanup` - writes cleanup/deduplication results as plain text.\n- `hashes` - writes file hash manifests as JSON.\n\n#### Setup / Binary / Utility Plugins\n\n- `npm` - installs npm-provided binaries and exposes Node module paths; no direct archive files.\n- `claudecode` - runs Claude Code over snapshots and emits JSON results.\n- `search_backend_ripgrep` - search helper for archived files; no direct archive files.\n- `puppeteer` - installs/manages Chromium via Puppeteer; no direct archive files.\n- `ublock` - installs uBlock Origin for cleaner browser captures; no direct archive files.\n- `istilldontcareaboutcookies` - installs cookie-banner suppression helpers; no direct archive files.\n- `twocaptcha` - installs/configures CAPTCHA-solving browser helpers; no direct archive files.\n- `pip` - installs Python-based binaries into a managed virtualenv; no direct archive files.\n- `brew` - installs binaries with Homebrew; no direct archive files.\n- `apt` - installs binaries with APT; no direct archive files.\n- `custom` - installs binaries via a custom shell command; no direct archive files.\n- `env` - discovers binaries already on `PATH`; no direct archive files.\n- `base` - shared utilities/test support for other plugins; no direct archive files.\n- `media` - shared namespace/helpers for media-related plugins; no direct archive files.\n\n---\n\n### AI Skill\n\nThis repo includes an `abx-dl` skill for coding agents that need to run the standalone ArchiveBox extractor pipeline without a full ArchiveBox install.\n\n- Skill source: [`skills/abx-dl/SKILL.md`](./skills/abx-dl/SKILL.md)\n- skills.sh page: https://skills.sh/archivebox/abx-dl/abx-dl\n\n---\n\n### Architecture\n\n`abx-dl` is built on these components:\n\n- **`abx_dl/plugins.py`** - Plugin discovery from `abx-plugins` or `ABX_PLUGINS_DIR`\n- **`abx_dl/executor.py`** - Hook execution engine with config propagation\n- **`abx_dl/config.py`** - Environment variable configuration\n- **`abx_dl/cli.py`** - Rich CLI with live progress display\n\n---\n\nFor more advanced use with collections, parallel downloading, a Web UI + REST API, etc.\nSee: [`ArchiveBox/ArchiveBox`](https://github.com/ArchiveBox/ArchiveBox)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchivebox%2Fabx-dl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farchivebox%2Fabx-dl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchivebox%2Fabx-dl/lists"}