https://github.com/archivebox/abx-dl

⬇️ A simple all-in-one CLI tool to download EVERYTHING from a URL (like youtube-dl/yt-dlp, forum-dl, gallery-dl, simpler ArchiveBox). 🎭 Uses headless Chrome to get HTML, JS, CSS, images/video/audio/subtitles, PDFs, screenshots, article text, git repos, and more...

ai-scraping archivebox chrome cli cli-tool crawling curl downloader gallery-dl headless http-client internet-archiving playwright puppeteer scraping wget youtube-dl yt-dlp

Host: GitHub
URL: https://github.com/archivebox/abx-dl
Owner: ArchiveBox
License: mit
Created: 2024-10-21T11:11:54.000Z (8 months ago)
Default Branch: main
Last Pushed: 2024-12-26T07:05:49.000Z (6 months ago)
Last Synced: 2025-03-17T15:11:55.062Z (3 months ago)
Topics: ai-scraping, archivebox, chrome, cli, cli-tool, crawling, curl, downloader, gallery-dl, headless, http-client, internet-archiving, playwright, puppeteer, scraping, wget, youtube-dl, yt-dlp
Language: JavaScript
Homepage: https://docs.sweeting.me/s/archivebox-plugin-ecosystem-announcement#%F0%9F%94%A2-For-the-minimalists-who-just-want-something-simple
Size: 177 KB
Stars: 66
Watchers: 5
Forks: 4
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

# ⬇️ `abx-dl`

> A simple all-in-one CLI tool to auto-detect and download *everything* available from a URL.
> `pip install abx-dl`
> `abx-dl 'https://example.com/page/to/download'`

> [!IMPORTANT]
> ❈ NOT YET RELEASED *Coming Soon...* read the [Plugin Ecosystem Announcement (2024-10)](https://docs.sweeting.me/s/archivebox-plugin-ecosystem-announcement#%F0%9F%94%A2-For-the-minimalists-who-just-want-something-simple)
> <sub>Release ETA: after [`archivebox` `v0.9.0`](https://github.com/ArchiveBox/ArchiveBox/releases/)</sub> 🚀 [Donate to support development!](https://donate.archivebox.io/)

---

✨ *Ever wish you could `yt-dlp`, `gallery-dl`, `wget`, `curl`, `puppeteer`, etc. all in one command?*

`abx-dl` is an all-in-one CLI tool for downloading URLs "by any means necessary".

It's useful for scraping, downloading, OSINT, digital preservation, and more.
`abx-dl` is built to provide a simpler one-shot CLI interface to the [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox) archiving engine (it replaces the old `archivebox oneshot` command).

---

#### 🍜 What does it save?

```bash
abx-dl --extract=title,favicon,headers,wget,media,singlefile,screenshot,pdf,dom,readability,git,... 'https://example.com'
```

`abx-dl` gets everything by default, or you can pass `--extract=...` to run only specific methods. It can save:
- HTML, JS, CSS, images, etc. rendered with a headless browser
- title, favicon, headers, outlinks, and other metadata
- audio, video, subtitles, playlists, comments
- snapshot of the page as a PDF, screenshot, and [Singlefile](https://github.com/gildas-lormeau/single-file-cli) HTML
- article text, `git` source code
- [and much more](https://github.com/ArchiveBox/abx-dl#All-Outputs)...

#### 🧩 How does it work?

Forget about writing janky manual crawling scripts with `JS`/`Python`/`playwright`/`puppeteer`/`bash`.

`abx-dl` renders all URLs passed in a fully-featured modern browser using puppeteer.
It auto-detects a wide variety of embedded resources using plugins, and extracts discovered content out to raw files (`mp4`, `png`, `txt`, `pdf`, `html`, etc.) in the current working directory.
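
Because output lands in the current working directory, downloading several URLs from the same place could mix their files together. One way around this is a separate directory per URL. The following is a minimal sketch, assuming `abx-dl` is on the `PATH`; `download_into` and the `ABX_DL` override variable are hypothetical names used for illustration and are not part of `abx-dl`:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of abx-dl): run one download per directory.
# ABX_DL can be overridden to point at a different binary or a test stub.
ABX_DL="${ABX_DL:-abx-dl}"

download_into() {
  local dir="$1" url="$2"
  mkdir -p "$dir" || return 1
  # The subshell keeps the caller's working directory unchanged.
  ( cd "$dir" && "$ABX_DL" "$url" )
}

# Example (commented out because abx-dl is not yet released):
# download_into ./example.com 'https://example.com'
```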

> `abx-dl` collects all of your favorite powerful scraping and downloading tools, including: `wget`, `wget-lua`, `curl`, `puppeteer`, `playwright`, `singlefile`, `readability`, `yt-dlp`, `forum-dl`, and many more through the **[ABX Plugin Library](https://docs.sweeting.me/s/archivebox-plugin-ecosystem-announcement)** (shared with [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox))...

You no longer have to deal with installing and configuring a bunch of tools individually.

#### ⚙️ What options does it provide?

Pass `--extract=` to get only what you need, and set other config via env vars / args:

- `USER_AGENT`, `CHECK_SSL_VALIDITY`, `CHROME_USER_DATA_DIR`/`COOKIES_TXT`
- `TIMEOUT=60`, `MAX_MEDIA_SIZE=750m`, `RESOLUTION=1440,2000`, `ONLY_NEW=True`
- [and more here](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration)...

<sup>Configuration options apply seamlessly across all methods.</sup>

---

### 📦 ~~Install~~ `Coming Soon...`

```bash
pip install abx-dl
abx-dl install # optional: install any system packages needed
```

### 🔠 Usage

```bash
# Basic usage:
abx-dl [--help|--version] [--config|-c] [--extract=methods] [url]
```

#### Download everything

```bash
abx-dl 'https://example.com'
ls ./
#
```

#### Download just title + screenshot

```bash
abx-dl --extract=title,screenshot 'https://example.com'
ls ./
# index.json title.txt screenshot.png
```

#### Download title + screenshot + html + media

```bash
abx-dl --extract=title,favicon,screenshot,singlefile,media 'https://example.com'
ls ./
# index.json index.html title.txt favicon.ico screenshot.png singlefile.html media/Some_video.mp4
```

#### Pass config options

Config can be persisted via file, set via env vars, or passed via CLI args.
```bash
# set per-user config in ~/.config/abx-dl/abx-dl.conf
abx-dl config --set CHECK_SSL_VALIDITY=True

# environment variables work too and are equivalent
env CHROME_USER_DATA_DIR=~/.config/abx-dl/personas/Default/chrome_profile abx-dl 'https://example.com'

# pass per-run config as CLI args
abx-dl -c MAX_MEDIA_SIZE=250m --extract=title,singlefile,screenshot,media 'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
```
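
When several per-run options are needed, the command line can be assembled from a list of `KEY=VALUE` pairs. This is a minimal sketch rather than anything shipped with `abx-dl`: `abx_dl_cmd` is a hypothetical helper, it assumes `-c` may be repeated once per pair (the example above only shows a single `-c`), and it prints the arguments instead of running them because the tool is not yet released.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of abx-dl): print an abx-dl command line
# built from a URL, a comma-separated method list, and KEY=VALUE config pairs.
# Assumption: -c may be repeated once per KEY=VALUE pair.
abx_dl_cmd() {
  local url="$1" methods="$2"
  shift 2
  local -a cmd=(abx-dl)
  local pair
  for pair in "$@"; do
    cmd+=(-c "$pair")
  done
  # Only add --extract when a method list was given.
  if [ -n "$methods" ]; then
    cmd+=("--extract=$methods")
  fi
  cmd+=("$url")
  # Print one argument per line so the result is easy to inspect.
  printf '%s\n' "${cmd[@]}"
}

abx_dl_cmd 'https://example.com' 'title,screenshot' TIMEOUT=60 MAX_MEDIA_SIZE=250m
```

Once the tool is available, replacing the final `printf` with `"${cmd[@]}"` would run the command instead of printing it.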

---

### All Outputs

- `index.json`, `index.html`
- `title.txt`, `title.json`, `headers.json`, `favicon.ico`
- `example.com/*.{html,css,js,png...}`, `warc/` (saved with `wget-lua`)
- `screenshot.png`, `dom.html`, `output.pdf` (rendered with `chrome`)
- `media/someVideo.mp4`, `media/subtitles`, ... (downloaded with `yt-dlp`)
- `readability/`, `mercury/`, `htmltotext.txt` (article text/markdown)
- `git/` (source code)
- ... [and more via plugin library](https://github.com/ArchiveBox/ArchiveBox#output-formats) ...
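
To see which of the outputs listed above a run actually produced, a download directory can be checked against a list of expected names. This is a minimal sketch: `report_outputs` is a hypothetical helper, the names passed to it are taken from the list above, and which of them exist will depend on the methods that ran.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of abx-dl): report which expected output
# files or directories exist in a given download directory.
report_outputs() {
  local dir="$1"
  shift
  local name
  for name in "$@"; do
    if [ -e "$dir/$name" ]; then
      echo "found   $name"
    else
      echo "missing $name"
    fi
  done
}

# Check the current directory for a few of the names from the list above.
report_outputs . index.json title.txt favicon.ico screenshot.png dom.html output.pdf media git
```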

For more advanced use with collections, parallel downloading, a Web UI + REST API, etc., see [`ArchiveBox/ArchiveBox`](https://github.com/ArchiveBox/ArchiveBox).

---

❈ Created by the ArchiveBox team in Emeryville, California. ❈
