Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/archivebox/abx-dl

⬇️ A simple all-in-one CLI tool to download EVERYTHING from a URL (like youtube-dl/yt-dlp, forum-dl, gallery-dl, simpler ArchiveBox). 🎭 Uses headless Chrome to get HTML, JS, CSS, images/video/audio/subtitles, PDFs, screenshots, article text, git repos, and more...
https://github.com/archivebox/abx-dl

ai-scraping archivebox chrome cli cli-tool crawling curl downloader gallery-dl headless http-client internet-archiving playwright puppeteer scraping wget youtube-dl yt-dlp

Last synced: about 2 months ago
JSON representation

⬇️ A simple all-in-one CLI tool to download EVERYTHING from a URL (like youtube-dl/yt-dlp, forum-dl, gallery-dl, simpler ArchiveBox). 🎭 Uses headless Chrome to get HTML, JS, CSS, images/video/audio/subtitles, PDFs, screenshots, article text, git repos, and more...

Awesome Lists containing this project

README

        

# ⬇️ `abx-dl`

> A simple all-in-one CLI tool to auto-detect and download *everything* available from a URL.
> `pip install abx-dl`
> `abx-dl 'https://example.com/page/to/download'`

> [!IMPORTANT]
> ❈ NOT YET RELEASED *Coming Soon...* read the [Plugin Ecosystem Announcement (2024-10)](https://docs.sweeting.me/s/archivebox-plugin-ecosystem-announcement#%F0%9F%94%A2-For-the-minimalists-who-just-want-something-simple)
> Release ETA: after [`archivebox` `v0.9.0`](https://github.com/ArchiveBox/ArchiveBox/releases/)                         🚀 [Donate to support development!](https://donate.archivebox.io/)

---

✨ *Ever wish you could `yt-dlp`, `gallery-dl`, `wget`, `curl`, `puppeteer`, etc. all in one command?*

`abx-dl` is an all-in-one CLI tool for downloading URLs "by any means necessary".

It's useful for scraping, downloading, OSINT, digital preservation, and more.
`abx-dl` is built to provide a simpler one-shot CLI interface to the [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox) archiving engine.

---


#### 🍜 What does it save?

```python
abx-dl --extract=title,favicon,headers,wget,media,singlefile,screenshot,pdf,dom,readability,git,... 'https://example.com'`
```

`abx-dl` gets everything by default, or you can tell it to `--extract=...` specific methods:
- HTML, JS, CSS, images, etc. rendered with a headless browser
- title, favicon, headers, outlinks, and other metadata
- audio, video, subtitles, playlists, comments
- snapshot of the page as a PDF, screenshot, and [Singlefile](https://github.com/gildas-lormeau/single-file-cli) HTML
- article text, `git` source code, [and much more](https://github.com/ArchiveBox/abx-dl#All-Outputs)...


#### 🧩 How does it work?

Forget about writing janky manual crawling scripts with `JS`/`Python`/`playwright`/`puppeteer`/`bash`.

`abx-dl` renders all URLs passed in a fully-featured modern browser using puppeteer.
It auto-detects a wide variety of embedded resources using plugins, and extracts discovered content out to raw files (`mp4`, `png`, `txt`, `pdf`, `html`, etc.) in the current working directory.

> `abx-dl` collects all of your favorite powerful scraping and downloading tools, including: `wget`, `wget-lua`, `curl`, `puppeteer`, `playwright`, `singlefile`, `readability`, `yt-dlp`, `forum-dl`, and many more through the **[ABX Plugin Library](https://docs.sweeting.me/s/archivebox-plugin-ecosystem-announcement)** (shared with [ArchiveBox](https://github.com/ArchiveBox/ArchiveBox))...

You no longer have to deal with installing and configuring a bunch of tools individually.


#### ⚙️ What options does it provide?

Pass `--exctract=` to get only what you need, and set other config via env vars / args:

- `USER_AGENT`, `CHECK_SSL_VALIDITY`, `CHROME_USER_DATA_DIR`/`COOKIES_TXT`
- `TIMEOUT=60`, `MAX_MEDIA_SIZE=750m`, `RESOLUTION=1440,2000`, `ONLY_NEW=True`
- [and more here](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration)...

Configuration options apply seamlessly across all methods.


---


### 📦 ~~Install~~ `Coming Soon...`

```bash
pip install abx-dl[all]
abx-dl install # optional: install any system packages needed
```

If you don't need everything in abx-dl[all], you can pick and choose individual pieces...

🪶 Lightweight Install


pip install abx-dl[favicon,wget,singlefile,readability,git]

abx-dl install wget,singlefile,readability
abx-dl --extract=wget,singlefile,... 'https://example.com'


### 🔠 Usage

```bash
# Basic usage:
abx-dl [--help|--version] [--config|-c] [--extract=methods] [url]
```

#### Download everything

```bash
abx-dl 'https://example.com'
ls ./
#
```

#### Download just title + screenshot

```bash
abx-dl --extract=title,screenshot 'https://example.com'
ls ./
# index.json title.txt screenshot.png
```

#### Download title + screenshot + html + media

```bash
abx-dl --extract=title,favicon,screenshot,singlefile,media 'https://example.com'
ls ./
# index.json index.html title.txt favicon.ico screenshot.png singlefile.html media/Some_video.mp4
```

#### Pass config options

Config can be persisted via file, set via env vars, or passed via CLI args.
```bash
# set per-user config in ~/.config/abx-dl/abx-dl.conf
abx-dl config --set CHECK_SSL_VALIDITY=True

# environment variables work too and are equivalent
env CHROME_USER_DATA_DIR=~/.config/abx-dl/personas/Default/chrome_profile

# pass per-run config as CLI args
abx-dl -c MAX_MEDIA_SIZE=250m --extract=title,singlefile,screenshot,media 'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
```


---


### All Outputs

- `index.json`, `index.html`
- `title.txt`, `title.json`, `headers.json`, `favicon.ico`
- `example.com/*.{html,css,js,png...}`, `warc/` (saved with `wget-lua`)
- `screenshot.png`, `dom.html`, `output.pdf` (rendered with `chrome`)
- `media/someVideo.mp4`, `media/subtitles`, ... (downloaded with `yt-dlp`)
- `readability/`, `mercury/`, `htmltotext.txt` (article text/markdown)
- `git/` (source code)
- ... [and more via plugin library](https://github.com/ArchiveBox/ArchiveBox#output-formats) ...

For more advanced use with collections, parallel downloading, a Web UI + REST API, etc.
See: [`ArchiveBox/ArchiveBox`](https://github.com/ArchiveBox/ArchiveBox)

---


❈ Created by the ArchiveBox team in Emeryville, California. ❈