https://github.com/futureg-lab/mx-scraper
Download image galleries or metadata across the web
- Host: GitHub
- URL: https://github.com/futureg-lab/mx-scraper
- Owner: futureg-lab
- License: MIT
- Created: 2024-08-10T11:40:55.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-01T16:49:06.000Z (11 months ago)
- Last Synced: 2025-02-01T17:32:24.259Z (11 months ago)
- Topics: beautifulsoup4, cli, cli-application, downloader, graphql-server, image-gallery, metadata-extraction, python, rust
- Language: Rust
- Homepage:
- Size: 409 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# mx-scraper
Download image galleries or metadata across the web
> This rewrite is expected to support the previous implementation's metadata
> format.
>
> The main idea was to separate the core (mx-scraper) from user-defined
> plugins, which was not possible with previous implementations.
# Usage
```bash
# The `images` plugin relies on bs4, so install it first
pip install beautifulsoup4

# Download all image links from a page using the `images` plugin
mx-scraper fetch https://www.google.com --plugin images -v

# For a sequence of non-uniform terms, prefixing each term is often required.
# Prefixes are generally unnecessary, but they are required for generic terms
# (like ids or names). How each term is parsed depends on the plugin implementation.
mx-scraper fetch --meta-only -v img:https://www.google.com to:https://mto.to/series/68737
mx-scraper fetch --meta-only -v https://twitter.com/imigimuru/status/1829913427373953259
```
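The core job of an image-collecting plugin is gathering every `<img>` link from a page. The real `images` plugin uses beautifulsoup4 and mx-scraper's plugin API, neither of which is shown here; this is a standard-library-only sketch of the same idea, and `extract_image_links` is a hypothetical helper, not part of mx-scraper:

```python
from html.parser import HTMLParser


class ImageLinkCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.links.append(src)


def extract_image_links(html: str) -> list[str]:
    # Hypothetical helper for illustration only
    collector = ImageLinkCollector()
    collector.feed(html)
    return collector.links


print(extract_image_links('<p><img src="a.png"><img src="b.jpg"></p>'))
# → ['a.png', 'b.jpg']
```

With beautifulsoup4 installed, the equivalent one-liner would be `[img["src"] for img in BeautifulSoup(html, "html.parser").find_all("img", src=True)]`.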
## Commands
```bash
mx-scraper engine
Usage: mx-scraper

Commands:
  fetch        Fetch a sequence of terms
  fetch-files  Fetch a sequence of terms from a collection of files
  request      Request a url
  infos        Display various informations
  server       Spawn a graphql server
  help         Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help
```
Each fetch strategy shares the same configuration.
# Features
- [x] CLI
  - [x] Fetch a list of terms
  - [x] Fetch a list of terms from a collection of files
  - [x] Generic URL Request
    - [x] Print as text
    - [x] Download `--dest` flag
- [x] Authentications (Basic, Bearer token)
- [x] Cookies
  - [x] Loading from a file (Netscape format, key-value)
  - [x] Loading from the config (key-value)
- [x] Downloader
  - [x] Support of older mx-scraper book schema
  - [x] Download
  - [x] Cache support (can be disabled with `--no-cache` or from config)
- [ ] Plugins
  - [x] Python plugin
    - [x] `MxRequest` with runtime context (headers, cookies, auth)
  - [x] gallery-dl extractors
  - [ ] Subprocess (e.g. imgbrd-grabber)
  - [ ] HtmlParser (optional feature)
    - [ ] Implement `HtmlParser.use(source).where('attr.href = ..')`
    - [ ] Wrap into a python class
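The Netscape cookie format mentioned above is the tab-separated `cookies.txt` layout exported by many browser extensions and understood by curl. mx-scraper's own loader is written in Rust and is not shown here; as an illustration of what "loading from a file" involves, Python's standard library can parse the same format, with `load_netscape_cookies` being a hypothetical helper:

```python
from http.cookiejar import MozillaCookieJar


def load_netscape_cookies(path: str) -> dict[str, str]:
    """Load a Netscape-format cookies.txt into a simple name -> value map.

    Hypothetical helper for illustration; not part of mx-scraper's API.
    """
    jar = MozillaCookieJar(path)
    # ignore_expires keeps entries even when their expiry is in the past
    jar.load(ignore_discard=True, ignore_expires=True)
    return {cookie.name: cookie.value for cookie in jar}
```

The file must start with the `# Netscape HTTP Cookie File` header line, followed by one tab-separated entry per cookie: domain, include-subdomains flag, path, secure flag, expiry timestamp, name, value.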
# GraphQL server
You can also use the extractors through GraphQL queries, with the same options
as the command-line interface.
```bash
Usage: mx-scraper server [OPTIONS]

Options:
      --port  Server port
  -h, --help  Print help
```
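A GraphQL request is just an HTTP POST whose JSON body carries a `query` string and a `variables` object. The query shape below is an assumption for illustration (the actual mx-scraper schema is not documented in this README); the sketch only builds the payload you would POST to the running server:

```python
import json

# Assumed query shape -- the real mx-scraper GraphQL schema may differ.
QUERY = """
query Fetch($term: String!) {
  fetch(term: $term) {
    title
    urls
  }
}
"""


def build_payload(term: str) -> str:
    """Serialize a GraphQL request body for a single fetch term."""
    return json.dumps({"query": QUERY, "variables": {"term": term}})


payload = build_payload("https://www.google.com")
# POST it with any HTTP client, e.g.:
#   curl -X POST http://localhost:<port>/graphql \
#        -H 'Content-Type: application/json' -d "$PAYLOAD"
print(payload)
```

The endpoint path and port are whatever the `server` subcommand reports on startup; `--port` sets the latter.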
