# mx-scraper

Download image galleries or metadata across the web

> This rewrite is expected to support the previous implementation's metadata format.
>
> The main idea was to separate the core (mx-scraper) from the user-defined
> plugins, which was not possible in previous implementations.

# Usage

```bash
# The `images` plugin relies on bs4, so install it first.
# For example, this downloads every image link on the page using the `images` plugin.
pip install beautifulsoup4
mx-scraper fetch https://www.google.com --plugin images -v

# Alternatively, each term in a sequence of non-uniform terms can be prefixed
# with the plugin that should handle it. Prefixing is usually unnecessary, but
# it is required for generic terms (such as ids or names).
# How each term is parsed depends on the plugin implementation.
mx-scraper fetch --meta-only -v img:https://www.google.com to:https://mto.to/series/68737

mx-scraper fetch --meta-only -v https://twitter.com/imigimuru/status/1829913427373953259
```

## Commands

```bash
mx-scraper engine

Usage: mx-scraper

Commands:
  fetch        Fetch a sequence of terms
  fetch-files  Fetch a sequence of terms from a collection of files
  request      Request a url
  infos        Display various informations
  server       Spawn a graphql server
  help         Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help
```

Each fetch strategy shares the same configuration.

# Features

- [x] CLI
  - [x] Fetch a list of terms
  - [x] Fetch a list of terms from a collection of files
  - [x] Generic URL Request
  - [x] Print as text
  - [x] Download `--dest` flag
  - [x] Authentications (Basic, Bearer token)

- [x] Cookies
  - [x] Loading from a file (Netscape format, key-value)
  - [x] Loading from the config (key-value)

- [x] Downloader
  - [x] Support of older mx-scraper book schema
  - [x] Download
  - [x] Cache support (can be disabled with `--no-cache` or from config)

- [ ] Plugins
  - [x] Python plugin (see the sketch below)
  - [x] `MxRequest` with runtime context (headers, cookies, auth)
  - [x] gallery-dl extractors
  - [ ] Subprocess (e.g. imgbrd-grabber)

- [ ] HtmlParser (optional feature)
  - [ ] Implement `HtmlParser.use(source).where('attr.href = ..')`
  - [ ] Wrap into a python class
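
The plugin API is not documented in this README, so the sketch below is only an
illustration of what a user-defined Python plugin could look like. Apart from
`MxRequest` (named in the feature list) and bs4 (used by the `images` plugin in
the usage section), everything here is an assumption: the `fetch` entry point,
the returned metadata keys, and the `request.get` call are hypothetical
placeholders, not actual mx-scraper names.

```python
# Hypothetical plugin sketch, for illustration only.
# `fetch`, the returned keys, and `request.get` are assumed names;
# check the real plugin examples for the actual interface.
from bs4 import BeautifulSoup


def fetch(term: str, request) -> dict:
    """Resolve a term into metadata plus a list of image URLs.

    `request` stands in for the runtime `MxRequest` context
    (headers, cookies, auth) that mx-scraper passes to plugins.
    """
    html = request.get(term)  # assumed to return the page body as text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.title.string if soup.title else term,
        "urls": [img["src"] for img in soup.find_all("img", src=True)],
    }
```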

# GraphQL server

You can also use the extractors through GraphQL queries, with the same options
as the command-line interface.

```bash
Usage: mx-scraper server [OPTIONS]

Options:
      --port      Server port
  -h, --help      Print help
```
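
For a rough feel of how a query might be sent programmatically, here is a
minimal Python sketch. The port, the `/graphql` path, and the `fetch`-style
query with its fields are all assumptions, not the actual schema; the
playground (screenshot below) exposes the real one.

```python
# Illustrative only: the endpoint path, port, query name and fields below
# are guesses; inspect the schema in the GraphQL playground before relying on them.
import json
import urllib.request

QUERY = """
query Fetch($term: String!) {
  fetch(term: $term) {
    title
    urls
  }
}
"""

payload = json.dumps({
    "query": QUERY,
    "variables": {"term": "https://www.google.com"},
}).encode()

req = urllib.request.Request(
    "http://localhost:3000/graphql",  # assumed host, port and path
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```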

![Playground Screenshot](static/server.png "Screenshot")