https://github.com/testy-cool/scrape-gateway

Unified scraping gateway — 7 providers, cheapest-first routing, content validation, domain memory. One sgw url call, it figures out the rest.

# scrape-gateway (`sgw`)

[![ci](https://github.com/testy-cool/scrape-gateway/actions/workflows/ci.yml/badge.svg)](https://github.com/testy-cool/scrape-gateway/actions/workflows/ci.yml)
[![version](https://img.shields.io/badge/version-0.3.0-blue)](https://github.com/testy-cool/scrape-gateway/releases/latest)
[![license](https://img.shields.io/badge/license-Apache--2.0-green)](LICENSE)


Demo: free providers fail, a paid provider succeeds, and the next run remembers it.

One command, seven providers. Free ones tried first, paid ones only when needed. Domain memory skips the trial-and-error on repeat visits.

## Quick start

```bash
git clone https://github.com/testy-cool/scrape-gateway.git
cd scrape-gateway
pip install -e .
cp .env.example .env # add API keys (optional — free providers work without any)

sgw selftest # verify installation
sgw url https://example.com
```

## Commands

| Command | What it does |
|---|---|
| `sgw url` | Scrape one page through the provider chain |
| `sgw extract` | Pull structured data (JSON/CSV) from listing pages |
| `sgw detect` | Recon: find repeated elements before extracting |
| `sgw links` | Index all links on a page |
| `sgw follow` | Scrape link #n from a page |
| `sgw recipe` | Replay a saved YAML workflow |
| `sgw run` | Batch scrape URLs from a text file |
| `sgw meta` | Extract OpenGraph metadata as JSON |
| `sgw history` | Show scrape timeline and page changes |
| `sgw telemetry` | Inspect recent scrape reports |
| `sgw providers` | List all available providers |
| `sgw extensions` | Browse/install community extensions |
| `sgw selftest` | Verify installation with known-safe sites |

Full usage and examples: [docs/commands.md](docs/commands.md)

## Providers

7 built-in, 3 free. Router tries cheapest first.

| Provider | Cost | JS | Geo | Anti-bot |
|---|---|---|---|---|
| `raw_http` | free | no | no | none |
| `wreq` | free | no | no | TLS fingerprinting |
| `curl_cffi` | free | no | no | TLS fingerprinting |
| `scrapedrive` | paid | yes | yes | full (3 tiers) |
| `scrape_do` | paid | yes | yes | residential proxies |
| `scrapingbee` | paid | yes | yes | premium proxies |
| `scraperapi` | paid | yes | yes | premium proxies |

Add API keys in `.env` to enable paid providers. Without them, `sgw` uses free providers only.
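
To make the routing idea concrete, here is a minimal sketch of cheapest-first routing with a per-domain memory. It is not `sgw`'s actual implementation: the `Provider`, `Router`, and `looks_valid` names, the cost numbers, and the "captcha" check are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlparse


@dataclass
class Provider:  # hypothetical type, not part of sgw
    name: str
    cost: float                            # relative cost; 0.0 means free
    fetch: Callable[[str], Optional[str]]  # returns HTML, or None on failure


def looks_valid(html: Optional[str]) -> bool:
    """Toy content check (an assumption): non-empty and not a block page."""
    return bool(html) and "captcha" not in html.lower()


class Router:  # hypothetical class, not part of sgw
    def __init__(self, providers: list[Provider]):
        # Cheapest first; sorted() is stable, so ties keep their given order.
        self.providers = sorted(providers, key=lambda p: p.cost)
        self.memory: dict[str, str] = {}   # domain -> provider that last worked

    def scrape(self, url: str) -> tuple[str, str]:
        domain = urlparse(url).netloc
        remembered = self.memory.get(domain)
        # Try the remembered provider first, then the rest in cost order.
        order = sorted(self.providers, key=lambda p: p.name != remembered)
        for provider in order:
            html = provider.fetch(url)
            if looks_valid(html):
                self.memory[domain] = provider.name
                return provider.name, html
        raise RuntimeError(f"all providers failed for {url}")


# Demo with fake fetchers: the free one is blocked, the paid one works.
calls = []

def free_fetch(url):
    calls.append("free")
    return "<html>Please solve this CAPTCHA</html>"

def paid_fetch(url):
    calls.append("paid")
    return "<html><h1>Product list</h1></html>"

router = Router([Provider("paid", 1.0, paid_fetch),
                 Provider("free", 0.0, free_fetch)])
router.scrape("https://shop.example/items")  # tries free, then paid
router.scrape("https://shop.example/other")  # goes straight to paid
print(calls)
```

In this sketch the first request pays for one failed free attempt, while the second request to the same domain skips it. A real router would also need to expire remembered entries, since a site's defenses can change over time.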

## Extend it

Drop a `.py` file in `~/.config/scrape-gateway/providers/` or install from the registry with `sgw extensions`. See [docs/extensions.md](docs/extensions.md).

## Python API

```python
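import asyncio
from scrape_gateway import ScrapeGateway, ScrapeRequest

async def main():
    # `await` needs a running event loop, so the call lives in a coroutine
    gw = ScrapeGateway.from_config()
    return await gw.scrape(ScrapeRequest("https://example.com"))

result = asyncio.run(main())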
```

More: [docs/python-api.md](docs/python-api.md)

## Docs

- [Commands](docs/commands.md) — full reference with examples
- [Architecture](docs/architecture.md) — how the router, cache, and memory work
- [Configuration](docs/configuration.md) — YAML config and `.env` setup
- [Extensions](docs/extensions.md) — writing custom providers
- [Python API](docs/python-api.md) — using sgw as a library
- [Providers](docs/providers.md) — provider details and API mapping