# ScrapeGraphAI Python SDK
[PyPI](https://badge.fury.io/py/scrapegraph-py)
[License: MIT](https://opensource.org/licenses/MIT)
Official Python SDK for the [ScrapeGraphAI API](https://scrapegraphai.com).
## Install
```bash
pip install scrapegraph-py
# or
uv add scrapegraph-py
```
## Quick Start
```python
from scrapegraph_py import ScrapeGraphAI, ScrapeRequest
# reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI(api_key="...")
sgai = ScrapeGraphAI()
result = sgai.scrape(ScrapeRequest(
    url="https://example.com",
))
if result.status == "success":
    print(result.data["results"]["markdown"]["data"])
else:
    print(result.error)
```
Every method returns `ApiResult[T]` — no exceptions to catch:
```python
@dataclass
class ApiResult(Generic[T]):
    status: Literal["success", "error"]
    data: T | None
    error: str | None
    elapsed_ms: int
```
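Because failures come back as values rather than exceptions, result handling can be centralized in a small helper. A minimal sketch (`unwrap` is not part of the SDK, just an illustration of the pattern):
```python
from scrapegraph_py import ScrapeGraphAI, ScrapeRequest

def unwrap(result):
    """Return the payload on success, otherwise log the error and return None."""
    if result.status == "success":
        return result.data
    print(f"request failed after {result.elapsed_ms} ms: {result.error}")
    return None

sgai = ScrapeGraphAI()
data = unwrap(sgai.scrape(ScrapeRequest(url="https://example.com")))
if data is not None:
    print(data["results"]["markdown"]["data"])
```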
## API
### scrape
Scrape a webpage in multiple formats (markdown, HTML, screenshot, JSON, etc.).
```python
from scrapegraph_py import (
    ScrapeGraphAI, ScrapeRequest, FetchConfig,
    MarkdownFormatConfig, ScreenshotFormatConfig, JsonFormatConfig,
)
sgai = ScrapeGraphAI()
res = sgai.scrape(ScrapeRequest(
    url="https://example.com",
    formats=[
        MarkdownFormatConfig(mode="reader"),
        ScreenshotFormatConfig(full_page=True, width=1440, height=900),
        JsonFormatConfig(prompt="Extract product info"),
    ],
    content_type="text/html",  # optional, auto-detected
    fetch_config=FetchConfig(  # optional
        mode="js",             # "auto" | "fast" | "js"
        stealth=True,
        timeout=30000,
        wait=2000,
        scrolls=3,
        headers={"Accept-Language": "en"},
        cookies={"session": "abc"},
        country="us",
    ),
))
```
**Formats:**
- `markdown` — Clean markdown (modes: `normal`, `reader`, `prune`)
- `html` — Raw HTML (modes: `normal`, `reader`, `prune`)
- `links` — All links on the page
- `images` — All image URLs
- `summary` — AI-generated summary
- `json` — Structured extraction with prompt/schema
- `branding` — Brand colors, typography, logos
- `screenshot` — Page screenshot (full_page, width, height, quality)
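Each requested format comes back under `result.data["results"]`. A minimal access sketch, assuming the results dictionary is keyed by format name as in the Quick Start markdown example (the per-format payload shapes are not documented here):
```python
from scrapegraph_py import ScrapeGraphAI, ScrapeRequest, MarkdownFormatConfig, ScreenshotFormatConfig

sgai = ScrapeGraphAI()
res = sgai.scrape(ScrapeRequest(
    url="https://example.com",
    formats=[MarkdownFormatConfig(), ScreenshotFormatConfig(full_page=True)],
))
if res.status == "success":
    results = res.data["results"]
    print(results.keys())                     # assumption: one key per requested format
    print(results["markdown"]["data"][:200])  # markdown payload, as in the Quick Start
```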
### extract
Extract structured data from a URL, HTML, or markdown using AI.
```python
from scrapegraph_py import ScrapeGraphAI, ExtractRequest, FetchConfig
sgai = ScrapeGraphAI()
res = sgai.extract(ExtractRequest(
    url="https://example.com",
    prompt="Extract product names and prices",
    schema={"type": "object", "properties": {...}},  # optional
    mode="reader",                                   # optional
    fetch_config=FetchConfig(...),                   # optional
))
# Or pass html/markdown directly instead of url
```
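The comment above mentions passing HTML or markdown directly instead of a URL. A sketch of that, assuming the request fields are named `html` and `markdown` (the field names are an assumption, not confirmed here):
```python
from scrapegraph_py import ScrapeGraphAI, ExtractRequest

sgai = ScrapeGraphAI()
html = "<html><body><h1>Widget</h1><p>$19.99</p></body></html>"
res = sgai.extract(ExtractRequest(
    html=html,  # assumption: field name `html`; `markdown` should work analogously
    prompt="Extract the product name and price",
))
if res.status == "success":
    print(res.data)
```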
### search
Search the web and optionally extract structured data.
```python
from scrapegraph_py import ScrapeGraphAI, SearchRequest, FetchConfig
sgai = ScrapeGraphAI()
res = sgai.search(SearchRequest(
    query="best programming languages 2024",
    num_results=5,                   # 1-20, default 3
    format="markdown",               # "markdown" | "html"
    prompt="Extract key points",     # optional, for AI extraction
    schema={...},                    # optional
    time_range="past_week",          # optional
    location_geo_code="us",          # optional
    fetch_config=FetchConfig(...),   # optional
))
```
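As in the extract example, `schema` takes a JSON-Schema-style dict. A sketch with a concrete schema (the schema contents here are purely illustrative):
```python
from scrapegraph_py import ScrapeGraphAI, SearchRequest

# Illustrative JSON Schema describing the structured output we want back.
language_schema = {
    "type": "object",
    "properties": {
        "languages": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "reason": {"type": "string"},
                },
            },
        },
    },
}

sgai = ScrapeGraphAI()
res = sgai.search(SearchRequest(
    query="best programming languages 2024",
    prompt="List the languages mentioned and why they are recommended",
    schema=language_schema,
))
if res.status == "success":
    print(res.data)
```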
### crawl
Crawl a website and its linked pages.
```python
from scrapegraph_py import ScrapeGraphAI, CrawlRequest, MarkdownFormatConfig, FetchConfig
sgai = ScrapeGraphAI()
# Start a crawl
start = sgai.crawl.start(CrawlRequest(
    url="https://example.com",
    formats=[MarkdownFormatConfig()],
    max_pages=50,
    max_depth=2,
    max_links_per_page=10,
    include_patterns=["/blog/*"],
    exclude_patterns=["/admin/*"],
    fetch_config=FetchConfig(...),
))
# Check status
crawl_id = start.data["id"]
status = sgai.crawl.get(crawl_id)
# Control
sgai.crawl.stop(crawl_id)
sgai.crawl.resume(crawl_id)
sgai.crawl.delete(crawl_id)
```
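Crawl jobs run server-side, so a typical pattern is to poll `crawl.get` until the job reaches a terminal state. A minimal polling sketch; the exact status values inside the crawl payload are an assumption:
```python
import time
from scrapegraph_py import ScrapeGraphAI, CrawlRequest

sgai = ScrapeGraphAI()
start = sgai.crawl.start(CrawlRequest(url="https://example.com", max_pages=10))
crawl_id = start.data["id"]

while True:
    status = sgai.crawl.get(crawl_id)
    if status.status == "error":
        print(status.error)
        break
    # Assumption: the payload exposes a "status" field with terminal
    # values such as "completed" or "failed".
    if status.data.get("status") in ("completed", "failed"):
        print(status.data)
        break
    time.sleep(5)
```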
### monitor
Monitor a webpage for changes on a schedule.
```python
from scrapegraph_py import (
    ScrapeGraphAI, MonitorCreateRequest, MonitorUpdateRequest,
    MarkdownFormatConfig, FetchConfig,
)
sgai = ScrapeGraphAI()
# Create a monitor
mon = sgai.monitor.create(MonitorCreateRequest(
    url="https://example.com",
    name="Price Monitor",
    interval="0 * * * *",            # cron expression
    formats=[MarkdownFormatConfig()],
    webhook_url="https://...",       # optional
    fetch_config=FetchConfig(...),
))
# Manage monitors (cron_id comes from the create response)
sgai.monitor.list()
sgai.monitor.get(cron_id)
sgai.monitor.update(cron_id, MonitorUpdateRequest(interval="0 */6 * * *"))
sgai.monitor.pause(cron_id)
sgai.monitor.resume(cron_id)
sgai.monitor.delete(cron_id)
```
### history
Fetch request history.
```python
from scrapegraph_py import ScrapeGraphAI, HistoryFilter
sgai = ScrapeGraphAI()
history = sgai.history.list(HistoryFilter(
    service="scrape",  # optional filter
    page=1,
    limit=20,
))
entry = sgai.history.get("request-id")
```
### credits / health
```python
from scrapegraph_py import ScrapeGraphAI
sgai = ScrapeGraphAI()
credits = sgai.credits()
# { remaining: 1000, used: 500, plan: "pro", jobs: { crawl: {...}, monitor: {...} } }
health = sgai.health()
# { status: "ok", uptime: 12345 }
```
## Async Client
All methods have async equivalents via `AsyncScrapeGraphAI`:
```python
import asyncio
from scrapegraph_py import AsyncScrapeGraphAI, ScrapeRequest
async def main():
    async with AsyncScrapeGraphAI() as sgai:
        result = await sgai.scrape(ScrapeRequest(url="https://example.com"))
        if result.status == "success":
            print(result.data["results"]["markdown"]["data"])
        else:
            print(result.error)

asyncio.run(main())
```
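Since each call is awaited, several requests can run concurrently with `asyncio.gather`. A short sketch, assuming the client supports concurrent in-flight requests (the URLs are placeholders):
```python
import asyncio
from scrapegraph_py import AsyncScrapeGraphAI, ScrapeRequest

async def scrape_many(urls):
    async with AsyncScrapeGraphAI() as sgai:
        # Fire all requests concurrently and wait for every result.
        results = await asyncio.gather(
            *(sgai.scrape(ScrapeRequest(url=u)) for u in urls)
        )
    for url, res in zip(urls, results):
        print(url, res.status)

asyncio.run(scrape_many(["https://example.com", "https://example.org"]))
```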
### Async Extract
```python
async with AsyncScrapeGraphAI() as sgai:
    res = await sgai.extract(ExtractRequest(
        url="https://example.com",
        prompt="Extract product names and prices",
    ))
```
### Async Search
```python
async with AsyncScrapeGraphAI() as sgai:
    res = await sgai.search(SearchRequest(
        query="best programming languages 2024",
        num_results=5,
    ))
```
### Async Crawl
```python
async with AsyncScrapeGraphAI() as sgai:
    start = await sgai.crawl.start(CrawlRequest(
        url="https://example.com",
        max_pages=50,
    ))
    status = await sgai.crawl.get(start.data["id"])
```
### Async Monitor
```python
async with AsyncScrapeGraphAI() as sgai:
    mon = await sgai.monitor.create(MonitorCreateRequest(
        url="https://example.com",
        name="Price Monitor",
        interval="0 * * * *",
    ))
```
## Examples
### Sync Examples
| Service | Example | Description |
|---------|---------|-------------|
| scrape | [`scrape_basic.py`](examples/scrape/scrape_basic.py) | Basic markdown scraping |
| scrape | [`scrape_multi_format.py`](examples/scrape/scrape_multi_format.py) | Multiple formats |
| scrape | [`scrape_json_extraction.py`](examples/scrape/scrape_json_extraction.py) | Structured JSON extraction |
| scrape | [`scrape_pdf.py`](examples/scrape/scrape_pdf.py) | PDF document parsing |
| scrape | [`scrape_with_fetchconfig.py`](examples/scrape/scrape_with_fetchconfig.py) | JS rendering, stealth mode |
| extract | [`extract_basic.py`](examples/extract/extract_basic.py) | AI data extraction |
| extract | [`extract_with_schema.py`](examples/extract/extract_with_schema.py) | Extraction with JSON schema |
| search | [`search_basic.py`](examples/search/search_basic.py) | Web search |
| search | [`search_with_extraction.py`](examples/search/search_with_extraction.py) | Search + AI extraction |
| crawl | [`crawl_basic.py`](examples/crawl/crawl_basic.py) | Start and monitor a crawl |
| crawl | [`crawl_with_formats.py`](examples/crawl/crawl_with_formats.py) | Crawl with formats |
| monitor | [`monitor_basic.py`](examples/monitor/monitor_basic.py) | Create a page monitor |
| monitor | [`monitor_with_webhook.py`](examples/monitor/monitor_with_webhook.py) | Monitor with webhook |
| utilities | [`credits.py`](examples/utilities/credits.py) | Check credits and limits |
| utilities | [`health.py`](examples/utilities/health.py) | API health check |
| utilities | [`history.py`](examples/utilities/history.py) | Request history |
### Async Examples
| Service | Example | Description |
|---------|---------|-------------|
| scrape | [`scrape_basic_async.py`](examples/scrape/scrape_basic_async.py) | Basic markdown scraping |
| scrape | [`scrape_multi_format_async.py`](examples/scrape/scrape_multi_format_async.py) | Multiple formats |
| scrape | [`scrape_json_extraction_async.py`](examples/scrape/scrape_json_extraction_async.py) | Structured JSON extraction |
| scrape | [`scrape_pdf_async.py`](examples/scrape/scrape_pdf_async.py) | PDF document parsing |
| scrape | [`scrape_with_fetchconfig_async.py`](examples/scrape/scrape_with_fetchconfig_async.py) | JS rendering, stealth mode |
| extract | [`extract_basic_async.py`](examples/extract/extract_basic_async.py) | AI data extraction |
| extract | [`extract_with_schema_async.py`](examples/extract/extract_with_schema_async.py) | Extraction with JSON schema |
| search | [`search_basic_async.py`](examples/search/search_basic_async.py) | Web search |
| search | [`search_with_extraction_async.py`](examples/search/search_with_extraction_async.py) | Search + AI extraction |
| crawl | [`crawl_basic_async.py`](examples/crawl/crawl_basic_async.py) | Start and monitor a crawl |
| crawl | [`crawl_with_formats_async.py`](examples/crawl/crawl_with_formats_async.py) | Crawl with formats |
| monitor | [`monitor_basic_async.py`](examples/monitor/monitor_basic_async.py) | Create a page monitor |
| monitor | [`monitor_with_webhook_async.py`](examples/monitor/monitor_with_webhook_async.py) | Monitor with webhook |
| utilities | [`credits_async.py`](examples/utilities/credits_async.py) | Check credits and limits |
| utilities | [`health_async.py`](examples/utilities/health_async.py) | API health check |
| utilities | [`history_async.py`](examples/utilities/history_async.py) | Request history |
## Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `SGAI_API_KEY` | Your ScrapeGraphAI API key | — |
| `SGAI_API_URL` | Override API base URL | `https://v2-api.scrapegraphai.com/api` |
| `SGAI_DEBUG` | Enable debug logging (`"1"`) | off |
| `SGAI_TIMEOUT` | Request timeout in seconds | `120` |
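A small sketch of configuring the client through these variables from Python, assuming they are read when the client is constructed (the Quick Start comment confirms this for `SGAI_API_KEY`); exporting them in the shell works the same way:
```python
import os
from scrapegraph_py import ScrapeGraphAI

os.environ["SGAI_API_KEY"] = "your-api-key"  # placeholder; or export it in your shell
os.environ["SGAI_DEBUG"] = "1"               # enable debug logging
os.environ["SGAI_TIMEOUT"] = "60"            # request timeout in seconds

sgai = ScrapeGraphAI()  # picks up the variables above
```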
## Development
```bash
uv sync
uv run pytest tests/ # unit tests
uv run pytest tests/test_integration.py # live API tests (requires SGAI_API_KEY)
uv run ruff check . # lint
```
## License
MIT - [ScrapeGraphAI](https://scrapegraphai.com)