{"id":37202626,"url":"https://github.com/dir01/scrapeapi","last_synced_at":"2026-01-14T23:22:18.593Z","repository":{"id":310870059,"uuid":"1039042055","full_name":"dir01/scrapeapi","owner":"dir01","description":null,"archived":false,"fork":false,"pushed_at":"2025-08-20T17:21:03.000Z","size":492,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-08-20T19:31:26.633Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dir01.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-16T11:00:40.000Z","updated_at":"2025-08-20T17:21:07.000Z","dependencies_parsed_at":"2025-08-20T19:31:31.986Z","dependency_job_id":"10021789-b62c-4fdf-8d1e-b0e95810a4af","html_url":"https://github.com/dir01/scrapeapi","commit_stats":null,"previous_names":["dir01/scrapeapi"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/dir01/scrapeapi","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dir01%2Fscrapeapi","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dir01%2Fscrapeapi/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dir01%2Fscrapeapi/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dir01%2Fscrapeapi/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dir01","download_url":"https://codeload.github.com/dir01/scrapeapi/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dir01%2Fscrapeapi/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28437981,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T22:37:52.437Z","status":"ssl_error","status_checked_at":"2026-01-14T22:37:31.496Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-14T23:22:12.615Z","updated_at":"2026-01-14T23:22:16.318Z","avatar_url":"https://github.com/dir01.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ScrapeAPI\n\nA FastAPI service that wraps [scrapegraph-ai](https://github.com/VinciGit00/Scrapegraph-ai) for HTTP-based web scraping with AI-powered structured data extraction.\n\n## Features\n\n- 🤖 **AI-Powered Extraction**: Uses LLMs to understand and extract structured data from web pages\n- 📋 **JSON Schema Support**: Define output structure with JSON Schema for type-safe extraction\n- 🌐 **Multiple Graph Types**: Smart scraper, multi-source scraper, and search-based scraping\n- 🎯 **Browser Automation**: Handles dynamic content with Playwright/browser automation\n- 📊 **OpenTelemetry Observability**: Comprehensive tracing and metrics for monitoring\n- 🔧 **Go SDK**: Type-safe Go client library with automatic schema generation\n- 🐳 **Container Ready**: Docker support for easy deployment\n\n\u003e Supports: **SmartScraperGraph** (single page), **SmartScraperMultiGraph** (many pages), and **SearchGraph** (web discovery). Schema can be provided as a **JSON Schema object** with automatic Pydantic conversion.\n\n## Project Structure\n\n- `app/main.py` - FastAPI service with scraping endpoints\n- `app/telemetry.py` - OpenTelemetry configuration and instrumentation\n- `sdk/go/` - Go client library with type-safe schema support\n- `Dockerfile` - Container configuration with Playwright/Chromium\n- `.env.example` - Environment variables template\n\n## Quick Start\n\n### Using Docker\n\n```bash\n# Clone and build\ngit clone \u003crepo-url\u003e\ncd scrapeapi\ndocker build -t scrapeapi .\n\n# Run with OpenAI API key\ndocker run -p 8080:8080 -e OPENAI_API_KEY=sk-your-key scrapeapi\n\n# Test the API\ncurl http://localhost:8080/v1/health\n```\n\n### Local Development\n\n```bash\n# Install dependencies\nuv install\n\n# Set up environment\ncp .env.example .env\n# Edit .env with your API keys\n\n# Run the service\nuv run app/main.py\n```\n\n## API\n\n### Start a job (generic)\n\n`POST /v1/scrape`\n\n**Body** (examples below):\n\n```json\n{\n  \"graph\": \"smart\",\n  \"user_prompt\": \"Extract the product title and price\",\n  \"website_url\": \"https://example.com/product/123\",\n  \"schema\": {\n    \"type\": \"object\",\n    \"properties\": {\n      \"title\": {\"type\": \"string\"},\n      \"price\": {\"type\": \"string\"}\n    },\n    \"required\": [\"title\", \"price\"]\n  },\n  \"llm\": {\"model\": \"openai/gpt-4o-mini\", \"temperature\": 0},\n  \"headless\": true,\n  \"loader_kwargs\": {\"proxy\": {\"server\": \"http://user:pass@proxy:8080\"}},\n  \"timeout_sec\": 120\n}\n```\n\n**Response**:\n\n```json\n{\n  \"request_id\": \"uuid\",\n  \"status\": \"queued\",\n  \"graph\": \"smart\",\n  \"user_prompt\": \"...\",\n  \"website_url\": \"https://example.com/product/123\",\n  \"result\": null,\n  \"error\": \"\"\n}\n```\n\n### Poll a job\n\n`GET /v1/scrape/{request_id}`\n\n```json\n{\n  \"request_id\": \"uuid\",\n  \"status\": \"completed\",\n  \"graph\": \"smart\",\n  \"user_prompt\": \"...\",\n  \"result\": {\n    \"data\": {\"title\": \"...\", \"price\": \"...\"},\n    \"schema_validation\": {\"ok\": true}\n  },\n  \"error\": \"\"\n}\n```\n\n### Aliases (optional)\n\n* `POST /v1/smartscraper` (same as `/v1/scrape` with `graph=smart`)\n* `GET /v1/smartscraper/{request_id}` (poll)\n\n---\n\n## Example Requests\n\n### 1) **Smart** (single page)\n\n```bash\ncurl -s -X POST http://localhost:8080/v1/scrape \\\n -H 'Content-Type: application/json' \\\n -d '{\n  \"graph\": \"smart\",\n  \"user_prompt\": \"Extract job: title, company, location, apply_url\",\n  \"website_url\": \"https://boards.greenhouse.io/example/jobs/123\",\n  \"schema\": {\n    \"type\": \"object\",\n    \"properties\": {\n      \"title\": {\"type\": \"string\"},\n      \"company\": {\"type\": \"string\"},\n      \"location\": {\"type\": \"string\"},\n      \"apply_url\": {\"type\": \"string\", \"format\": \"uri\"}\n    },\n    \"required\": [\"title\", \"apply_url\"]\n  },\n  \"llm\": {\"model\": \"openai/gpt-4o-mini\", \"temperature\": 0},\n  \"headless\": true\n }'\n```\n\n### 2) **Multi** (many pages)\n\n```bash\ncurl -s -X POST http://localhost:8080/v1/scrape \\\n -H 'Content-Type: application/json' \\\n -d '{\n  \"graph\": \"multi\",\n  \"user_prompt\": \"For each page, extract a single job with title, company, location, apply_url\",\n  \"sources\": [\n    \"https://boards.greenhouse.io/example/jobs/123\",\n    \"https://jobs.lever.co/example/456\"\n  ],\n  \"schema\": {\"type\":\"object\",\"properties\":{\"jobs\":{\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"},\"company\":{\"type\":\"string\"},\"location\":{\"type\":\"string\"},\"apply_url\":{\"type\":\"string\",\"format\":\"uri\"}}}}}},\n  \"llm\": {\"model\": \"openai/gpt-4o-mini\", \"temperature\": 0},\n  \"headless\": true,\n  \"max_results\": 20\n }'\n```\n\n### 3) **Search** (discovery)\n\n```bash\ncurl -s -X POST http://localhost:8080/v1/scrape \\\n -H 'Content-Type: application/json' \\\n -d '{\n  \"graph\": \"search\",\n  \"user_prompt\": \"Find recent postings for Go developer (remote or Tbilisi) and return list of job page URLs\",\n  \"schema\": {\"type\":\"object\",\"properties\":{\"jobs\":{\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"},\"source_url\":{\"type\":\"string\",\"format\":\"uri\"}}}}}},\n  \"llm\": {\"model\": \"openai/gpt-4o-mini\", \"temperature\": 0}\n }'\n```\n\n---\n\n## Notes \u0026 Tips\n\n* **Schema input**: You can pass either a *JSON Schema object* (validated if `jsonschema` is installed) or a *JSON-like example string*. If omitted, scrapegraph-ai decides the shape.\n* **Providers**: Set `llm` according to your provider (OpenAI, Anthropic, Mistral, Google, Together, or local **Ollama**). Keys can come from env vars or inline via `llm:{ api_key: ... }`.\n* **JS pages**: Enable Chromium by setting `headless: true` and optionally `loader_kwargs.proxy` for geo/rate limits.\n* **Raw HTML**: If you can’t hit the URL, send `website_html`; the server writes a temp `.html` file and scrapes that.\n* **Persistence**: This demo stores jobs in memory. Swap `JOBS` for Redis/Postgres for production.\n* **Timeouts**: Control with `timeout_sec` per request.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdir01%2Fscrapeapi","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdir01%2Fscrapeapi","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdir01%2Fscrapeapi/lists"}