{"id":13419265,"url":"https://github.com/jamesturk/scrapeghost","last_synced_at":"2025-05-15T05:04:13.482Z","repository":{"id":143585313,"uuid":"615557134","full_name":"jamesturk/scrapeghost","owner":"jamesturk","description":"👻 Experimental library for scraping websites using OpenAI's GPT API.","archived":false,"fork":false,"pushed_at":"2024-10-09T23:20:26.000Z","size":1745,"stargazers_count":1432,"open_issues_count":7,"forks_count":86,"subscribers_count":18,"default_branch":"main","last_synced_at":"2025-04-03T16:03:57.105Z","etag":null,"topics":["gpt","openai-api","webscraping"],"latest_commit_sha":null,"homepage":"https://jamesturk.github.io/scrapeghost/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jamesturk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE.md","code_of_conduct":"docs/code_of_conduct.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["jamesturk"]}},"created_at":"2023-03-18T01:37:10.000Z","updated_at":"2025-03-30T13:44:38.000Z","dependencies_parsed_at":"2023-11-24T23:29:23.648Z","dependency_job_id":"fb2b216b-f182-45a4-a8a8-c62c9e95c56b","html_url":"https://github.com/jamesturk/scrapeghost","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jamesturk%2Fscrapeghost","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jamesturk%2Fscrapeghost/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jamesturk%2Fscrapeghost/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/jamesturk%2Fscrapeghost/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jamesturk","download_url":"https://codeload.github.com/jamesturk/scrapeghost/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248313547,"owners_count":21082876,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpt","openai-api","webscraping"],"created_at":"2024-07-30T22:01:13.595Z","updated_at":"2025-04-10T23:24:41.205Z","avatar_url":"https://github.com/jamesturk.png","language":"Python","funding_links":["https://github.com/sponsors/jamesturk"],"categories":["Python","Openai"],"sub_categories":[],"readme":"# scrapeghost\n\n![scrapeghost logo](docs/assets/scrapeghost.png)\n\n`scrapeghost` is an experimental library for scraping websites using OpenAI's GPT.\n\nSource: [https://github.com/jamesturk/scrapeghost](https://github.com/jamesturk/scrapeghost)\n\nDocumentation: [https://jamesturk.github.io/scrapeghost/](https://jamesturk.github.io/scrapeghost/)\n\nIssues: [https://github.com/jamesturk/scrapeghost/issues](https://github.com/jamesturk/scrapeghost/issues)\n\n[![PyPI badge](https://badge.fury.io/py/scrapeghost.svg)](https://badge.fury.io/py/scrapeghost)\n[![Test badge](https://github.com/jamesturk/scrapeghost/workflows/Test%20\u0026%20Lint/badge.svg)](https://github.com/jamesturk/scrapeghost/actions?query=workflow%3A%22Test+%26+Lint%22)\n\n**Use at your own risk. This library makes considerably expensive calls ($0.36 for a GPT-4 call on a moderately sized page.) 
Cost estimates are based on the [OpenAI pricing page](https://beta.openai.com/pricing) and are not guaranteed to be accurate.**\n\n![](screenshot.png)\n\n## Features\n\nThe purpose of this library is to provide a convenient interface for exploring web scraping with GPT.\n\nWhile the bulk of the work is done by the GPT model, `scrapeghost` provides a number of features to make it easier to use.\n\n**Python-based schema definition** - Define the shape of the data you want to extract as any Python object, with as much or as little detail as you want.\n\n**Preprocessing**\n\n* **HTML cleaning** - Remove unnecessary HTML to reduce the size and cost of API requests.\n* **CSS and XPath selectors** - Pre-filter HTML by writing a single CSS or XPath selector.\n* **Auto-splitting** - Optionally split the HTML into multiple calls to the model, allowing for larger pages to be scraped.\n\n**Postprocessing**\n\n* **JSON validation** - Ensure that the response is valid JSON. (With the option to kick it back to GPT for fixes if it's not.)\n* **Schema validation** - Go a step further and use a [`pydantic`](https://pydantic-docs.helpmanual.io/) schema to validate the response.\n* **Hallucination check** - Does the data in the response truly exist on the page?\n\n**Cost Controls**\n\n* Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.\n* Support for automatic fallbacks (e.g. use cost-saving GPT-3.5-Turbo by default, fall back to GPT-4 if needed.)\n* Allows setting a budget and stops the scraper if the budget is exceeded.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjamesturk%2Fscrapeghost","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjamesturk%2Fscrapeghost","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjamesturk%2Fscrapeghost/lists"}