{"id":21836756,"url":"https://github.com/bes-dev/gpt-scraper","last_synced_at":"2025-04-14T09:41:22.345Z","repository":{"id":260005545,"uuid":"879974242","full_name":"bes-dev/gpt-scraper","owner":"bes-dev","description":"An autonomous LLM-based agent that generates code to extract structured information from web pages and extracts it.","archived":false,"fork":false,"pushed_at":"2024-10-30T12:13:26.000Z","size":42,"stargazers_count":10,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-08T16:48:10.998Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bes-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-28T22:11:29.000Z","updated_at":"2025-03-04T13:08:02.000Z","dependencies_parsed_at":"2024-10-29T02:47:13.675Z","dependency_job_id":"5278e492-17f6-4e55-864a-cb3b648f7db4","html_url":"https://github.com/bes-dev/gpt-scraper","commit_stats":null,"previous_names":["bes-dev/gpt-scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bes-dev%2Fgpt-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bes-dev%2Fgpt-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bes-dev%2Fgpt-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bes-dev%2Fgpt-scraper/manifests","owner_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/owners/bes-dev","download_url":"https://codeload.github.com/bes-dev/gpt-scraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248855769,"owners_count":21172640,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-27T20:42:27.308Z","updated_at":"2025-04-14T09:41:22.320Z","avatar_url":"https://github.com/bes-dev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GPT-Scraper\n\nGPT-Scraper is an autonomous, LLM-based agent that generates code to extract structured information from web pages.\nIt is specifically designed to facilitate the process of web scraping using advanced language models such as GPT-4.\nThis project aims to simplify the extraction of data from web pages by converting user-defined requirements into Python code that executes the desired web scraping tasks.\n\n## Features\n\n- **Dynamic Code Generation**: Generates Python parsing code based on user requirements and webpage content.\n- **Flexible Data Structures**: Supports the use of Pydantic models to define the structure of the scraped data.\n- **Webpage Source Handling**: Capable of extracting HTML content from both static and dynamic web pages using Selenium.\n\n## Installation\n\n### Prerequisites\n\n- **Python 3.6 or higher**: Ensure you have Python installed. You can download it from the [official website](https://www.python.org/downloads/).\n- **ChromeDriver**: Selenium requires ChromeDriver to interact with the Chrome browser. 
Download it from [here](https://sites.google.com/a/chromium.org/chromedriver/downloads) and ensure it's in your system's PATH.\n\n### Install from git\n\n```bash\n$ pip install git+https://github.com/bes-dev/gpt-scraper.git\n```\n\n### Install from pip\n\n```bash\n$ pip install gpt-scraper\n```\n\n## CLI Tool Usage\n\n### Commands\n\n```bash\n$ gpt-scraper --help\nusage: gpt-scraper [-h] (--requirements REQUIREMENTS | --scraper-file SCRAPER_FILE) --url URL [--output OUTPUT] [--wait-by {id,xpath,css_selector}] [--wait-value WAIT_VALUE]\n                   [--save-file SAVE_FILE] [--model-name MODEL_NAME] [--simplify-html] [--use-sandbox]\n\nGPT-Scraper CLI\n\noptions:\n  -h, --help            show this help message and exit\n  --requirements REQUIREMENTS\n                        Scraping requirements\n  --scraper-file SCRAPER_FILE\n                        Path to the scraper file to load\n  --url URL             URL of the webpage to scrape\n  --output OUTPUT       Output file path to save scraped data as JSON\n  --wait-by {id,xpath,css_selector}\n                        Type of locator to wait for\n  --wait-value WAIT_VALUE\n                        Value of the locator to wait for\n  --save-file SAVE_FILE\n                        Path to save the created GPTScraper to file\n  --model-name MODEL_NAME\n                        Name of the model to use for scraping\n  --simplify-html       Simplify the HTML content before parsing\n  --use-sandbox         Use the sandboxed environment for parsing\n\n```\n\n### Sample session\n\n```bash\n$ gpt-scraper --url https://news.ycombinator.com/ --requirements 'extract threads list from the web page (extract link and title)' --save-file hn.py --model-name o1-mini\n2024-10-29 05:23:25,989 [INFO] Fetching page content from URL: https://news.ycombinator.com/\n2024-10-29 05:23:25,989 [INFO] Attempt 1 to fetch URL: https://news.ycombinator.com/\n2024-10-29 05:23:27,915 [INFO] Successfully fetched page source for URL: 
https://news.ycombinator.com/\n2024-10-29 05:23:27,977 [INFO] Generating parser using GPTScraper.\n2024-10-29 05:23:34,517 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n2024-10-29 05:23:34,605 [INFO] Saving scraper to file: hn.json\n2024-10-29 05:23:34,605 [INFO] Scraper saved successfully.\n2024-10-29 05:23:34,605 [INFO] Parsing HTML content.\n2024-10-29 05:23:34,638 [INFO] Printing scraped data:\n[\n    {\n        \"title\": \"Excel Turing Machine (2013)\",\n        \"link\": \"https://www.felienne.com/archives/2974\"\n    },\n    {\n        \"title\": \"High-resolution postmortem human brain MRI at 7 tesla\",\n        \"link\": \"https://pulkit-khandelwal.github.io/exvivo-brain-upenn/\"\n    },\n    {\n        \"title\": \"How Gothic architecture became spooky\",\n        \"link\": \"https://www.architecturaldigest.com/story/how-gothic-architecture-became-spooky\"\n    },\n    {\n        \"title\": \"Using reinforcement learning and $4.80 of GPU time to find the best HN post\",\n        \"link\": \"https://openpipe.ai/blog/hacker-news-rlhf-part-1\"\n    }\n]\n```\n\n## Example\n\n```python\nfrom gpt_scraper import GPTScraper\nfrom gpt_scraper.selenium_utils import fetch_dynamic_page\nfrom pydantic import BaseModel\n\nclass Data(BaseModel):\n    title: str\n    url: str\n\npage_source = fetch_dynamic_page(\"https://news.ycombinator.com/\")\nscraper = GPTScraper.from_html(\n    page_source,\n    \"extract threads list from the web page (extract link and title)\",\n    data_structure=Data,\n    model_name=\"o1-mini\"\n)\ndata = scraper.parse_html(page_source, use_sandbox=True)\nprint(data)\n```\n\n\n## Disclaimer\nThis application assists users in generating code with AI.\nWhile a sandbox environment with limited system access is provided for added security, we cannot guarantee complete protection.\nWe strongly recommend executing all generated code within the provided sandbox environment to help minimize potential 
risks.\nHowever, users should not rely on the sandbox as an absolute security measure.\n\nThe development team is not liable for any consequences resulting from the generated code, including system damage, data loss, or any incurred losses.\nBy using this application, you acknowledge and accept all risks associated with the generated code and assume full responsibility for any potential impact on your system.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbes-dev%2Fgpt-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbes-dev%2Fgpt-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbes-dev%2Fgpt-scraper/lists"}