{"id":23992769,"url":"https://github.com/scraperai/scraperai","last_synced_at":"2025-04-10T01:08:07.587Z","repository":{"id":226178174,"uuid":"710781843","full_name":"scraperai/scraperai","owner":"scraperai","description":"ScraperAI is an open-source, AI-powered tool designed to simplify web scraping for users of all skill levels.","archived":false,"fork":false,"pushed_at":"2024-09-07T17:54:09.000Z","size":16531,"stargazers_count":133,"open_issues_count":1,"forks_count":13,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-10T01:07:49.298Z","etag":null,"topics":["crawler","langchain","linkedin","openai","parser","parsing","python","requests","scraper","scraping","selenium"],"latest_commit_sha":null,"homepage":"https://docs.scraper-ai.com","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scraperai.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-27T12:31:46.000Z","updated_at":"2025-04-09T15:35:27.000Z","dependencies_parsed_at":"2024-04-25T21:29:05.815Z","dependency_job_id":"6cc7db30-0ca4-40ce-be67-ded4b3cfe59e","html_url":"https://github.com/scraperai/scraperai","commit_stats":null,"previous_names":["iakov-kaiumov/scraperai","scraperai/scraperai"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scraperai%2Fscraperai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scraperai%2Fscraperai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scraperai%2
Fscraperai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scraperai%2Fscraperai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scraperai","download_url":"https://codeload.github.com/scraperai/scraperai/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248137888,"owners_count":21053775,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","langchain","linkedin","openai","parser","parsing","python","requests","scraper","scraping","selenium"],"created_at":"2025-01-07T20:17:59.983Z","updated_at":"2025-04-10T01:08:07.572Z","avatar_url":"https://github.com/scraperai.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cpicture\u003e\n    \u003cimg alt=\"ScraperAI Logo\" height=\"150px\" src=\"https://raw.githubusercontent.com/scraperai/scraperai/main/images/logo.png\"\u003e\n  \u003c/picture\u003e\n\u003c/p\u003e\n\u003ch1 align=\"center\"\u003e\n  ScraperAI\n\u003c/h1\u003e\n\u003cp align=\"center\"\u003e\n    ⚡ Scraping has never been easier ⚡\n\u003c/p\u003e\n\u003ch4 align=\"center\"\u003e\n  \u003ca href=\"https://docs.scraper-ai.com\"\u003eDocumentation\u003c/a\u003e |\n  \u003ca 
href=\"https://scraper-ai.com\"\u003eWebsite\u003c/a\u003e\n\u003c/h4\u003e\n\n[![pages-build-deployment](https://github.com/scraperai/scraperai/actions/workflows/pages/pages-build-deployment/badge.svg)](https://github.com/scraperai/scraperai/actions/workflows/pages/pages-build-deployment)\n[![Publish to pypi](https://github.com/scraperai/scraperai/actions/workflows/cd.yml/badge.svg)](https://github.com/scraperai/scraperai/actions/workflows/cd.yml)\n\n## What is ScraperAI\n\nScraperAI is an open-source, AI-powered tool designed to simplify web scraping for users of all skill levels. \nBy leveraging Large Language Models, such as ChatGPT, ScraperAI extracts data from web pages and generates \nreusable and shareable scraping recipes.\n\n### Features\n- Serializable \u0026 reusable Scraper Configs\n- Automatic data detection\n- Automatic XPATHs detection\n- Automatic pagination \u0026 page type detection\n- HTML minification\n- ChatGPT support\n- Custom LLMs support\n- Selenium support\n- Custom crawlers support\n\n\n### Installation\n\nInstall ScraperAI easily using pip or from the source.\n\nWith pip:\n```console\npip install scraperai\n```\nFrom source: \n```console\ngit clone https://github.com/scraperai/scraperai.git\npip install ./scraperai\n```\n\n### Getting Started\n\n#### Page Type Detector\n\nWeb pages are categorized into four types:\n\n- **Catalog**: Pages with similar repeating elements, such as product lists, articles, companies or table rows.\n- **Details**: Pages detailing information about a single product.\n- **Captcha**: Captcha pages that hinder scraping efforts. Currently, we do not provide solutions to circumvent captchas.\n- **Other**: All other page types not currently supported.\n\nScraperAI primarily uses page screenshots and the GPT-4 Vision model for page type determination, with a fallback algorithm for cases where screenshots or Vision model access is unavailable. 
Users can manually set the page type if known.\n\n#### Pagination Detector\nThis feature applies to catalog-type web pages and supports:\n\n- `xpath`: Xpath of pagination buttons like \"Next page\", \"More\", etc.\n- `scroll`: Infinite scrolling.\n- `urls`: a list of URLs.\n\n#### Catalog Item Detector\nThis feature is specifically designed for catalog-type web pages. It identifies repeating elements that typically \nrepresent individual data items, such as products, articles, or companies. \nThese elements may appear as visually distinct cards or as rows within a table.\n\n#### Fields Extractor\n\nThe Fields Extractor detects relevant information on the page and then \nfinds XPATHs that extract this information efficiently.\nThis tool can be used to retrieve information from individual catalog item cards or from nested details pages.\nWe define two types of data fields within an HTML page:\n\n- **Static fields:** Fields without explicit names, containing single or multiple values (e.g., product names or prices).\n- **Dynamic fields:** Fields with both names and values, typically formatted like table entries.\n\n#### Web Crawler\nOur WebCrawler is engineered to:\n\n- Access web pages.\n- Simulate human actions (clicking, scrolling).\n- Capture screenshots of web pages.\n\nSelenium webdriver is the default tool due to its convenience and ease of use, incorporating techniques to avoid most website blocks. \nUsers can implement their own versions using other tools such as Playwright. \nThe requests package is also supported, albeit with some limitations.\n\n## Demo\n### Jupyter notebook\nWe put examples of basic scraper usage in the `/examples` folder. \nWe recommend starting with the [YCombinator example](https://github.com/scraperai/scraperai/blob/main/examples/ycombinator_full.ipynb). \nIn this notebook we present two experiments:\n1. 
[List of YCombinator companies](https://www.ycombinator.com/companies/)\n2. [List of commits in the repository](https://github.com/scraperai/scraperai/commits/main/)\n\n\n### CLI Application\nScraperAI has a built-in CLI application. Simply run:\n```console\nscraperai --url https://www.ycombinator.com/companies\n```\nor, without a URL:\n```console\nscraperai\n```\n\nFollow the interactive process as ScraperAI attempts to auto-detect page types, pagination, catalog cards and data fields, \nallowing for manual correction of its detections.\nThe CLI currently supports only the OpenAI chat model, requiring an `openai_api_key`. \nThe key can be provided via an environment variable, a `.env` file, or directly to the script.\n\nUse `scraperai --help` for assistance.\n\n# Roadmap\nOur vision for ScraperAI's future includes:\n- Add httpx and aiohttp crawlers\n- Improve recipes \u0026 prompts\n- Release SaaS web app\n- Add support for different LLMs\n- Add [gpt4all](https://github.com/nomic-ai/gpt4all) integration\n- Add anti-captcha integration \n\nWe welcome feature requests and ideas from our community.\n\n# Contributing\nYour contributions are highly appreciated! Feel free to submit pull requests or issues.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscraperai%2Fscraperai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscraperai%2Fscraperai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscraperai%2Fscraperai/lists"}