{"id":27717019,"url":"https://github.com/carlosplanchon/spidercreator","last_synced_at":"2025-09-15T13:04:38.728Z","repository":{"id":285016797,"uuid":"933816617","full_name":"carlosplanchon/spidercreator","owner":"carlosplanchon","description":"Automated web scraping spider generation using Browser Use and LLMs. Streamline the creation of Playwright-based spiders with minimal manual coding. Ideal for large enterprises with recurring data extraction needs.","archived":false,"fork":false,"pushed_at":"2025-06-18T02:24:12.000Z","size":6696,"stargazers_count":82,"open_issues_count":5,"forks_count":11,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-08-23T02:01:11.871Z","etag":null,"topics":["ai","automation","browser-use","crawling","llm","low-code","no-code","python","rpa","scraping","spider","vibe-coding"],"latest_commit_sha":null,"homepage":"http://spidercreator.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/carlosplanchon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-16T18:49:51.000Z","updated_at":"2025-08-19T10:34:38.000Z","dependencies_parsed_at":null,"dependency_job_id":"6564e3a8-9b27-4781-ba27-f64da9a5174a","html_url":"https://github.com/carlosplanchon/spidercreator","commit_stats":null,"previous_names":["carlosplanchon/spidercreator"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/carlosplanchon/spidercreator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carlosplanchon%2Fspidercreator","tags
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carlosplanchon%2Fspidercreator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carlosplanchon%2Fspidercreator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carlosplanchon%2Fspidercreator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/carlosplanchon","download_url":"https://codeload.github.com/carlosplanchon/spidercreator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carlosplanchon%2Fspidercreator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275260506,"owners_count":25433382,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-15T02:00:09.272Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","automation","browser-use","crawling","llm","low-code","no-code","python","rpa","scraping","spider","vibe-coding"],"created_at":"2025-04-27T03:01:40.932Z","updated_at":"2025-09-15T13:04:38.717Z","avatar_url":"https://github.com/carlosplanchon.png","language":"Python","funding_links":[],"categories":["AI Web Scrapers/Crawlers","Specific Applications"],"sub_categories":["Dev Tools"],"readme":"![Spider Creator Banner](assets/spidercreator_banner.png)\n\n\u003ch1 align=\"center\"\u003eGenerate Playwright Spiders 
with AI.\u003c/h1\u003e\n\n[![GitHub stars](https://img.shields.io/github/stars/carlosplanchon/spidercreator?style=social)](https://github.com/carlosplanchon/spidercreator/stargazers)\n[![Discord](https://img.shields.io/discord/1339895894434123777?color=7289DA\u0026label=Discord\u0026logo=discord\u0026logoColor=white)](https://discord.gg/vxJFUhvgfh)\n[![Cloud](https://img.shields.io/badge/Cloud-☁️-blue)](https://services.carlosplanchon.com/spidercreator/)\n[![Twitter Follow](https://img.shields.io/twitter/follow/carlosplanchon?style=social)](https://x.com/carlosplanchon)\n\n\u003cp align=\"center\"\u003e\u003cstrong\u003eAutomated web scraping spider generation using Browser Use and LLMs.\u003cbr\u003e\nGenerate Playwright spiders with minimal technical expertise.\u003c/strong\u003e\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eTHIS LIBRARY IS HIGHLY EXPERIMENTAL\u003c/h1\u003e\n\n## Rationale\n\nExtracting data with LLMs is expensive. Spider Creator offers an alternative where LLMs are only used for the spider creation process, and the spiders themselves can then run using traditional methods, which are very cheap.\nThis tradeoff makes it ideal for users seeking affordable, recurring scraping tasks.\n\n## Costs:\nWe don't have benchmarks yet. However, in internal tests using GPT, initial figures show a cost of around $2.50 per page. If you need to create a spider that takes into account two types of pages (for example, product listings and product details), this would cost around $5.\n\n### DeepWiki Docs: [https://deepwiki.com/carlosplanchon/spidercreator](https://deepwiki.com/carlosplanchon/spidercreator)\n\n## 🚀 Quick Start\n\n#### We recommend using Python 3.13.\n\n⚠️ This package is not on PyPI yet. 
To get started, clone the repo and run your code from its main folder:\n\n```bash\ngit clone https://github.com/carlosplanchon/spidercreator.git\ncd spidercreator\n\n# If you are using uv just run:\nuv sync\n\n# If you run pyenv:\npip install -r requirements.txt\n```\n\nInstall Playwright: \n```bash\nplaywright install chromium\n```\n\nExport your OpenAI API KEY:\n\n```bash\nexport OPENAI_API_KEY=...\n```\n\nGenerate your Playwright spider:\n\n```python\nfrom spidercreator import create_spider\n\n# Define the task prompt\nPRODUCT_LISTING_TASK_PROMPT = \"\"\"\nNavigate to {url} homepage.\nExtract all products on the homepage with its visible attributes.\n\nSelect a small sample of products (e.g., 3–5) on the homepage.\nFor each selected product:\nExtract all visible attributes (price, description, brand, stock status, images, etc.).\nIf a dedicated product page is available (e.g., \"View details\" link), click through and capture any additional attributes.\nStop after you've collected enough products to demonstrate the data extraction (3 to 5 products).\n\"\"\"\n\n# Uruguayan supermarket with product listings:\nurl = \"https://tiendainglesa.com.uy/\"\n\nbrowser_use_task = PRODUCT_LISTING_TASK_PROMPT.format(url=url)\n\n# This function generates a spider\n# and saves it to results/\u003ctask_id\u003e/spider_code.py\ntask_id: str = create_spider(browser_use_task=browser_use_task)\n```\n\nResult:\n\n```python\nfrom playwright.sync_api import sync_playwright\nfrom parsel import Selector\n\nimport prettyprinter\n\nprettyprinter.install_extras()\n\n\nclass TiendaInglesaScraper:\n    def __init__(self, base_url):\n        self.base_url = base_url\n\n    def fetch(self, page, url):\n        page.goto(url)\n        page.wait_for_load_state('networkidle')\n        return page.content()\n\n    def parse_homepage(self, html_content, page):\n        selector = Selector(text=html_content)\n        product_containers = 
selector.xpath(\"//div[contains(@class,'card-product-container')]\")\n\n        products = []\n\n        for container in product_containers:\n            product = {}\n\n            product['name'] = container.xpath(\".//span[contains(@class,'card-product-name')]/text()\").get('').strip()\n            relative_link = container.xpath(\".//a/@href\").get()\n            product['link'] = self.base_url.rstrip('/') + relative_link if relative_link else None\n            product['discount'] = container.xpath(\".//ul[contains(@class,'card-product-promo')]//li[contains(@class,'card-product-badge')]/text()\").get('').strip()\n            product['price_before'] = container.xpath(\".//span[contains(@class,'wTxtProductPriceBefore')]/text()\").get('').strip()\n            product['price_after'] = container.xpath(\".//span[contains(@class,'ProductPrice')]/text()\").get('').strip()\n\n            if product['link']:\n                detailed_html = self.fetch(page, product['link'])\n                detailed_attrs = self.parse_product_page(detailed_html)\n                product.update(detailed_attrs)\n\n            print(\"--- PRODUCT ---\")\n            prettyprinter.cpprint(product)\n\n            products.append(product)\n\n        return products\n\n    def parse_product_page(self, html_content):\n        selector = Selector(text=html_content)\n\n        detailed_attributes = {}\n\n        detailed_attributes['description'] = selector.xpath(\"//span[contains(@class, 'ProductDescription')]/text()\").get('').strip()\n\n        return detailed_attributes\n\n\nif __name__ == \"__main__\":\n    base_url = 'https://www.tiendainglesa.com.uy/'\n\n    scraper = TiendaInglesaScraper(base_url)\n\n    with sync_playwright() as p:\n        browser = p.chromium.launch(headless=False)\n        page = browser.new_page()\n\n        homepage_html = scraper.fetch(page, base_url)\n        products_data = scraper.parse_homepage(homepage_html, page)\n\n        browser.close()\n\n    for idx, 
product in enumerate(products_data, 1):\n        print(f\"\\nProduct {idx}:\")\n        for key, value in product.items():\n            print(f\"{key.title().replace('_', ' ')}: {value}\")\n```\n\n### Examples\n\nFor more working examples, see the [examples folder](examples/).\n\n## ⚙️ How It Works\n\n### Main Workflow\n\n1. The user provides a prompt describing the desired task or data.\n2. The system, leveraging Browser Use, opens a browser and performs the task based on the prompt. The browser activity is recorded.\n3. The Spider Creator module generates a web scraper (spider) from the recorded actions.\n\n### Spider Creator Algorithm\n\nSteps:\n1. Load Browser Use recordings.\n2. Generate a mindmap to visualize the web navigation process.\n3. Make a multi-stage plan for generating XPaths.\n4. For each stage of the plan:\n    1. Given the website HTML, generate a compressed \u0026 chunked DOM representation.\n    2. Traverse the compressed DOM chunk by chunk, select Regions of Interest, and generate candidate spiders based on the intention described in the planning stage.\n    3. Execute candidate spiders in a virtual execution environment (ctxexec).\n    4. Verify execution results and select the best candidate spider for this stage.\n5. Combine spiders from each stage into a final spider.\n6. Save spider code and return task_id.\n\n## Contributing\n\nWe love contributions! Feel free to open issues for bugs or feature requests.\n\n\u003cdiv align=\"center\"\u003e\nMade with ❤️ in Dolores, Uruguay!\n \u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcarlosplanchon%2Fspidercreator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcarlosplanchon%2Fspidercreator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcarlosplanchon%2Fspidercreator/lists"}