{"id":26288697,"url":"https://github.com/kaymen99/ai-web-scraper","last_synced_at":"2026-04-28T09:33:16.576Z","repository":{"id":276887720,"uuid":"930636709","full_name":"kaymen99/ai-web-scraper","owner":"kaymen99","description":"AI web scraper built with Crawl4AI for extracting structured leads data from websites.","archived":false,"fork":false,"pushed_at":"2025-02-13T10:47:33.000Z","size":20,"stargazers_count":14,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-14T22:14:24.321Z","etag":null,"topics":["ai-agents","ai-scraping","crawl4ai","data-scraper","lead-generation","llms","scraper","web-scraper","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kaymen99.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-11T00:37:54.000Z","updated_at":"2025-03-13T02:53:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"ca38085f-6d07-4806-b6e3-a33fff938179","html_url":"https://github.com/kaymen99/ai-web-scraper","commit_stats":null,"previous_names":["kaymen99/llm-web-scraper","kaymen99/ai-web-scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaymen99%2Fai-web-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaymen99%2Fai-web-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaymen99%2Fai-web-scraper/releases","manifests_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/kaymen99%2Fai-web-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kaymen99","download_url":"https://codeload.github.com/kaymen99/ai-web-scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243652693,"owners_count":20325611,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","ai-scraping","crawl4ai","data-scraper","lead-generation","llms","scraper","web-scraper","web-scraping"],"created_at":"2025-03-14T22:14:28.309Z","updated_at":"2025-12-30T12:56:37.272Z","avatar_url":"https://github.com/kaymen99.png","language":"Python","funding_links":[],"categories":["Lead Generation \u0026 Sales"],"sub_categories":["Lead Generation"],"readme":"# AI Web Scraper with Crawl4AI  \n\n### 👉 **[Learn How to Scrape and Build Lead Lists Easily with Crawl4AI!](https://dev.to/kaymen99/scrape-any-website-fast-and-cheap-with-crawl4ai-3fj1)**  \n\nThis project is an AI-powered web scraper built with [**Crawl4AI**](https://docs.crawl4ai.com/). It automates **lead generation** by extracting the names, addresses, phone numbers, and other details of local businesses (dentists, restaurants, and so on) from [**YellowPages**](https://www.yellowpages.ca/). With the help of LLMs like GPT-4o, Claude, and DeepSeek, it processes the scraped data and saves it to **CSV files**, ready for outreach or analysis!  \n\n## Features  \n\n- **Extract Business Information** – Scrape business names, contact details, and other key data.  
\n- **AI-Powered Data Processing** – Use LLMs to clean, format, and enhance the extracted data.  \n- **Customizable Scraper** – Adapt it to different websites and data types.  \n- **Flexible LLM Integration** – Choose from AI models like GPT-4, Claude, and DeepSeek.  \n\n## Adaptability  \n\nThis scraper is designed for **YellowPages** but can be used on **any website**. You can change the target URL, modify the AI instructions to adjust how the data is processed, and define new data fields based on your needs.  \n\n## Potential Use Cases  \n\n- **Lead Generation** – Collect business emails, phone numbers, and addresses to build targeted outreach lists.  \n- **Market Research** – Gather real-time industry data to analyze trends and customer behavior.  \n- **Competitor Analysis** – Monitor pricing, services, and customer reviews to stay competitive.  \n- **AI Data Enrichment** – Use LLMs to clean and categorize data for better insights.  \n- **Research \u0026 Analysis** – Extract structured data from directories, reports, and other sources for business or academic studies.  
\n\n## Project Structure\n\n```\n.\n├── main.py # Main entry point for the crawler\n├── config.py # Contains configuration constants (LLM Models, Base URL, CSS selectors, etc.)\n├── models\n│ └── business.py # Defines the Local Business data model using Pydantic\n├── src\n│ ├── utils.py # Utility functions for processing and saving data\n│ └── scraper.py # Functions for configuring and running the crawler\n└── requirements.txt # Python package dependencies\n```\n\n# How to Run\n## Prerequisites\nEnsure you have the following:\n- Python 3.11+\n- LLM provider API key (OpenAI, Gemini, Claude, ...)\n- Necessary Python libraries (listed in `requirements.txt`)\n\n## Setup\n### Clone the Repository\n```bash\ngit clone https://github.com/kaymen99/ai-web-scraper\ncd ai-web-scraper\n```\n\n### Create and Activate a Virtual Environment\n```bash\npython -m venv venv\nsource venv/bin/activate  # On Windows use `venv\\Scripts\\activate`\n```\n\n### Install Required Packages\n```bash\npip install -r requirements.txt\nplaywright install\n```\n\n### Set Up Environment Variables\nCreate a `.env` file in the root directory and add the necessary credentials:\n\n```ini\n# API keys for LLM providers; add a key for each provider you want to use\nOPENAI_API_KEY=\"\"            # OpenAI API key for accessing OpenAI's models and services\nGEMINI_API_KEY=\"\"            # Google API key for accessing Gemini models\nGROQ_API_KEY=\"\"              # Groq API key for accessing models hosted on Groq\n```\n\n## Running the scraper\n\nTo start the scraper, run:\n\n```bash\npython main.py\n```\n\nThe script will crawl the specified website, extract data page by page, and save the complete business records to a `businesses_data.csv` file in the project directory. Additionally, usage statistics for the LLM strategy will be displayed after crawling.\n\n## Configuration  \n\nThe `config.py` file contains key settings for controlling the scraper's behavior. 
You can modify these values to customize the scraping process:  \n\n- **LLM_MODEL**: The AI model used for data extraction. Supports any model available through **LiteLLM** (e.g., `gpt-4o`, `claude`, `deepseek-chat`, `gemini-2.0-flash`). \n- **BASE_URL**: The target website to scrape. By default, it extracts **dentists in Toronto** from Yellow Pages, but you can change this to any business category or location.  \n- **CSS_SELECTOR**: The CSS selector used to locate business details within the page.  \n- **MAX_PAGES**: Limits the number of pages to crawl (default: `3`). Increase this value to scrape more data.  \n- **SCRAPER_INSTRUCTIONS**: Custom LLM prompt defining which details to extract.\n\n# Contributing\nContributions are welcome! Please open an issue or submit a pull request for any changes.\n\n# Contact\nIf you have any questions or suggestions, feel free to contact me at `aymenMir1001@gmail.com`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkaymen99%2Fai-web-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkaymen99%2Fai-web-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkaymen99%2Fai-web-scraper/lists"}