{"id":24884175,"url":"https://github.com/hmshb/scraping-agent-ai","last_synced_at":"2025-07-28T17:08:22.733Z","repository":{"id":274948742,"uuid":"924510490","full_name":"hmshb/scraping-agent-ai","owner":"hmshb","description":"AI-powered web scraping agent built with LangGraph, LangSmith, Firecrawl, and Anthropic AI. Automates intelligent crawling, structured data extraction, and LLM-powered content formatting. Efficiently handles anti-bot mechanisms, error recovery, and batch processing. 🚀","archived":false,"fork":false,"pushed_at":"2025-02-12T07:43:08.000Z","size":479,"stargazers_count":0,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-04T23:09:10.222Z","etag":null,"topics":["agentic-ai","ai","ai-agent","ai-agents","anthropic-claude","bots","firecrawl","generative-ai","langchain","langgraph","llms","nlp","scraper","scrapping-php","scrapping-python","web","web-scraper","web-scraping","workflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hmshb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-30T06:18:10.000Z","updated_at":"2025-03-23T08:03:02.000Z","dependencies_parsed_at":"2025-06-04T19:08:44.074Z","dependency_job_id":"af6df3fb-3fa9-40d4-a9e7-537f5ba6ab70","html_url":"https://github.com/hmshb/scraping-agent-ai","commit_stats":null,"previous_names":["hmshb/scraping-agent-ai"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/hmshb/scraping-agent-ai","repositor
y_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmshb%2Fscraping-agent-ai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmshb%2Fscraping-agent-ai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmshb%2Fscraping-agent-ai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmshb%2Fscraping-agent-ai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hmshb","download_url":"https://codeload.github.com/hmshb/scraping-agent-ai/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hmshb%2Fscraping-agent-ai/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267552097,"owners_count":24106000,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-28T02:00:09.689Z","response_time":68,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","ai","ai-agent","ai-agents","anthropic-claude","bots","firecrawl","generative-ai","langchain","langgraph","llms","nlp","scraper","scrapping-php","scrapping-python","web","web-scraper","web-scraping","workflow"],"created_at":"2025-02-01T14:19:39.678Z","updated_at":"2025-07-28T17:08:22.628Z","avatar_url":"https://github.com/hmshb.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🚀 Scraping Agent 
AI\n\n![Python](https://img.shields.io/badge/Python-3.8%2B-blue) ![Web Scraping](https://img.shields.io/badge/Web%20Scraping-AI-red) ![Automation](https://img.shields.io/badge/Automation-Smart-green)\n\nAn AI-powered **web scraping agent** that automates data extraction from websites with intelligent crawling, anti-bot detection, and structured data parsing. Built using **LangGraph**, **LangSmith**, **Firecrawl**, and **Anthropic AI tools** for seamless AI-driven web scraping and structured data processing.\n\n## 🔥 Features\n- **Graph-based AI Agent**: Uses LangGraph for managing scraping workflows.\n- **Intelligent Web Crawling**: Powered by Firecrawl to extract structured data.\n- **LLM-Powered Formatting**: Uses Anthropic AI for content summarization.\n- **Adaptive Error Handling**: Retries failed requests dynamically.\n- **Batch Processing**: Efficiently processes multiple URLs with batched requests.\n- **Flexible Output Formats**: Supports JSON, Markdown, and more.\n\n---\n\n[![Demo](https://img.shields.io/badge/Demo-Live-blue?style=for-the-badge)](demo.gif)\n\n![demo.png](demo.png)\n\n---\n\n## 🛠️ Setup Instructions\n\nFollow these steps to set up and run the project on your local machine:\n\n### 1. Clone the Repository\n```bash\ngit clone https://github.com/hmshb/scraping-agent-ai\ncd scraping-agent-ai\n```\n\n### 2. Create a Virtual Environment\n```bash\npython -m venv venv\nsource venv/bin/activate  # For Linux/Mac\n\n.\\venv\\Scripts\\activate # For Windows\n```\n\n### 3. Install LangGraph CLI\n```bash\npip install -U \"langgraph-cli[inmem]\"\n```\n\n![img.png](img.png)\n\n### 4. Install Other Dependencies\n```bash\npip install -e .\n```\n\n![img_1.png](img_1.png)\n\n---\n\n### 5. Generate LangSmith API Key\n1. Visit [LangSmith](https://smith.langchain.com/settings).\n2. Create an API key for accessing LangSmith logs.\n3. Copy the generated API key.\n\n![img_2.png](img_2.png)\n\n---\n\n### 6. Generate Anthropic Claude API Key\n1. 
Visit [Anthropic](https://console.anthropic.com/settings/keys).\n2. Create an API key for accessing Claude.\n3. Copy the generated API key.\n\n![img_3.png](img_3.png)\n\n---\n\n### 7. Configure the Environment Variables\n - Copy the `.env.example` file and rename it to `.env`:\n   ```bash\n   cp .env.example .env\n   ```\n - Open the `.env` file and update the API keys and configuration values:\n   ```\n   LANGSMITH_PROJECT=scrapping-agent\n   LANGSMITH_API_KEY=your_api_key_here\n   ANTHROPIC_API_KEY=your_api_key_here\n   FIRECRAWL_API_KEY=your_api_key_here\n   URL_LIMIT=10\n   BATCH_LIMIT=5\n   ```\n\n### 8. Run the project\n```bash\nlanggraph dev\n```\n\n![img_4.png](img_4.png)\n\n---\n\n### 9. LangGraph of the AI Agent\n\n![img_5.png](img_5.png)\n\n---\n\n## 📂 Project Structure\n```\nscraping-agent-ai/\n├── .env                 # API key configuration file\n├── agent/               # Main AI scraping agent module\n│   ├── utils/           # Utility modules for various tasks\n│   │   ├── constants.py # Constants for scraping tasks\n│   │   ├── firecrawl.py # Firecrawl integration\n│   │   ├── graph.py     # LangGraph-based workflow\n│   │   ├── helpers.py   # Utility functions\n│   │   ├── llm.py       # LLM-powered formatting\n│   │   ├── nodes.py     # Graph-based nodes\n│   │   ├── states.py    # Scraping state management\n│   ├── agent.py         # AI-driven scraping workflow\n├── langgraph.json       # LangGraph configuration file\n├── pyproject.toml       # Python project metadata\n├── README.md            # Documentation file\n├── scraped_data.json    # This will have the final data\n├── venv/                # Virtual environment\n```\n\n---\n\n## ⭐ Acknowledgments\n\nSpecial thanks to:\n\n- **[LangGraph](https://langchain-ai.github.io/langgraph/)** for building graph-based AI workflows.\n- **[LangSmith](https://www.langchain.com/langsmith)** for debugging and monitoring AI agents.\n- **[Firecrawl](https://firecrawl.com/)** for powerful web crawling and 
data extraction.\n- **[Anthropic AI](https://www.anthropic.com/claude)** for AI-powered text summarization and formatting.\n\n---\n\n## 📜 License\n\nThis project is open-source and licensed under the [MIT License](LICENSE).\n\n---\n\n## 📢 Get Involved!\n\nIf you find this repository helpful, please consider:\n\n- ⭐ **Starring the Repository** to show your support.\n- 📤 **Forking the Repository** to explore further and make your own customizations.\n- 💬 **Sharing Your Feedback** by opening issues or discussions.\n\n---\n\n## 📝 Notes\n\n**LangGraph**, **LangSmith**, **Claude**, and **Firecrawl** are currently in limited or preview release (depending on your region and timing), and integration details may change as these services evolve.\n\nAlways refer to the official documentation for the most up-to-date guidance.\n\n### Let's build smart, scalable AI-powered web scrapers together! 🚀\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhmshb%2Fscraping-agent-ai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhmshb%2Fscraping-agent-ai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhmshb%2Fscraping-agent-ai/lists"}