{"id":26896095,"url":"https://github.com/rahulsamant37/ai-scraper","last_synced_at":"2025-08-22T13:03:49.161Z","repository":{"id":270039543,"uuid":"909158296","full_name":"rahulsamant37/AI-Scraper","owner":"rahulsamant37","description":"Universal Web Scraping AI Processing Pipeline: A dynamic, AI-powered web scraping and data extraction system with multi-model support, advanced text processing, and flexible output options for efficient data analysis.","archived":false,"fork":false,"pushed_at":"2024-12-27T23:59:13.000Z","size":1414,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-01T02:59:34.333Z","etag":null,"topics":["gemini-api","groq-api","langchain-python","playwright-python","pydantic","rag","selenium-python","streamlit"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rahulsamant37.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-12-27T22:12:48.000Z","updated_at":"2025-03-24T02:15:44.000Z","dependencies_parsed_at":"2025-08-22T13:01:57.252Z","dependency_job_id":null,"html_url":"https://github.com/rahulsamant37/AI-Scraper","commit_stats":null,"previous_names":["rahulsamant37/ai-scraper"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rahulsamant37/AI-Scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahulsamant37%2FAI-Scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahuls
amant37%2FAI-Scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahulsamant37%2FAI-Scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahulsamant37%2FAI-Scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rahulsamant37","download_url":"https://codeload.github.com/rahulsamant37/AI-Scraper/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahulsamant37%2FAI-Scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271643440,"owners_count":24795440,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-22T02:00:08.480Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gemini-api","groq-api","langchain-python","playwright-python","pydantic","rag","selenium-python","streamlit"],"created_at":"2025-04-01T02:59:37.642Z","updated_at":"2025-08-22T13:03:49.046Z","avatar_url":"https://github.com/rahulsamant37.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🌟 Universal Web Scraping - AI Processing Pipeline\r\n\r\n## 🎓 Infosys Springboard Internship\r\n\r\nExcited to present the completion of my Infosys Springboard Internship Milestone 3! 
This project combines advanced web scraping with AI-powered data processing to create a flexible, robust data extraction pipeline.\r\n\r\n## 🚀 Features\r\n\r\n- **Multi-Provider AI Integration**: Support for OpenAI, Google Gemini, Llama, and Groq\r\n- **Smart Web Scraping**: Selenium-based scraping with intelligent scroll handling\r\n- **Advanced Text Processing**: Customizable chunking with overlap control\r\n- **Dynamic Model Generation**: Creates data models based on user-defined fields\r\n- **Multiple Export Formats**: JSON, CSV, Excel, and Markdown output options\r\n- **Cost Tracking**: Automated token counting and cost calculation\r\n- **User-Friendly Interface**: Streamlit-based UI with intuitive controls\r\n\r\n## 🛠️ Technologies and Tools Used\r\n\r\n- **Python**: Core scripting language for logic and data handling\r\n- **Selenium \u0026 Playwright**: Dynamic web scraping and content handling\r\n- **Pydantic**: Data processing, model generation, and validation\r\n- **Streamlit**: Creating an intuitive and interactive user interface\r\n- **LangChain \u0026 LangSmith**: For structured AI-driven data extraction and workflow tracking\r\n- **ChatGoogleGenerativeAI \u0026 ChatGroq**: Enhancing AI model efficiency and accuracy\r\n\r\n## 📊 System Architecture\r\n\r\n### Dynamic Container Model\r\n```mermaid\r\ngraph TD\r\n    A[User Input Fields] --\u003e|Example Input| B[\"Fields = ['price', 'title', 'description']\"]\r\n    \r\n    subgraph Dynamic_Listing_Model[Dynamic Listing Model Creation]\r\n        B --\u003e C[Create Single Item Structure]\r\n        C --\u003e|Creates| D[Pydantic Model]\r\n        D --\u003e E[\"Single Item Schema:\r\n        {\r\n            'price': string,\r\n            'title': string,\r\n            'description': string\r\n        }\"]\r\n    end\r\n    \r\n    subgraph Container_Model[Container Model Creation]\r\n        E --\u003e F[Create Container Structure]\r\n        F --\u003e|Wraps Items| G[\"Final Schema:\r\n        {\r\n 
           'listings': [\r\n                {item1},\r\n                {item2},\r\n                {item3},\r\n                ...\r\n            ]\r\n        }\"]\r\n    end\r\n    \r\n    H[Real World Example] --\u003e I[\"User wants to scrape:\r\n    - Product Name\r\n    - Price\r\n    - Rating\"]\r\n    \r\n    I --\u003e J[\"Creates Model:\r\n    {\r\n        'listings': [\r\n            {\r\n                'Product Name': 'iPhone 13',\r\n                'Price': '$799',\r\n                'Rating': '4.5'\r\n            },\r\n            {\r\n                'Product Name': 'Galaxy S21',\r\n                'Price': '$699',\r\n                'Rating': '4.3'\r\n            }\r\n        ]\r\n    }\"]\r\n    \r\n    style Dynamic_Listing_Model fill:#ffd,stroke:#333\r\n    style Container_Model fill:#dff,stroke:#333\r\n```\r\n\r\n### AI Processing Pipeline\r\n```mermaid\r\ngraph TD\r\n    A[Start] --\u003e B[User Interface Setup]\r\n    B --\u003e|Initialize| C[Streamlit Components]\r\n    \r\n    subgraph UI_Components[User Interface Components]\r\n        C --\u003e D1[URL Input Field]\r\n        C --\u003e D2[Model Selection Dropdown]\r\n        C --\u003e D3[Fields Input Tags]\r\n        C --\u003e D4[Chunk Size Slider]\r\n        C --\u003e D5[Chunk Overlap Slider]\r\n    end\r\n    \r\n    UI_Components --\u003e E[Scrape Button Clicked]\r\n    \r\n    E --\u003e F[Setup Selenium]\r\n    F --\u003e|Configure| F1[Set User Agent]\r\n    F --\u003e|Configure| F2[Set Headless Options]\r\n    F --\u003e|Initialize| F3[Chrome WebDriver]\r\n    \r\n    F3 --\u003e G[Fetch HTML]\r\n    G --\u003e|Selenium Actions| G1[Load Page]\r\n    G1 --\u003e G2[Scroll Page]\r\n    G2 --\u003e G3[Get Page Source]\r\n    \r\n    G3 --\u003e H[Clean HTML]\r\n    H --\u003e|BeautifulSoup| H1[Remove Headers]\r\n    H1 --\u003e|BeautifulSoup| H2[Remove Footers]\r\n    \r\n    H2 --\u003e I[Convert to Markdown]\r\n    I --\u003e|html2text| I1[Raw Markdown Text]\r\n    \r\n    I1 
--\u003e J[Text Chunking]\r\n    J --\u003e|RecursiveCharacterTextSplitter| J1[Text Chunks]\r\n    \r\n    style A fill:#f9f,stroke:#333\r\n    style E fill:#bbf,stroke:#333\r\n    style J1 fill:#bfb,stroke:#333\r\n```\r\n\r\n## UI\r\n![UI-View](https://github.com/rahulsamant37/AI-Scraper/blob/main/data/UI.png)\r\n\r\n## 🔄 Web Scraping Workflow\r\n\r\n### 1️⃣ URL Retrieval\r\n- Utilized Selenium with randomized user agents for anonymity\r\n- Automated cookie consent handling for seamless navigation\r\n- Implemented dynamic scrolling to load complex page content\r\n- Captured the full HTML source for further processing\r\n\r\n### 2️⃣ HTML Processing\r\n- Cleaned HTML by removing headers, footers, and unnecessary elements\r\n- Converted HTML to markdown format using html2text\r\n- Removed URLs and preserved only meaningful content\r\n\r\n### 3️⃣ Data Extraction Strategy\r\n- Generated dynamic models based on user-specified fields using Pydantic\r\n- Integrated multiple AI models for intelligent extraction:\r\n  - GPT-4\r\n  - Gemini-1.5 Flash\r\n  - Llama3.1 (Local/Groq)\r\n- Designed chunk-based processing for large content\r\n- Produced structured JSON outputs\r\n\r\n### 4️⃣ Token \u0026 Cost Management\r\n- Tracked input and output tokens across models\r\n- Calculated per-model costs with different pricing schemes\r\n- Provided transparent cost metrics\r\n\r\n### 5️⃣ Output Options\r\n- Exported results in JSON, CSV, and Excel formats\r\n- Preserved markdown versions for documentation\r\n- Enabled comprehensive logging\r\n\r\n## Output\r\n![Output-View](https://github.com/rahulsamant37/AI-Scraper/blob/main/data/Output.gif)\r\n\r\n## ⚙️ Unique Aspects\r\n\r\n- **Adaptive Extraction**: Models adjust dynamically to user specifications\r\n- **Multi-Model Support**: Flexible AI model selection\r\n- **Transparent Token Tracking**: Detailed usage and cost insights\r\n\r\n## 🚀 Future Enhancements\r\n\r\n- Transitioning to a scalable backend using FastAPI\r\n- Leveraging 
LangGraph for graph-based AI visualizations\r\n\r\n## 📚 Learning Resources\r\n\r\n- Web Scraping: @John Watson Rooney YouTube Channel\r\n- LangChain \u0026 AI: **Krish Naik** Sir's Udemy Course\r\n- Documentation: The ultimate teacher!\r\n\r\n## 🔧 Installation\r\n\r\n```bash\r\n# Clone the repository\r\ngit clone https://github.com/rahulsamant37/AI-Scraper.git\r\ncd AI-Scraper\r\n\r\n# Install dependencies\r\npip install -r requirements.txt\r\n\r\n# Set up environment variables\r\ncp .env.example .env\r\n# Edit .env with your API keys\r\n\r\n# Run the application\r\nstreamlit run app.py\r\n```\r\n\r\n## Resources Followed\r\n\r\n- Mr. Krish Naik for his comprehensive AI courses\r\n- John Watson Rooney for web scraping tutorials\r\n- Fellow interns for their collaboration and support\r\n\r\n## 📜 License\r\nThis project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.\r\n\r\n## 🙏 Acknowledgments\r\nI want to express my sincere gratitude to:\r\n### Infosys Springboard Team\r\n\r\n- The mentors who provided invaluable guidance throughout the internship\r\n- The technical team for their support in overcoming challenges\r\n- The program coordinators for organizing this learning opportunity\r\n\r\n### Technical Community\r\n\r\n- The open-source community for providing excellent tools and libraries\r\n- Stack Overflow contributors for their helpful solutions\r\n- GitHub community for code examples and inspiration\r\n\r\n\r\n## 🤝 Connect With Me\r\n\r\nI'd love to hear your thoughts and suggestions! 
Feel free to connect and share your ideas.\r\n\r\n## Contact Information\r\nFor questions or collaboration opportunities:\r\n\r\n[![Email](https://img.shields.io/badge/Email-D14836?style=for-the-badge\u0026logo=gmail\u0026logoColor=white)](mailto:rahulsamantcoc2@gmail.com)  [![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge\u0026logo=github\u0026logoColor=white)](https://github.com/rahulsamant37/)  [![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge\u0026logo=linkedin\u0026logoColor=white)](https://www.linkedin.com/in/rahul-samant-kb37/)\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frahulsamant37%2Fai-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frahulsamant37%2Fai-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frahulsamant37%2Fai-scraper/lists"}