# 🌟 Universal Web Scraping - AI Processing Pipeline

## 🎓 Infosys Springboard Internship

I'm excited to present Milestone 3 of my Infosys Springboard Internship! This project combines advanced web scraping with AI-powered data processing to create a flexible, robust data extraction pipeline.

## 🚀 Features

- **Multi-Provider AI Integration**: Support for OpenAI, Google Gemini, Llama, and Groq
- **Smart Web Scraping**: Selenium-based scraping with intelligent scroll handling
- **Advanced Text Processing**: Customizable chunking with overlap control
- **Dynamic Model Generation**: Creates data models based on user-defined fields
- **Multiple Export Formats**: JSON, CSV, Excel, and Markdown output options
- **Cost Tracking**: Automated token counting and cost calculation
- **User-Friendly Interface**: Streamlit-based UI with intuitive controls

## 🛠️ Technologies and Tools Used

- **Python**: Core scripting language for logic and data handling
- **Selenium & Playwright**: Dynamic web scraping and content handling
- **Pydantic**: Data processing, model generation, and validation
- **Streamlit**: Creating an intuitive and interactive user interface
- **LangChain & LangSmith**: Structured AI-driven data extraction and workflow tracing
- **ChatGoogleGenerativeAI & ChatGroq**: LangChain chat-model integrations for the Gemini and Groq APIs

## 📊 System Architecture

### Dynamic Container Model
```mermaid
graph TD
A[User Input Fields] -->|Example Input| B["Fields = ['price', 'title', 'description']"]

subgraph Dynamic_Listing_Model[Dynamic Listing Model Creation]
B --> C[Create Single Item Structure]
C -->|Creates| D[Pydantic Model]
D --> E["Single Item Schema:
{
'price': string,
'title': string,
'description': string
}"]
end

subgraph Container_Model[Container Model Creation]
E --> F[Create Container Structure]
F -->|Wraps Items| G["Final Schema:
{
'listings': [
{item1},
{item2},
{item3},
...
]
}"]
end

H[Real World Example] --> I["User wants to scrape:
- Product Name
- Price
- Rating"]

I --> J["Creates Model:
{
'listings': [
{
'Product Name': 'iPhone 13',
'Price': '$799',
'Rating': '4.5'
},
{
'Product Name': 'Galaxy S21',
'Price': '$699',
'Rating': '4.3'
}
]
}"]

style Dynamic_Listing_Model fill:#ffd,stroke:#333
style Container_Model fill:#dff,stroke:#333
```
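
As a minimal sketch (illustrative names, not the project's exact code), this two-step creation maps directly onto Pydantic's `create_model`:

```python
from typing import List

from pydantic import create_model

def build_listing_model(field_names: List[str]) -> type:
    """Single-item model: every user-defined field becomes a string."""
    fields = {name: (str, ...) for name in field_names}
    return create_model("DynamicListingModel", **fields)

def build_container_model(item_model: type) -> type:
    """Container model wrapping items in a 'listings' list."""
    return create_model("DynamicListingsContainer", listings=(List[item_model], ...))

ListingModel = build_listing_model(["price", "title", "description"])
Container = build_container_model(ListingModel)
print(Container.model_json_schema())  # Pydantic v2; v1 uses .schema()
```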

### AI Processing Pipeline
```mermaid
graph TD
A[Start] --> B[User Interface Setup]
B -->|Initialize| C[Streamlit Components]

subgraph UI_Components[User Interface Components]
C --> D1[URL Input Field]
C --> D2[Model Selection Dropdown]
C --> D3[Fields Input Tags]
C --> D4[Chunk Size Slider]
C --> D5[Chunk Overlap Slider]
end

UI_Components --> E[Scrape Button Clicked]

E --> F[Setup Selenium]
F -->|Configure| F1[Set User Agent]
F -->|Configure| F2[Set Headless Options]
F -->|Initialize| F3[Chrome WebDriver]

F3 --> G[Fetch HTML]
G -->|Selenium Actions| G1[Load Page]
G1 --> G2[Scroll Page]
G2 --> G3[Get Page Source]

G3 --> H[Clean HTML]
H -->|BeautifulSoup| H1[Remove Headers]
H1 -->|BeautifulSoup| H2[Remove Footers]

H2 --> I[Convert to Markdown]
I -->|html2text| I1[Raw Markdown Text]

I1 --> J[Text Chunking]
J -->|RecursiveCharacterTextSplitter| J1[Text Chunks]

style A fill:#f9f,stroke:#333
style E fill:#bbf,stroke:#333
style J1 fill:#bfb,stroke:#333
```
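
The chunking step at the end of this pipeline can be sketched with LangChain's `RecursiveCharacterTextSplitter`; the size and overlap values below are placeholders for the UI sliders:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

markdown_text = "..."  # cleaned markdown produced by the earlier steps

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,    # max characters per chunk (chunk size slider)
    chunk_overlap=200,  # shared characters between neighbours (overlap slider)
)
chunks = splitter.split_text(markdown_text)
```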

## UI
![UI-View](https://github.com/rahulsamant37/AI-Scraper/blob/main/data/UI.png)

## 🔄 Web Scraping Workflow

### 1️⃣ URL Retrieval
- Utilized Selenium with randomized user agents for anonymity
- Automated cookie consent handling for seamless navigation
- Implemented dynamic scrolling to load complex page content
- Captured the full HTML source for further processing (see the sketch below)
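
A minimal sketch of this step, assuming headless Chrome via Selenium (the user-agent pool and scroll count are illustrative):

```python
import random
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

USER_AGENTS = [  # illustrative; the real pool can be much larger
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def fetch_html(url: str) -> str:
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Scroll in stages so lazily loaded content has time to render.
        for _ in range(3):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)
        return driver.page_source
    finally:
        driver.quit()
```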

### 2️⃣ HTML Processing
- Cleaned HTML by removing headers, footers, and unnecessary elements
- Converted HTML to markdown format using html2text
- Removed URLs and preserved only meaningful content (see the sketch below)
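
A sketch of this cleaning step with BeautifulSoup and html2text (the tag list and converter options are assumptions, not the project's exact settings):

```python
import html2text
from bs4 import BeautifulSoup

def html_to_markdown(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Strip boilerplate elements before conversion.
    for tag in soup.find_all(["header", "footer", "nav", "script", "style"]):
        tag.decompose()
    converter = html2text.HTML2Text()
    converter.ignore_links = True   # drop URLs, keep readable text
    converter.ignore_images = True
    return converter.handle(str(soup))
```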

### 3️⃣ Data Extraction Strategy
- Generated dynamic models based on user-specified fields using Pydantic
- Integrated multiple AI models for intelligent extraction:
  - GPT-4
  - Gemini 1.5 Flash
  - Llama 3.1 (local / Groq)
- Designed chunk-based processing for large content
- Produced structured JSON outputs, as sketched below
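
A sketch of one extraction call, reusing the `Container` model from the earlier sketch and LangChain's structured-output interface (model name and prompt are illustrative):

```python
from langchain_google_genai import ChatGoogleGenerativeAI

# 'Container' is the dynamic Pydantic model built earlier;
# 'chunk' is one piece of markdown from the text splitter.
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
structured_llm = llm.with_structured_output(Container)

result = structured_llm.invoke(
    "Extract every listing from the following page content:\n\n" + chunk
)
print(result.listings)
```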

### 4️⃣ Token & Cost Management
- Tracked input and output tokens across models
- Calculated per-model costs with different pricing schemes
- Provided transparent cost metrics (illustrative calculation below)
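
The calculation itself is simple: token counts times per-model rates. The prices below are placeholders, not current provider pricing:

```python
# Illustrative USD prices per million tokens; real rates vary by provider and date.
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gemini-1.5-flash": {"input": 0.075, "output": 0.30},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = PRICING[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
```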

### 5️⃣ Output Options
- Exported results in JSON, CSV, and Excel formats (see the sketch below)
- Preserved markdown versions for documentation
- Enabled comprehensive logging
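
A minimal export sketch with pandas (`to_excel` assumes `openpyxl` is installed; file names are illustrative):

```python
import json

import pandas as pd

def export_results(rows: list, stem: str = "output") -> None:
    with open(f"{stem}.json", "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2, ensure_ascii=False)
    df = pd.DataFrame(rows)
    df.to_csv(f"{stem}.csv", index=False)
    df.to_excel(f"{stem}.xlsx", index=False)
```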

## Output
![Output-View](https://github.com/rahulsamant37/AI-Scraper/blob/main/data/Output.gif)

## ⚙️ Unique Aspects

- **Adaptive Extraction**: Models adjust dynamically to user specifications
- **Multi-Model Support**: Flexible AI model selection
- **Transparent Token Tracking**: Detailed usage and cost insights

## 🚀 Future Enhancements

- Transitioning to a scalable backend built with FastAPI
- Leveraging LangGraph for graph-based AI workflows

## 📚 Learning Resources

- Web Scraping: the **John Watson Rooney** YouTube channel
- LangChain & AI: **Krish Naik** Sir's Udemy course
- Documentation: the ultimate teacher!

## 🔧 Installation

```bash
# Clone the repository
git clone https://github.com/rahulsamant37/AI-Scraper.git
cd AI-Scraper

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys

# Run the application
streamlit run app.py
```

## Resources Followed

- Mr. Krish Naik for his comprehensive AI courses
- John Watson Rooney for web scraping tutorials
- Fellow interns for their collaboration and support

## 📜 License
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

## 🙏 Acknowledgments
I want to express my sincere gratitude to:
### Infosys Springboard Team

- The mentors who provided invaluable guidance throughout the internship
- The technical team for their support in overcoming challenges
- The program coordinators for organizing this learning opportunity

### Technical Community

- The open-source community for providing excellent tools and libraries
- Stack Overflow contributors for their helpful solutions
- GitHub community for code examples and inspiration

## 🤝 Connect With Me

I'd love to hear your thoughts and suggestions! Feel free to connect and share your ideas.

## Contact Information
For questions or collaboration opportunities:

[![Email](https://img.shields.io/badge/Email-D14836?style=for-the-badge&logo=gmail&logoColor=white)](mailto:rahulsamantcoc2@gmail.com) [![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/rahulsamant37/) [![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/rahul-samant-kb37/)