https://github.com/raghu6798/browser_scrape_mcp
- Host: GitHub
- URL: https://github.com/raghu6798/browser_scrape_mcp
- Owner: Raghu6798
- Created: 2025-04-19T14:14:06.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-05-02T10:18:17.000Z (5 months ago)
- Last Synced: 2025-06-17T13:05:30.328Z (4 months ago)
- Language: Python
- Size: 103 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
# Browser Automation Agent
A browser automation tool built on MCP (Model Context Protocol) that combines web scraping with LLM-powered processing. The agent can search Google, navigate to webpages, and scrape content from a range of websites, including GitHub, Stack Overflow, and documentation sites.
## Features
- **Google Search Integration**: Finds and retrieves top search results for any query
- **Intelligent Web Scraping**: Tailored scraping strategies for different website types:
  - GitHub repositories
  - Stack Overflow questions and answers
  - Documentation pages
  - Generic websites
- **AI-Powered Processing**: Uses Mistral AI for understanding and processing scraped content
- **Stealth Mode**: Implements browser fingerprint protection to avoid detection
- **Content Saving**: Automatically saves both screenshots and text content from scraped pages
## Architecture
This project uses a client-server architecture powered by MCP:
- **Server**: Handles browser automation and web scraping tasks
- **Client**: Provides the AI interface using Mistral AI and LangGraph
- **Communication**: Uses stdio for client-server communication
## Requirements
- Python 3.8+
- Playwright
- MCP (Model Context Protocol)
- Mistral AI API key
## Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/browser-automation-agent.git
cd browser-automation-agent
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Install Playwright browsers:
```bash
playwright install
```
4. Create a `.env` file in the project root and add your Mistral AI API key:
```
MISTRAL_API_KEY=your_api_key_here
```
## Usage
### Running the Server
```bash
python main.py
```
### Running the Client
```bash
python client.py
```
### Sample Interaction
Once both the server and client are running:
1. Enter your query when prompted
2. The agent will:
   - Search Google for relevant results
   - Navigate to the top result
   - Scrape content based on the website type
   - Save screenshots and content to files
   - Return processed information
## Tool Functions
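As a rough illustration of how `browse_and_scrape` could route a URL to one of the scrapers described in this section, here is a minimal sketch. The `pick_scraper` helper, the host table, and the documentation hints are hypothetical and are not taken from the project's code; only the tool names come from this README.

```python
from urllib.parse import urlparse

# Hypothetical routing table: site domain -> scraper tool name from this README.
SCRAPER_BY_HOST = {
    "github.com": "scrape_github",
    "stackoverflow.com": "scrape_stackoverflow",
}

# Hypothetical substrings suggesting that a URL points at documentation.
DOC_HINTS = ("docs.", "/docs/", "readthedocs.io", "/documentation/")


def pick_scraper(url: str) -> str:
    """Return the name of the scraper tool that would handle `url`."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Exact domain or any subdomain of a known site.
    for domain, tool in SCRAPER_BY_HOST.items():
        if host == domain or host.endswith("." + domain):
            return tool

    # Fall back to a simple heuristic on host + path for documentation sites.
    target = host + parsed.path.lower()
    if any(hint in target for hint in DOC_HINTS):
        return "scrape_documentation"

    return "scrape_generic"
```

For example, `pick_scraper("https://github.com/raghu6798/browser_scrape_mcp")` selects `scrape_github`, while an unrecognized site falls through to `scrape_generic`. Substring hints are deliberately crude; the real server may classify pages differently.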
### `get_top_google_url`
Searches Google and returns the top result URL for a given query.
### `browse_and_scrape`
Navigates to a URL and scrapes content based on the website type.
### `scrape_github`
Specializes in extracting README content and code blocks from GitHub repositories.
### `scrape_stackoverflow`
Extracts questions, answers, comments, and code blocks from Stack Overflow pages.
### `scrape_documentation`
Optimized for extracting documentation content and code examples.
### `scrape_generic`
Extracts paragraph text and code blocks from generic websites.
## File Structure
```
browser-automation-agent/
├── main.py           # MCP server implementation
├── client.py         # Mistral AI client implementation
├── requirements.txt  # Project dependencies
├── .env              # Environment variables (API keys)
└── README.md         # Project documentation
```
## Output Files
The agent generates two types of output files with timestamps:
- `final_page_YYYYMMDD_HHMMSS.png`: Screenshot of the final page state
- `scraped_content_YYYYMMDD_HHMMSS.txt`: Extracted text content from the page
## Customization
You can modify the following parameters in the code:
- Browser window size: Adjust `width` and `height` in `browse_and_scrape`
- Headless mode: Set `headless=True` for invisible browser operation
- Number of Google results: Change `num_results` in `get_top_google_url`
## Troubleshooting
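Some of the issues listed in this section can be ruled out with a quick check before starting the client. The sketch below is not part of the project: `read_env_file` and `preflight` are hypothetical helpers that use only the standard library, the file names come from the File Structure section, and the project itself may load `.env` differently.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    path = Path(path)
    values = {}
    if not path.is_file():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def preflight(project_dir, environ=None):
    """Return a list of problems found; an empty list means every check passed."""
    environ = os.environ if environ is None else environ
    root = Path(project_dir)
    problems = []

    # Look for the key in the process environment first, then in .env.
    key = environ.get("MISTRAL_API_KEY") or read_env_file(root / ".env").get(
        "MISTRAL_API_KEY", ""
    )
    if not key.strip() or key == "your_api_key_here":
        problems.append("MISTRAL_API_KEY is missing or still the placeholder")

    # The server and client scripts should both be in the project root.
    for name in ("main.py", "client.py"):
        if not (root / name).is_file():
            problems.append(f"{name} not found in {root}")

    return problems
```

Running `print(preflight("."))` from the project root prints an empty list when the key and both scripts are in place. It does not check the Playwright browser installation, which still needs `playwright install`.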
- **Connection Issues**: Ensure both server and client are running in separate terminals
- **Playwright Errors**: Make sure browsers are installed with `playwright install`
- **API Key Errors**: Verify your Mistral API key is correctly set in the `.env` file
- **Path Errors**: Update the path to `main.py` in `client.py` if needed
## License
[MIT License](LICENSE)
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
---
Built with MCP, Playwright, and Mistral AI