https://github.com/davidyen1124/ai-crawler
AI web scraper using GPT to dynamically optimize CSS selectors for reliable data extraction.
- Host: GitHub
- URL: https://github.com/davidyen1124/ai-crawler
- Owner: davidyen1124
- License: MIT
- Created: 2024-07-27T18:05:41.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-07-27T18:09:55.000Z (9 months ago)
- Last Synced: 2025-01-05T04:38:23.971Z (4 months ago)
- Topics: ai, automation, css-selector, gpt, nodejs, openai, playwright, scraping
- Language: JavaScript
- Homepage:
- Size: 7.81 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# AI-Powered Web Scraper
This project is an AI-powered web scraper that uses OpenAI's GPT model to dynamically analyze and optimize CSS selectors for reliable web scraping.
## Features
- Dynamic CSS selector optimization using AI
- Visual feedback with highlighted elements in the browser
- Automatic screenshot capture for AI analysis
- Simplified DOM tree structure analysis
- Configurable scraping goals
## Prerequisites
- Node.js (v14 or later recommended)
- An OpenAI API key
## Installation
1. Clone the repository:
```
git clone https://github.com/davidyen1124/ai-crawler.git
cd ai-crawler
```
2. Install dependencies:
```
npm install
```
3. Create a `config.js` file in the root directory with your OpenAI API key:
```javascript
module.exports = {
  OPENAI_API_KEY: 'your-api-key-here',
  MODEL: 'gpt-4o-mini'
}
```
## Usage
To start the web scraper, run:
```
node crawler.js
```
You can modify the `scrapingGoal` and target URL in the `crawler.js` file to customize the scraping task.
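The exact shape of these settings in `crawler.js` is not shown in this README, so the following is only a rough sketch of what a customized task and the refinement loop around it might look like. Every name here (`task`, `targetUrl`, `initialSelector`, `refineSelector`, `stubAnalyze`, `maxIterations`) and the example URL and selectors are assumptions made for illustration, not identifiers from the repository. The OpenAI call is replaced by a stub so the snippet runs without an API key or a browser.

```javascript
// Hypothetical task description. These field names are illustrative
// assumptions, not the actual variables used in crawler.js.
const task = {
  targetUrl: 'https://example.com/products',
  scrapingGoal: 'Extract the title of every product on the page',
  initialSelector: 'div'
}

// Runs a bounded refinement loop. `analyze` stands in for the AI call:
// it receives the goal and the current selector and resolves to
// { selector, isOptimal }.
async function refineSelector(task, analyze, maxIterations = 5) {
  let selector = task.initialSelector
  for (let i = 0; i < maxIterations; i++) {
    const suggestion = await analyze({ goal: task.scrapingGoal, selector })
    // Stop when the analyzer is satisfied or proposes nothing new.
    if (suggestion.isOptimal || suggestion.selector === selector) {
      return suggestion.selector
    }
    selector = suggestion.selector
  }
  // Iteration budget exhausted: return the best selector found so far.
  return selector
}

// Stub analyzer for demonstration: narrows the selector one step per call.
const steps = ['div.product', 'div.product h2.title']
async function stubAnalyze({ selector }) {
  const next = steps[steps.indexOf(selector) + 1]
  return next
    ? { selector: next, isOptimal: false }
    : { selector, isOptimal: true }
}

// Prints "div.product h2.title"
refineSelector(task, stubAnalyze).then((selector) => console.log(selector))
```

In a real run, the `analyze` argument would presumably wrap the screenshot capture, DOM summary, and OpenAI request described under How it Works. Passing it in as a parameter keeps the loop testable without network access, and the `maxIterations` cap is an added safeguard against a model that never reports an optimal selector.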
## How it Works
1. The scraper starts with an initial CSS selector and loads the target webpage.
2. It captures a screenshot and analyzes the DOM structure.
3. The AI model analyzes the current selector, screenshot, and DOM structure to suggest optimizations.
4. The process repeats until the AI determines the selector is optimal or no further improvements can be made.
5. Finally, the scraper extracts the desired information using the optimized selector.
## Files
- `crawler.js`: Main script that controls the web scraping process.
- `openai.js`: Handles interactions with the OpenAI API for selector analysis.
- `config.js`: Contains configuration settings (API key, model name).
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License.