https://github.com/oussemabenhassena5/crawl4deepseek
Crawl4DeepSeek = Crawl4AI + DeepSeek 🚀 Smart, efficient, and built for deep web exploration! 🚀🤖
- Host: GitHub
- URL: https://github.com/oussemabenhassena5/crawl4deepseek
- Owner: oussemabenhassena5
- License: mit
- Created: 2025-02-05T10:39:48.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-02-09T13:54:12.000Z (3 months ago)
- Last Synced: 2025-02-15T06:36:30.377Z (3 months ago)
- Topics: crawl4ai, deepseek, python, webcrawling, webscraping
- Language: Python
- Homepage:
- Size: 23.4 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# 🕸️ DeepSeek Crawler
[Python](https://www.python.org/downloads/)
[Crawl4AI](https://docs.crawl4ai.com/)
[DeepSeek](https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file)
[License: MIT](LICENSE)

> *Unleashing AI-Powered Web Scraping at Scale* 🚀
## 🎯 What Makes This Special
DeepSeek Crawler represents the next generation of web scraping, combining the power of Python's asyncio with DeepSeek's AI capabilities to transform chaotic web data into structured intelligence. Built for performance, scalability, and precision.
## 🔥 Key Features
### Intelligence Layer
- **Smart Pagination**: Autonomous detection of result boundaries and page termination
- **Duplicate Prevention**: Intelligent tracking of seen venues using efficient set operations
- **Polite Crawling**: Built-in rate limiting with configurable sleep intervals
- **Robust Error Handling**: Graceful handling of no-results scenarios (see the sketch below)
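
A minimal sketch of the dedup, politeness, and termination pattern described above. The names (`fetch_page`, `seen_venues`) and the sleep interval are illustrative assumptions, not the repository's actual code:

```python
import asyncio

# Illustrative sketch: set-based duplicate tracking, a polite delay between
# requests, and graceful termination when a page returns no results.
seen_venues: set[str] = set()   # hypothetical dedup store
SLEEP_INTERVAL = 2.0            # hypothetical politeness delay (seconds)

async def crawl_pages(fetch_page, max_pages: int = 10) -> list[dict]:
    results: list[dict] = []
    for page in range(1, max_pages + 1):
        venues = await fetch_page(page)   # fetch_page is a stand-in coroutine
        if not venues:                    # no results: stop crawling gracefully
            break
        for venue in venues:
            if venue["name"] in seen_venues:
                continue                  # O(1) duplicate check via the set
            seen_venues.add(venue["name"])
            results.append(venue)
        await asyncio.sleep(SLEEP_INTERVAL)  # be polite between page fetches
    return results
```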

### Engineering Excellence

- **Asynchronous Architecture**: Built on Python's asyncio for maximum performance
- **Modular Design**: Clean separation of concerns with utility modules
- **Session Management**: Persistent crawling sessions with automatic cleanup
- **CSV Export**: Structured data output with comprehensive venue information (see the sketch below)
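
To make the CSV export point concrete, here is a minimal sketch; the field names are assumptions, not the project's actual schema:

```python
import csv

# Hypothetical venue fields; the real project may export different columns.
FIELDNAMES = ["name", "address", "rating"]

def save_venues_to_csv(venues: list[dict], path: str = "venues.csv") -> None:
    """Write extracted venue records to a structured CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(venues)
```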

## 🏗️ Architecture

```mermaid
graph TD
A[Main Crawler] --> B[AsyncWebCrawler]
B --> C[Page Processor]
C --> D[LLM Strategy]
D --> E[Data Exporter]
B --> F[Browser Config]
C --> G[Data Utils]
G --> E
```

## 💻 Technical Implementation
### Core Components
- **AsyncWebCrawler**: High-performance asynchronous crawling engine
- **LLM Strategy**: AI-powered content extraction and processing
- **Browser Configuration**: Customizable crawler behavior settings
- **Data Utilities**: Robust data processing and export functionality (a combined sketch of these components follows)
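
A hedged sketch of how these components typically wire together in Crawl4AI. The class names follow recent Crawl4AI releases, but exact signatures vary across versions, and the Groq model id and target URL are assumptions:

```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Assumption: DeepSeek is served through Groq, matching the GROQ_API_KEY
# used in the Quick Start below; the model id is illustrative.
llm_strategy = LLMExtractionStrategy(
    provider="groq/deepseek-r1-distill-llama-70b",
    api_token=os.getenv("GROQ_API_KEY"),
    instruction="Extract each venue's name, address, and rating as JSON.",
)

async def main() -> None:
    browser_cfg = BrowserConfig(headless=True)              # browser behavior
    run_cfg = CrawlerRunConfig(extraction_strategy=llm_strategy)
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_cfg)
        print(result.extracted_content)                     # structured output
    llm_strategy.show_usage()  # built-in token usage report (recent versions)

if __name__ == "__main__":
    asyncio.run(main())
```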

### Performance Features

- **Efficient Memory Usage**: Set-based duplicate detection
- **Controlled Crawling**: Configurable delay between requests
- **Graceful Termination**: Smart detection of crawl completion
- **Usage Statistics**: Built-in LLM strategy usage tracking

## 🚀 Quick Start

1. **Clone & Setup**:
```bash
git clone https://github.com/oussemabenhassena5/Crawl4DeepSeek.git
cd Crawl4DeepSeek
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
```

2. **Configure Environment**:
```bash
# .env file
GROQ_API_KEY=your_api_key
```

3. **Launch Crawler**:
```bash
python crawler.py
```

## 📁 Project Structure
```
crawl4deepseek/
βββ crawler.py # Main crawling script
βββ config.py # Configuration settings
βββ utils/
β βββ data_utils.py # Data processing utilities
β βββ scraper_utils.py # Crawling utility functions
βββ requirements.txt # Project dependencies
βββ .env # Environment configuration
```

## 🛠️ Engineering Highlights
- **Async Processing**: Efficient handling of concurrent page fetches
- **Smart State Management**: Tracking of seen venues and crawl progress
- **Configurable Behavior**: Easy-to-modify crawler settings (illustrated by the sketch after this list)
- **Comprehensive Logging**: Detailed crawl progress and statistics
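
To illustrate the configurable-behavior point, here is a hypothetical `config.py`; every constant name and value here is an assumption about what such a module might expose, not the repository's actual settings:

```python
# config.py (illustrative sketch; names and values are assumptions)
BASE_URL = "https://example.com/venues?page={page}"  # hypothetical target site
CSS_SELECTOR = ".venue-card"                         # hypothetical card selector
REQUIRED_KEYS = ["name", "address", "rating"]        # fields a record must have
SLEEP_INTERVAL = 2.0                                 # seconds between requests
MAX_PAGES = 10                                       # upper bound on pagination
```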

## 🔄 Development Workflow

The crawler follows a systematic approach:
1. Initializes configurations and strategies
2. Processes pages asynchronously
3. Checks for duplicate venues
4. Exports structured data
5. Provides usage statistics (see the end-to-end sketch below)
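
As a self-contained illustration of those five steps, here is an end-to-end sketch with stubbed helpers; none of the names are the repository's actual functions:

```python
import asyncio

async def fetch_page(page: int) -> list[dict]:
    """Stand-in fetch; a real run would call Crawl4AI here."""
    return [] if page > 2 else [{"name": f"Venue {page}", "rating": 4.5}]

async def main() -> None:
    seen: set[str] = set()                        # 1. initialize crawl state
    venues: list[dict] = []
    page = 1
    while batch := await fetch_page(page):        # 2. process pages in turn
        for v in batch:
            if v["name"] not in seen:             # 3. skip duplicate venues
                seen.add(v["name"])
                venues.append(v)
        page += 1
    print(f"Exported {len(venues)} venues")       # 4. export (stubbed out)
    print(f"Processed {page - 1} pages")          # 5. report usage statistics

asyncio.run(main())
```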

## 🎯 Future Roadmap

- [ ] Enhanced error recovery mechanisms
- [ ] Multi-site crawling support
- [ ] Advanced data validation
- [ ] Performance optimization for large-scale crawls

## 🤝 Contributing

Contributions are welcome! Feel free to submit issues and pull requests.
## 📄 License
Distributed under the MIT License. See `LICENSE` for more information.
---
**Built with 💻 by [Oussema Ben Hassena](https://github.com/oussemabenhassena5)**
*Transforming Web Data into Intelligence*
[LinkedIn](https://www.linkedin.com/in/oussema-ben-hassena-b445122a4)