https://github.com/firstflush/weaver
Async web scraping tool for HTTP and browser-based scraping
https://github.com/firstflush/weaver
curl curl-cffi etl-pipeline playwright playwright-python proxy python3 scraper tls-fingerprinting web webdriver webscraper webscraping webscraping-data
Last synced: 3 months ago
JSON representation
Async web scraping tool for HTTP and browser-based scraping
- Host: GitHub
- URL: https://github.com/firstflush/weaver
- Owner: FirstFlush
- License: mit
- Created: 2025-09-13T03:08:14.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2025-09-23T18:30:28.000Z (4 months ago)
- Last Synced: 2025-09-23T20:27:44.886Z (4 months ago)
- Topics: curl, curl-cffi, etl-pipeline, playwright, playwright-python, proxy, python3, scraper, tls-fingerprinting, web, webdriver, webscraper, webscraping, webscraping-data
- Language: Python
- Homepage:
- Size: 57.6 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Weaver
A modern, async-first web scraping framework for Python that combines the speed of HTTP requests with the power of browser automation.
## Why Weaver?
- **Unified Interface**: One framework, two scraping modes - use HTTP requests for speed or browser automation for JavaScript-heavy sites
- **Fully Async**: Native asyncio support across HTTP requests and browser automation for maximum concurrency
- **Anti-Detection**: Stealth capabilities to help you blend in
- **Flexible**: Mix and match HTTP and browser scraping within the same spider
- **Proxy Integration**: Built-in support for static and rotating proxy configurations
## Quick Start
```python
from weaver import BaseSpider, BrowserConfig
class BlogSpider(BaseSpider):
def run(self):
# Your scraping logic here
pass
# Browser-based scraping
browser_config = BrowserConfig()
with BlogSpider(browser_config=browser_config) as spider:
spider.run()
```
## Features
- **HTTP Client**: Fast async requests using aiohttp
- **Browser Client**: Full browser automation with Playwright
- **Proxy Support**: Rotate through proxies seamlessly
- **Stealth Mode**: Anti-detection capabilities
- **Context Management**: Proper cleanup of resources automatically
## Installation
```bash
# Install the package
pip install weaver # Coming soon
# Install browser binaries (required for browser automation)
playwright install
# Or install only Chromium to save space (~100MB vs ~300MB)
playwright install chromium
```
## Basic Usage
### HTTP-Only Scraping
```python
from weaver import BaseSpider, HttpConfig
class FastSpider(BaseSpider):
def run(self):
# Use self.http_client for requests
pass
http_config = HttpConfig()
with FastSpider(http_config=http_config) as spider:
spider.run()
```
### Browser Automation
```python
from weaver import BaseSpider, BrowserConfig
class BrowserSpider(BaseSpider):
def run(self):
# Use self.browser_client for Playwright
pass
browser_config = BrowserConfig()
with BrowserSpider(browser_config=browser_config) as spider:
spider.run()
```
### Hybrid Scraping
```python
from weaver import BaseSpider, HttpConfig, BrowserConfig
class HybridSpider(BaseSpider):
def run(self):
# Use both self.http_client and self.browser_client
pass
http_config = HttpConfig()
browser_config = BrowserConfig()
with HybridSpider(http_config=http_config, browser_config=browser_config) as spider:
spider.run()
```
## Development Status
⚠️ **Early Development**: Weaver is in active development. APIs may change frequently. Not recommended for production use yet.
## Requirements
- Python 3.10+
- aiohttp
- playwright
## Contributing
This project is in early stages. Contributions, ideas, and feedback are welcome!
## License
MIT License