https://github.com/firstflush/weaver

Async web scraping tool for HTTP and browser-based scraping

curl curl-cffi etl-pipeline playwright playwright-python proxy python3 scraper tls-fingerprinting web webdriver webscraper webscraping webscraping-data

Last synced: 4 months ago


Host: GitHub
URL: https://github.com/firstflush/weaver
Owner: FirstFlush
License: mit
Created: 2025-09-13T03:08:14.000Z (5 months ago)
Default Branch: master
Last Pushed: 2025-09-23T18:30:28.000Z (4 months ago)
Last Synced: 2025-09-23T20:27:44.886Z (4 months ago)
Topics: curl, curl-cffi, etl-pipeline, playwright, playwright-python, proxy, python3, scraper, tls-fingerprinting, web, webdriver, webscraper, webscraping, webscraping-data
Language: Python
Homepage:
Size: 57.6 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


README

# Weaver

A modern, async-first web scraping framework for Python that combines the speed of HTTP requests with the power of browser automation.

## Why Weaver?

- **Unified Interface**: One framework, two scraping modes: HTTP requests for speed, or browser automation for JavaScript-heavy sites

- **Fully Async**: Native asyncio support across HTTP requests and browser automation for maximum concurrency

- **Anti-Detection**: Stealth capabilities to help you blend in

- **Flexible**: Mix and match HTTP and browser scraping within the same spider

- **Proxy Integration**: Built-in support for static and rotating proxy configurations
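
As a rough illustration of the concurrency that the "Fully Async" point refers to, here is a minimal sketch using only the standard library. It does not use Weaver's API: `fetch`, `crawl`, and `max_concurrency` are hypothetical names, and the network call is simulated with `asyncio.sleep`.

```python
import asyncio


async def fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP or browser fetch."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"<html>{url}</html>"


async def crawl(urls: list[str], max_concurrency: int = 5) -> list[str]:
    """Fetch every URL concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> str:
        async with semaphore:  # wait here if the limit is reached
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(10)]
pages = asyncio.run(crawl(urls))
```

Capping the number of in-flight requests with a semaphore is one common way to stay concurrent without overwhelming the target site or a proxy pool.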

## Quick Start

```python
from weaver import BaseSpider, BrowserConfig


class BlogSpider(BaseSpider):
    def run(self):
        # Your scraping logic here
        pass


# Browser-based scraping
browser_config = BrowserConfig()

with BlogSpider(browser_config=browser_config) as spider:
    spider.run()
```
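
The `with` statement above relies on the context-manager protocol. How Weaver implements it is not shown in this README, so the following is only a sketch of the general pattern with a hypothetical `ResourceSpider` class (not Weaver's `BaseSpider`): resources are acquired on entry and released on exit, even if `run()` raises.

```python
class ResourceSpider:
    """Hypothetical stand-in that illustrates the pattern; not Weaver's BaseSpider."""

    def __init__(self) -> None:
        self.is_open = False

    def __enter__(self) -> "ResourceSpider":
        self.is_open = True  # acquire resources (e.g. start a browser)
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.is_open = False  # always release, even if run() raised
        return False  # do not suppress exceptions

    def run(self) -> str:
        return "scraped"


with ResourceSpider() as spider:
    open_during_run = spider.is_open
    result = spider.run()
```

After the `with` block exits, `spider.is_open` is `False` again, which is the cleanup guarantee the pattern is meant to provide.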

## Features

- **HTTP Client**: Fast async requests using aiohttp

- **Browser Client**: Full browser automation with Playwright  

- **Proxy Support**: Rotate through proxies seamlessly

- **Stealth Mode**: Anti-detection capabilities

- **Context Management**: Proper cleanup of resources automatically
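
To make the proxy rotation idea concrete, here is a minimal round-robin sketch. It is an assumption about how rotation could work in general, not Weaver's implementation: `ProxyRotator`, `next_proxy`, and the proxy URLs are all hypothetical.

```python
from itertools import cycle


class ProxyRotator:
    """Hypothetical round-robin proxy pool; not a Weaver class."""

    def __init__(self, proxies: list[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)  # endless iterator over the list

    def next_proxy(self) -> str:
        """Return the next proxy, wrapping around at the end of the list."""
        return next(self._pool)


rotator = ProxyRotator([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
])
assigned = [rotator.next_proxy() for _ in range(3)]
```

With two proxies, three consecutive requests would be assigned proxy A, proxy B, then proxy A again.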

## Installation

```bash
# Install the package
pip install weaver  # Coming soon

# Install browser binaries (required for browser automation)
playwright install

# Or install only Chromium to save space (~100MB vs ~300MB)
playwright install chromium
```

## Basic Usage

### HTTP-Only Scraping

```python
from weaver import BaseSpider, HttpConfig


class FastSpider(BaseSpider):
    def run(self):
        # Use self.http_client for requests
        pass


http_config = HttpConfig()

with FastSpider(http_config=http_config) as spider:
    spider.run()
```

### Browser Automation

```python
from weaver import BaseSpider, BrowserConfig


class BrowserSpider(BaseSpider):
    def run(self):
        # Use self.browser_client for Playwright
        pass


browser_config = BrowserConfig()

with BrowserSpider(browser_config=browser_config) as spider:
    spider.run()
```

### Hybrid Scraping

```python
from weaver import BaseSpider, HttpConfig, BrowserConfig


class HybridSpider(BaseSpider):
    def run(self):
        # Use both self.http_client and self.browser_client
        pass


http_config = HttpConfig()
browser_config = BrowserConfig()

with HybridSpider(http_config=http_config, browser_config=browser_config) as spider:
    spider.run()
```
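
One way a hybrid spider might decide between the two clients is to try HTTP first and fall back to the browser only when the response looks JavaScript-rendered. The sketch below shows that decision logic with stub clients. Everything in it is hypothetical: `StubHttpClient`, `StubBrowserClient`, `needs_browser`, and `fetch_hybrid` are illustrative names, and the methods on Weaver's real `http_client` and `browser_client` may look different.

```python
import asyncio


class StubHttpClient:
    """Hypothetical HTTP client that returns canned HTML."""

    async def get(self, url: str) -> str:
        if url.endswith("/app"):
            return "<div id='root'></div>"  # empty shell; content needs JS
        return "<h1>Static article</h1>"


class StubBrowserClient:
    """Hypothetical browser client that returns rendered HTML."""

    async def render(self, url: str) -> str:
        return "<div id='root'><h1>Rendered app</h1></div>"


def needs_browser(html: str) -> bool:
    """Crude heuristic: no heading means the content is probably JS-rendered."""
    return "<h1>" not in html


async def fetch_hybrid(url: str, http, browser) -> tuple[str, str]:
    """Try the cheap HTTP path first; fall back to the browser if needed."""
    html = await http.get(url)
    if needs_browser(html):
        return "browser", await browser.render(url)
    return "http", html


async def main() -> list[tuple[str, str]]:
    http, browser = StubHttpClient(), StubBrowserClient()
    targets = ("https://example.com/blog", "https://example.com/app")
    return [await fetch_hybrid(u, http, browser) for u in targets]


results = asyncio.run(main())
```

Here the blog page is served from the HTTP path, while the app page falls through to the browser path. A real heuristic would need to be tuned per site.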

## Development Status

⚠️ **Early Development**: Weaver is in active development. APIs may change frequently. Not recommended for production use yet.

## Requirements

- Python 3.10+

- aiohttp

- playwright

## Contributing

This project is in early stages. Contributions, ideas, and feedback are welcome!

## License

MIT License
