https://github.com/stevieflyer/quokka

An easy-to-use web crawler framework, supporting parallel crawling without a line of code and headless running.

crawler parallel web-automation

Host: GitHub
URL: https://github.com/stevieflyer/quokka
Owner: stevieflyer
License: mit
Created: 2023-09-05T03:57:41.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2023-09-09T08:07:31.000Z (over 1 year ago)
Last Synced: 2024-01-27T17:03:11.004Z (about 1 year ago)
Topics: crawler, parallel, web-automation
Language: Python
Homepage:
Size: 28.3 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Quokka - Browser Automation Library with Playwright

Quokka is a powerful Python library built on top of Playwright, designed to simplify browser automation and manipulation tasks. It provides a convenient facade for various browser interactions, making it easier to navigate web pages, extract data, and interact with page elements.

## Key Features

- **Asynchronous and Parallel Execution:** Quokka is asynchronous throughout. Built on Playwright, it runs multiple processes, each driving a single coroutine, so crawls execute in parallel. Given enough CPU cores and memory, this design suits both I/O-bound and CPU-bound workloads.

- **Multi-threaded Crawling with Ease:** Quokka's `BaseCrawler` class provides a crawler template, so a crawler written for single-threaded use can be converted to a multi-threaded one without rewriting its crawling logic.

- **Easy Browser Management:** Quokka's `Agent` class provides a streamlined interface for managing browser instances, including starting, stopping, and page navigation.

- **Data Extraction:** With the `data_extractor` module, Quokka allows you to extract data from web pages using customizable selectors and extraction patterns.

- **Page Interaction:** The `page_interactor` module lets you interact with web page elements by clicking, typing, and scrolling.

- **Custom Hooks:** Quokka supports customizable hooks, allowing you to extend and customize the behavior of the `Agent` class to fit your specific needs.

- **Extensible:** Quokka exposes Playwright's `playwright` and `page` instances, enabling users to extend the library's functionality as required.
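
To make the "multiple processes, each containing a single coroutine" idea concrete, here is a minimal sketch using only the Python standard library. It is not Quokka's implementation: `crawl_one`, `run_worker`, and `crawl_in_parallel` are hypothetical names, and the "crawl" is a placeholder sleep rather than real browser work.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


async def crawl_one(url: str) -> dict:
    """Hypothetical stand-in for a real crawl: pretend to fetch one URL."""
    await asyncio.sleep(0.01)  # placeholder for browser I/O
    return {"url": url, "length": len(url)}


def run_worker(url: str) -> dict:
    """Entry point for one worker process: run a single coroutine to completion."""
    return asyncio.run(crawl_one(url))


def crawl_in_parallel(urls: list[str], max_workers: int = 2) -> list[dict]:
    """Fan the URLs out over worker processes; results keep the input order."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_worker, urls))
```

Call `crawl_in_parallel([...])` from under an `if __name__ == "__main__":` guard, so that worker processes can re-import the module safely on platforms that spawn rather than fork.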

## Installation

```bash
pip install quokka-web
```

## Getting Started

Quokka's intuitive API makes browser automation a straightforward process. Here's a simple example:

```python
import asyncio

from quokka_web import Agent


async def main():
    agent = await Agent.instantiate(headless=True)
    await agent.start()

    # Your automation code here

    await agent.stop()


if __name__ == "__main__":
    asyncio.run(main())
```
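
In the example above, `stop()` is skipped if the automation code raises. One way to guard against that is an async context manager. The sketch below assumes only that an agent has awaitable `start()` and `stop()` methods, as the example shows; `FakeAgent` and `running` are hypothetical names used for illustration, not part of Quokka.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeAgent:
    """Hypothetical stand-in exposing only start() and stop()."""

    def __init__(self) -> None:
        self.running = False

    async def start(self) -> None:
        self.running = True

    async def stop(self) -> None:
        self.running = False


@asynccontextmanager
async def running(agent):
    """Start the agent, hand it to the caller, and always stop it afterwards."""
    await agent.start()
    try:
        yield agent
    finally:
        await agent.stop()


async def main() -> bool:
    agent = FakeAgent()
    try:
        async with running(agent):
            raise RuntimeError("simulated failure in automation code")
    except RuntimeError:
        pass
    # stop() ran despite the exception, so the agent is no longer running
    return agent.running
```

The same `running(...)` wrapper should work with any object that offers the two methods, which keeps the cleanup logic in one place.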

## Documentation

For detailed usage instructions, examples, and customization options, please refer to the [Documentation](link_to_documentation).

## Examples

Base Crawler Example:

```python
import asyncio

from quokka_web import BaseCrawler, Debugger


class MyCrawler(BaseCrawler):
    async def _crawl(self, *args, **kwargs):
        # Core crawling logic using browser_agent
        ...


async def main():
    crawler = await MyCrawler.instantiate(debug_tool=Debugger(verbose=True))
    await crawler.start()
    await crawler.crawl()
    await crawler.stop()


if __name__ == "__main__":
    asyncio.run(main())
```
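
The template idea behind `_crawl` can be sketched without Quokka: a base class owns the scheduling, and a subclass writes only the per-target logic. The sketch below uses asyncio tasks in a single process, which is an assumption made for brevity and not necessarily how `BaseCrawler` parallelizes work. `SketchCrawler` and `EchoCrawler` are hypothetical names.

```python
import asyncio
from abc import ABC, abstractmethod


class SketchCrawler(ABC):
    """Hypothetical template: subclasses supply _crawl, the base class schedules it."""

    @abstractmethod
    async def _crawl(self, target: str) -> str:
        """Crawl a single target; the only method a subclass must write."""

    async def crawl(self, targets: list[str], concurrency: int = 1) -> list[str]:
        """Run _crawl over all targets, at most `concurrency` at a time."""
        limit = asyncio.Semaphore(concurrency)

        async def guarded(target: str) -> str:
            async with limit:
                return await self._crawl(target)

        # gather returns results in the same order as `targets`
        return await asyncio.gather(*(guarded(t) for t in targets))


class EchoCrawler(SketchCrawler):
    async def _crawl(self, target: str) -> str:
        await asyncio.sleep(0.01)  # placeholder for page navigation
        return target.upper()
```

With `concurrency=1` the targets are processed one after another; raising the value changes only the scheduling, not the subclass.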

## Contributing

Contributions to Quokka are welcome! Please read our [Contribution Guidelines](link_to_contribution_guidelines) for more information on how to contribute to the project.

## License

This project is licensed under the [MIT License](link_to_license).