Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/stevieflyer/quokka
An easy-to-use web crawler framework, supporting parallel crawling without a line of code and headless running.
crawler parallel web-automation
- Host: GitHub
- URL: https://github.com/stevieflyer/quokka
- Owner: stevieflyer
- License: mit
- Created: 2023-09-05T03:57:41.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-09-09T08:07:31.000Z (over 1 year ago)
- Last Synced: 2024-01-27T17:03:11.004Z (11 months ago)
- Topics: crawler, parallel, web-automation
- Language: Python
- Homepage:
- Size: 28.3 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Quokka - Browser Automation Library with Playwright
Quokka is a powerful Python library built on top of Playwright, designed to simplify browser automation and manipulation tasks. It provides a convenient facade for various browser interactions, making it easier to navigate web pages, extract data, and interact with page elements.
## Key Features
- **Asynchronous and Parallel Execution:** Quokka is fully asynchronous. Building on Playwright, it runs multiple processes, each containing a single coroutine, for parallel execution. Given enough CPU cores and memory, this architecture handles both I/O-bound and CPU-bound workloads well.
- **Multi-threaded Crawling with Ease:** Quokka's `BaseCrawler` class serves as a crawler template: build a single-threaded crawler on it, and you can convert that crawler into a multi-threaded one with little effort.
- **Easy Browser Management:** Quokka's `Agent` class provides a streamlined interface for managing browser instances, including starting, stopping, and page navigation.
- **Data Extraction:** With the `data_extractor` module, Quokka allows you to easily extract data from web pages using customizable selectors and extraction patterns.
- **Page Interaction:** The `page_interactor` module enables you to interact with web page elements, such as clicking, typing, and scrolling, making automation tasks a breeze.
- **Custom Hooks:** Quokka supports customizable hooks, allowing you to extend and customize the behavior of the `Agent` class to fit your specific needs.
- **Extensible:** Quokka exposes Playwright's `playwright` and `page` instances, enabling users to extend the library's functionality as required.
## Installation
```bash
pip install quokka-web
```
## Getting Started
Quokka's intuitive API makes browser automation a straightforward process. Here's a simple example:
```python
import asyncio
from quokka_web import Agent

async def main():
    agent = await Agent.instantiate(headless=True)
    await agent.start()
    # Your automation code here
    await agent.stop()

if __name__ == "__main__":
    asyncio.run(main())
```
## Documentation
For detailed usage instructions, examples, and customization options, please refer to the [Documentation](link_to_documentation).
## Examples
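Hook Pattern Sketch:
Key Features mentions customizable hooks on the `Agent` class, but this README does not show the hook API itself. The snippet below is therefore only a generic, standard-library illustration of how async hooks can wrap an action. `HookedRunner`, `add_hook`, `visit`, and the event names `before_visit` and `after_visit` are all hypothetical and are not part of Quokka.

```python
import asyncio
from collections import defaultdict


class HookedRunner:
    """Hypothetical stand-in for an agent-like object; NOT Quokka's `Agent`."""

    def __init__(self):
        # Maps an event name to the async callbacks registered for it.
        self._hooks = defaultdict(list)

    def add_hook(self, event, callback):
        """Register an async callback to run when `event` fires."""
        self._hooks[event].append(callback)

    async def _fire(self, event, payload):
        # Run callbacks in registration order, awaiting each one.
        for callback in self._hooks[event]:
            await callback(payload)

    async def visit(self, url):
        await self._fire("before_visit", url)
        await asyncio.sleep(0)  # placeholder for real page navigation
        await self._fire("after_visit", url)
        return url


async def demo():
    log = []

    async def record_before(url):
        log.append(f"before:{url}")

    async def record_after(url):
        log.append(f"after:{url}")

    runner = HookedRunner()
    runner.add_hook("before_visit", record_before)
    runner.add_hook("after_visit", record_after)
    await runner.visit("https://example.com")
    return log
```

Running `asyncio.run(demo())` returns the log entries in the order the hooks fired. Check the Quokka source for the hook names and signatures that `Agent` really accepts.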
Base Crawler Example:
```python
import asyncio
from quokka_web import BaseCrawler, Debugger

class MyCrawler(BaseCrawler):
    async def _crawl(self, *args, **kwargs):
        # Core crawling logic using browser_agent
        pass

async def main():
    crawler = await MyCrawler.instantiate(debug_tool=Debugger(verbose=True))
    await crawler.start()
    await crawler.crawl()
    await crawler.stop()

if __name__ == "__main__":
    asyncio.run(main())
```
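Process-per-Coroutine Sketch:
Key Features describes running multiple processes, each containing a single coroutine. The snippet below is a minimal standard-library sketch of that idea, assuming the work is a list of URLs split across workers. It is not Quokka's implementation: `split_round_robin`, `crawl_chunk`, `worker`, and `crawl_in_parallel` are hypothetical names, and `crawl_chunk` only simulates I/O instead of driving a browser.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def split_round_robin(items, n_workers):
    """Deal items into n_workers lists, one item at a time."""
    return [items[i::n_workers] for i in range(n_workers)]


async def crawl_chunk(urls):
    """Hypothetical crawl step: a single coroutine handles one chunk."""
    results = []
    for url in urls:
        await asyncio.sleep(0)  # placeholder for awaiting browser I/O
        results.append((url, len(url)))
    return results


def worker(urls):
    """Process entry point: own event loop, one coroutine."""
    return asyncio.run(crawl_chunk(urls))


def crawl_in_parallel(urls, n_workers=2):
    """Run one worker process per non-empty chunk and merge the results."""
    chunks = [c for c in split_round_robin(urls, n_workers) if c]
    if not chunks:
        return []
    with ProcessPoolExecutor(max_workers=len(chunks)) as pool:
        per_worker = list(pool.map(worker, chunks))
    # Flatten the per-process lists into a single result list.
    return [item for chunk in per_worker for item in chunk]
```

Call `crawl_in_parallel(...)` from inside an `if __name__ == "__main__":` guard, since platforms that spawn worker processes re-import the main module.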
## Contributing
Contributions to Quokka are welcome! Please read our [Contribution Guidelines](link_to_contribution_guidelines) for more information on how to contribute to the project.
## License
This project is licensed under the [MIT License](link_to_license).