https://github.com/roniemartinez/dude

# dude uncomplicated data extraction

Dude is a very simple framework for writing web scrapers using Python decorators.
The design, inspired by [Flask](https://github.com/pallets/flask), aims to make it easy to build a web scraper in just a few lines of code.
Dude has an easy-to-learn syntax.

> 🚨 Dude is currently in Pre-Alpha. Please expect breaking changes.

## Installation

To install, run the following from a terminal.

```bash
pip install pydude
playwright install # Install playwright binaries for Chrome, Firefox and Webkit.
```

## Minimal web scraper

The simplest web scraper will look like this:

```python
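from dude import select


# Register get_link() as the handler for every element matching the CSS selector "a".
@select(css="a")
def get_link(element):
    # Return the link target as one record of scraped data.
    return {"url": element.get_attribute("href")}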
```

The example above gets all the [hyperlink](https://en.wikipedia.org/wiki/Hyperlink#HTML) elements in a page and calls the handler function `get_link()` for each element.

## How to run the scraper

You can run your scraper from a terminal by passing the URLs, an output filename of your choice, and the paths to your Python scripts to the `dude scrape` command.

```bash
dude scrape --url "<url>" --output data.json path/to/script.py
```

The output in `data.json` should contain the extracted URLs along with metadata fields, whose names are prefixed with an underscore.

```json5
[
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 0,
    "url": "/url-1.html"
  },
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 1,
    "url": "/url-2.html"
  },
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 2,
    "url": "/url-3.html"
  }
]
```
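
Because the metadata keys all begin with an underscore, they are easy to separate from the scraped fields when post-processing. The sketch below uses only the standard library and an inline copy of two records from the sample output rather than reading `data.json`. `split_record` is a hypothetical helper, not part of Dude, and resolving the relative links against `_page_url` is just one possible follow-up step.

```python
import json
from urllib.parse import urljoin

# Two records copied from the sample output above.
sample = """
[
  {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
   "_group_index": 0, "_element_index": 0, "url": "/url-1.html"},
  {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
   "_group_index": 0, "_element_index": 1, "url": "/url-2.html"}
]
"""


def split_record(record):
    """Return (metadata, data): keys starting with "_" versus everything else."""
    metadata = {k: v for k, v in record.items() if k.startswith("_")}
    data = {k: v for k, v in record.items() if not k.startswith("_")}
    return metadata, data


records = json.loads(sample)

# Scraped fields only, with the metadata stripped.
scraped = [split_record(r)[1] for r in records]

# Relative links resolved against the page they were found on.
absolute_urls = [urljoin(r["_page_url"], r["url"]) for r in records]
```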

Changing the output to `--output data.csv` should result in the following CSV content.

![data.csv](docs/csv.png)

## Features

- Simple [Flask](https://github.com/pallets/flask)-inspired design - build a scraper with decorators.
- Uses [Playwright](https://playwright.dev/python/) API - run your scraper in Chrome, Firefox and Webkit and leverage Playwright's powerful selector engine supporting CSS, XPath, text, regex, etc.
- Data grouping - group related results.
- URL pattern matching - run functions on matched URLs.
- Priority - reorder functions based on priority.
- Setup function - enable setup steps (clicking dialogs or login).
- Navigate function - enable navigation steps to move to other pages.
- Custom storage - option to save data to other formats or databases.
- Async support - write async handlers.
- Option to use other parser backends aside from Playwright:
  - [BeautifulSoup4](https://roniemartinez.github.io/dude/advanced/09_beautifulsoup4.html) - `pip install pydude[bs4]`
  - [Parsel](https://roniemartinez.github.io/dude/advanced/10_parsel.html) - `pip install pydude[parsel]`
  - [lxml](https://roniemartinez.github.io/dude/advanced/11_lxml.html) - `pip install pydude[lxml]`
  - [Selenium](https://roniemartinez.github.io/dude/advanced/13_selenium.html) - `pip install pydude[selenium]`
- Option to follow all links indefinitely (Crawler/Spider).
- Events - attach functions to startup, pre-setup, post-setup and shutdown events.
- Option to save data on every page.

## Supported Parser Backends

By default, Dude uses Playwright, but you can switch to a parser backend that you are already familiar with, such as
[BeautifulSoup4](https://roniemartinez.github.io/dude/advanced/09_beautifulsoup4.html),
[Parsel](https://roniemartinez.github.io/dude/advanced/10_parsel.html),
[lxml](https://roniemartinez.github.io/dude/advanced/11_lxml.html),
or [Selenium](https://roniemartinez.github.io/dude/advanced/13_selenium.html).

Here is a summary of the features supported by each parser backend.


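| Parser Backend | Supports Sync? | Supports Async? | CSS Selector | XPath Selector | Text Selector | Regex Selector | Setup Handler | Navigate Handler | Comments |
|----------------|----------------|-----------------|--------------|----------------|---------------|----------------|---------------|------------------|----------|
| Playwright     | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| BeautifulSoup4 | ✅ | ✅ | ✅ | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | |
| Parsel         | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 | 🚫 | |
| lxml           | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 | 🚫 | |
| Pyppeteer      | 🚫 | ✅ | ✅ | ✅ | ✅ | 🚫 | ✅ | ✅ | Not supported from 0.23.0 |
| Selenium       | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 | ✅ | ✅ | |
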
## Using the Docker image

Pull the Docker image using the following command.

```console
docker pull roniemartinez/dude
```

Assuming that `script.py` exists in the current directory, run Dude using the following command.

```console
docker run -it --rm -v "$PWD":/code roniemartinez/dude dude scrape --url <url> script.py
```

## Documentation

Read the complete documentation at [https://roniemartinez.github.io/dude/](https://roniemartinez.github.io/dude/).
All the advanced and useful features are documented there.

## Requirements

- ✅ Any dude should know how to work with selectors (CSS or XPath).
- ✅ Familiarity with any backends that you love (see [Supported Parser Backends](#supported-parser-backends))
- ✅ Python decorators... you'll live, dude!

## Why name this project "dude"?

- ✅ A [Recursive acronym](https://en.wikipedia.org/wiki/Recursive_acronym) looks nice.
- ✅ Adding "uncomplicated" (like [`ufw`](https://wiki.ubuntu.com/UncomplicatedFirewall)) into the name says it is a very simple framework.
- ✅ Puns! I also think that if you want to do web scraping, there's probably some random dude around the corner who can make it very easy for you to start with it. 😊

## Author

Ronie Martinez

## Contributors ✨

Thanks go to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):



- Ronie Martinez: 🚧 💻 📖 🚇

This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!