Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/roniemartinez/dude
dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators
- Host: GitHub
- URL: https://github.com/roniemartinez/dude
- Owner: roniemartinez
- License: agpl-3.0
- Created: 2022-02-14T12:55:45.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2025-01-02T20:47:46.000Z (9 days ago)
- Last Synced: 2025-01-05T03:32:53.910Z (6 days ago)
- Topics: async, beautifulsoup4, crawler, css, framework, lxml, parsel, playwright, python, scraper, scraping, selenium, sync, web-scraping, webscraping, xpath
- Language: Python
- Homepage: https://roniemartinez.github.io/dude/
- Size: 2.24 MB
- Stars: 425
- Watchers: 10
- Forks: 19
- Open Issues: 24
Metadata Files:
- Readme: README.md
- License: LICENSE
- Support: docs/supported_parser_backends/index.md
Awesome Lists containing this project
- project-awesome - roniemartinez/dude - dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators (Python)
README
# dude uncomplicated data extraction
Dude is a very simple framework for writing web scrapers using Python decorators.
Its design, inspired by [Flask](https://github.com/pallets/flask), makes it easy to build a web scraper in just a few lines of code.
Dude has an easy-to-learn syntax.

> 🚨 Dude is currently in Pre-Alpha. Please expect breaking changes.
## Installation
To install, run the following in a terminal.
```bash
pip install pydude
playwright install # Install playwright binaries for Chrome, Firefox and Webkit.
```

## Minimal web scraper
The simplest web scraper will look like this:
```python
from dude import select


@select(css="a")
def get_link(element):
    # Called once for each element matched by the CSS selector.
    return {"url": element.get_attribute("href")}
```

The example above gets all the [hyperlink](https://en.wikipedia.org/wiki/Hyperlink#HTML) elements in a page and calls the handler function `get_link()` for each element.
## How to run the scraper
You can run your scraper from a terminal by passing the URLs, an output filename of your choice, and the paths to your Python scripts to the `dude scrape` command.
```bash
dude scrape --url "<url>" --output data.json path/to/script.py
```

The output in `data.json` should contain the extracted URLs plus metadata fields, whose names are prefixed with an underscore.
```json5
[
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 0,
    "url": "/url-1.html"
  },
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 1,
    "url": "/url-2.html"
  },
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 2,
    "url": "/url-3.html"
  }
]
```

Changing the output to `--output data.csv` should result in the following CSV content.
![data.csv](docs/csv.png)
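
If you want to post-process the JSON output yourself, the records are plain dictionaries, so the standard library is enough. The sketch below is an assumption-laden illustration, not Dude's own CSV writer: `records_to_csv` and `strip_metadata` are hypothetical helpers, and the column order simply follows the key order of the first record.

```python
import csv
import io
import json
from typing import Dict, List

# Sample records copied from the data.json example above.
RECORDS_JSON = """
[
  {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
   "_group_index": 0, "_element_index": 0, "url": "/url-1.html"},
  {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
   "_group_index": 0, "_element_index": 1, "url": "/url-2.html"},
  {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
   "_group_index": 0, "_element_index": 2, "url": "/url-3.html"}
]
"""

records: List[Dict] = json.loads(RECORDS_JSON)


def strip_metadata(record: Dict) -> Dict:
    """Drop the metadata fields, i.e. keys that start with an underscore."""
    return {key: value for key, value in record.items() if not key.startswith("_")}


def records_to_csv(rows: List[Dict]) -> str:
    """Serialize the rows as CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = records_to_csv(records)
print(csv_text)
print([strip_metadata(record) for record in records])
```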
## Features
- Simple [Flask](https://github.com/pallets/flask)-inspired design - build a scraper with decorators.
- Uses [Playwright](https://playwright.dev/python/) API - run your scraper in Chrome, Firefox and Webkit and leverage Playwright's powerful selector engine supporting CSS, XPath, text, regex, etc.
- Data grouping - group related results.
- URL pattern matching - run functions on matched URLs.
- Priority - reorder functions based on priority.
- Setup function - enable setup steps (clicking dialogs or login).
- Navigate function - enable navigation steps to move to other pages.
- Custom storage - option to save data to other formats or database.
- Async support - write async handlers.
- Option to use other parser backends aside from Playwright.
  - [BeautifulSoup4](https://roniemartinez.github.io/dude/advanced/09_beautifulsoup4.html) - `pip install pydude[bs4]`
  - [Parsel](https://roniemartinez.github.io/dude/advanced/10_parsel.html) - `pip install pydude[parsel]`
  - [lxml](https://roniemartinez.github.io/dude/advanced/11_lxml.html) - `pip install pydude[lxml]`
  - [Selenium](https://roniemartinez.github.io/dude/advanced/13_selenium.html) - `pip install pydude[selenium]`
- Option to follow all links indefinitely (Crawler/Spider).
- Events - attach functions to startup, pre-setup, post-setup and shutdown events.
- Option to save data on every page.

## Supported Parser Backends
By default, Dude uses Playwright but gives you an option to use parser backends that you are familiar with.
It is possible to use parser backends like
[BeautifulSoup4](https://roniemartinez.github.io/dude/advanced/09_beautifulsoup4.html),
[Parsel](https://roniemartinez.github.io/dude/advanced/10_parsel.html),
[lxml](https://roniemartinez.github.io/dude/advanced/11_lxml.html),
and [Selenium](https://roniemartinez.github.io/dude/advanced/13_selenium.html).

Here is the summary of features supported by each parser backend.
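
| Parser Backend | Supports Sync? | Supports Async? | CSS Selector | XPath Selector | Text Selector | Regex Selector | Setup Handler | Navigate Handler | Comments |
|---|---|---|---|---|---|---|---|---|---|
| Playwright | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| BeautifulSoup4 | ✅ | ✅ | ✅ | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | |
| Parsel | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 | 🚫 | |
| lxml | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 | 🚫 | |
| Pyppeteer | 🚫 | ✅ | ✅ | ✅ | ✅ | 🚫 | ✅ | ✅ | Not supported from 0.23.0 |
| Selenium | ✅ | ✅ | ✅ | ✅ | ✅ | 🚫 | ✅ | ✅ | |
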
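
When choosing a backend, it can help to query the support summary above programmatically. The following sketch transcribes that summary into a dictionary; the data layout and the `backends_supporting` helper are illustrative assumptions and not part of Dude's API.

```python
# Support matrix transcribed from the summary above. The layout and the helper
# below are illustrative assumptions, not part of dude's API.
from typing import Dict, List

FEATURES = ["sync", "async", "css", "xpath", "text", "regex", "setup", "navigate"]

# One flag per feature, in the order of FEATURES (Y = supported, N = not).
_RAW = {
    "Playwright": "YYYYYYYY",
    "BeautifulSoup4": "YYYNNNNN",
    "Parsel": "YYYYYYNN",
    "lxml": "YYYYYYNN",
    "Pyppeteer": "NYYYYNYY",  # not supported from 0.23.0
    "Selenium": "YYYYYNYY",
}

SUPPORT: Dict[str, Dict[str, bool]] = {
    backend: {feature: flag == "Y" for feature, flag in zip(FEATURES, flags)}
    for backend, flags in _RAW.items()
}


def backends_supporting(*features: str) -> List[str]:
    """Return the backends that support every requested feature."""
    return [
        backend
        for backend, supported in SUPPORT.items()
        if all(supported[feature] for feature in features)
    ]


print(backends_supporting("regex"))
print(backends_supporting("sync", "navigate"))
```

For example, asking for both `"sync"` and `"navigate"` narrows the choice to the browser-driven backends that can move between pages in synchronous code.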
## Using the Docker image
Pull the docker image using the following command.
```console
docker pull roniemartinez/dude
```

Assuming that `script.py` exists in the current directory, run Dude using the following command.
```console
docker run -it --rm -v "$PWD":/code roniemartinez/dude dude scrape --url <url> script.py
```

## Documentation
Read the complete documentation at [https://roniemartinez.github.io/dude/](https://roniemartinez.github.io/dude/).
All the advanced and useful features are documented there.

## Requirements
- ✅ Any dude should know how to work with selectors (CSS or XPath).
- ✅ Familiarity with any backends that you love (see [Supported Parser Backends](#supported-parser-backends))
- ✅ Python decorators... you'll live, dude!

## Why name this project "dude"?
- ✅ A [Recursive acronym](https://en.wikipedia.org/wiki/Recursive_acronym) looks nice.
- ✅ Adding "uncomplicated" (like [`ufw`](https://wiki.ubuntu.com/UncomplicatedFirewall)) into the name says it is a very simple framework.
- ✅ Puns! I also think that if you want to do web scraping, there's probably some random dude around the corner who can make it very easy for you to start with it. 😊

## Author
Ronie Martinez
## Contributors ✨
Thanks go to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):
- Ronie Martinez: 🚧 💻 📖 🚇
This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!