{"id":13689007,"url":"https://github.com/roniemartinez/dude","last_synced_at":"2025-03-16T22:06:00.237Z","repository":{"id":37770061,"uuid":"459162171","full_name":"roniemartinez/dude","owner":"roniemartinez","description":"dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators","archived":false,"fork":false,"pushed_at":"2025-03-03T11:37:04.000Z","size":2632,"stargazers_count":428,"open_issues_count":31,"forks_count":19,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-09T21:42:55.831Z","etag":null,"topics":["async","beautifulsoup4","crawler","css","framework","lxml","parsel","playwright","python","scraper","scraping","selenium","sync","web-scraping","webscraping","xpath"],"latest_commit_sha":null,"homepage":"https://roniemartinez.github.io/dude/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/roniemartinez.png","metadata":{"funding":{"buy_me_a_coffee":"roniemartinez","github":["roniemartinez"]},"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":"docs/supported_parser_backends/index.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-14T12:55:45.000Z","updated_at":"2025-02-12T21:32:48.000Z","dependencies_parsed_at":"2023-09-27T02:41:12.953Z","dependency_job_id":"c67db214-8eaa-4041-9b03-d86d0b47d1c1","html_url":"https://github.com/roniemartinez/dude","commit_stats":{"total_commits":460,"total_committers":3,"mean_commits":"153.33333333333334","dds":"0.21739130434782605","last_synced_commit":"f162a983aeb28c23bc55953708badbd2c08e6028"},"previous_names":[],"tags_count":45,"template":false,"template_full
_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/roniemartinez%2Fdude","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/roniemartinez%2Fdude/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/roniemartinez%2Fdude/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/roniemartinez%2Fdude/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/roniemartinez","download_url":"https://codeload.github.com/roniemartinez/dude/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243940138,"owners_count":20372047,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["async","beautifulsoup4","crawler","css","framework","lxml","parsel","playwright","python","scraper","scraping","selenium","sync","web-scraping","webscraping","xpath"],"created_at":"2024-08-02T15:01:30.199Z","updated_at":"2025-03-16T22:06:00.202Z","avatar_url":"https://github.com/roniemartinez.png","language":"Python","funding_links":["https://buymeacoffee.com/roniemartinez","https://github.com/sponsors/roniemartinez"],"categories":["Python"],"sub_categories":[],"readme":"**Archived!!! 
I can no longer maintain this repository.**\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eLicense\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src='https://img.shields.io/pypi/l/pydude.svg?style=for-the-badge' alt=\"License\"\u003e\u003c/td\u003e\n        \u003ctd\u003eVersion\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src='https://img.shields.io/pypi/v/pydude.svg?logo=pypi\u0026style=for-the-badge' alt=\"Version\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eGithub Actions\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src='https://img.shields.io/github/actions/workflow/status/roniemartinez/dude/python.yml?branch=master\u0026label=actions\u0026logo=github%20actions\u0026style=for-the-badge' alt=\"Github Actions\"\u003e\u003c/td\u003e\n        \u003ctd\u003eCoverage\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src='https://img.shields.io/codecov/c/github/roniemartinez/dude/master?label=codecov\u0026logo=codecov\u0026style=for-the-badge' alt=\"CodeCov\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eSupported versions\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src='https://img.shields.io/pypi/pyversions/pydude.svg?logo=python\u0026style=for-the-badge' alt=\"Python Versions\"\u003e\u003c/td\u003e\n        \u003ctd\u003eWheel\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src='https://img.shields.io/pypi/wheel/pydude.svg?style=for-the-badge' alt=\"Wheel\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eStatus\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src='https://img.shields.io/pypi/status/pydude.svg?style=for-the-badge' alt=\"Status\"\u003e\u003c/td\u003e\n        \u003ctd\u003eDownloads\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src='https://img.shields.io/pypi/dm/pydude.svg?style=for-the-badge' alt=\"Downloads\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003eAll 
Contributors\u003c/td\u003e\n        \u003ctd\u003e\u003ca href=\"#contributors-\"\u003e\u003cimg src='https://img.shields.io/github/all-contributors/roniemartinez/dude?style=for-the-badge' alt=\"All Contributors\"\u003e\u003c/a\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n# dude uncomplicated data extraction\n\nDude is a very simple framework for writing web scrapers using Python decorators.\nThe design, inspired by [Flask](https://github.com/pallets/flask), was to easily build a web scraper in just a few lines of code.\nDude has an easy-to-learn syntax.\n\n\u003e 🚨 Dude is currently in Pre-Alpha. Please expect breaking changes.\n\n## Installation\n\nTo install, simply run the following from terminal.\n\n```bash\npip install pydude\nplaywright install  # Install playwright binaries for Chrome, Firefox and Webkit.\n```\n\n## Minimal web scraper\n\nThe simplest web scraper will look like this:\n\n```python\nfrom dude import select\n\n\n@select(css=\"a\")\ndef get_link(element):\n    return {\"url\": element.get_attribute(\"href\")}\n```\n\nThe example above will get all the [hyperlink](https://en.wikipedia.org/wiki/Hyperlink#HTML) elements in a page and calls the handler function `get_link()` for each element.\n\n## How to run the scraper\n\nYou can run your scraper from terminal/shell/command-line by supplying URLs, the output filename of your choice and the paths to your python scripts to `dude scrape` command.\n\n```bash\ndude scrape --url \"\u003curl\u003e\" --output data.json path/to/script.py\n```\n\nThe output in `data.json` should contain the actual URL and the metadata prepended with underscore.\n\n```json5\n[\n  {\n    \"_page_number\": 1,\n    \"_page_url\": \"https://dude.ron.sh/\",\n    \"_group_id\": 4502003824,\n    \"_group_index\": 0,\n    \"_element_index\": 0,\n    \"url\": \"/url-1.html\"\n  },\n  {\n    \"_page_number\": 1,\n    \"_page_url\": \"https://dude.ron.sh/\",\n    \"_group_id\": 4502003824,\n    \"_group_index\": 
0,\n    \"_element_index\": 1,\n    \"url\": \"/url-2.html\"\n  },\n  {\n    \"_page_number\": 1,\n    \"_page_url\": \"https://dude.ron.sh/\",\n    \"_group_id\": 4502003824,\n    \"_group_index\": 0,\n    \"_element_index\": 2,\n    \"url\": \"/url-3.html\"\n  }\n]\n```\n\nChanging the output to `--output data.csv` should result in the following CSV content.\n\n![data.csv](docs/csv.png)\n\n## Features\n\n- Simple [Flask](https://github.com/pallets/flask)-inspired design - build a scraper with decorators.\n- Uses [Playwright](https://playwright.dev/python/) API - run your scraper in Chrome, Firefox and Webkit and leverage Playwright's powerful selector engine supporting CSS, XPath, text, regex, etc.\n- Data grouping - group related results.\n- URL pattern matching - run functions on matched URLs.\n- Priority - reorder functions based on priority.\n- Setup function - enable setup steps (clicking dialogs or login).\n- Navigate function - enable navigation steps to move to other pages.\n- Custom storage - option to save data to other formats or database.\n- Async support - write async handlers.\n- Option to use other parser backends aside from Playwright.\n  - [BeautifulSoup4](https://roniemartinez.github.io/dude/advanced/09_beautifulsoup4.html) - `pip install pydude[bs4]`\n  - [Parsel](https://roniemartinez.github.io/dude/advanced/10_parsel.html) - `pip install pydude[parsel]`\n  - [lxml](https://roniemartinez.github.io/dude/advanced/11_lxml.html) - `pip install pydude[lxml]`\n  - [Selenium](https://roniemartinez.github.io/dude/advanced/13_selenium.html) - `pip install pydude[selenium]`\n- Option to follow all links indefinitely (Crawler/Spider).\n- Events - attach functions to startup, pre-setup, post-setup and shutdown events.\n- Option to save data on every page.\n\n## Supported Parser Backends\n\nBy default, Dude uses Playwright but gives you an option to use parser backends that you are familiar with.\nIt is possible to use parser backends like 
\n[BeautifulSoup4](https://roniemartinez.github.io/dude/advanced/09_beautifulsoup4.html), \n[Parsel](https://roniemartinez.github.io/dude/advanced/10_parsel.html),\n[lxml](https://roniemartinez.github.io/dude/advanced/11_lxml.html),\nand [Selenium](https://roniemartinez.github.io/dude/advanced/13_selenium.html).\n\nHere is the summary of features supported by each parser backend.\n\n\u003ctable\u003e\n\u003cthead\u003e\n  \u003ctr\u003e\n    \u003ctd rowspan=\"2\" style='text-align:center;'\u003eParser Backend\u003c/td\u003e\n    \u003ctd rowspan=\"2\" style='text-align:center;'\u003eSupports\u003cbr\u003eSync?\u003c/td\u003e\n    \u003ctd rowspan=\"2\" style='text-align:center;'\u003eSupports\u003cbr\u003eAsync?\u003c/td\u003e\n    \u003ctd colspan=\"4\" style='text-align:center;'\u003eSelectors\u003c/td\u003e\n    \u003ctd rowspan=\"2\" style='text-align:center;'\u003e\u003ca href=\"https://roniemartinez.github.io/dude/advanced/01_setup.html\"\u003eSetup\u003cbr\u003eHandler\u003c/a\u003e\u003c/td\u003e\n    \u003ctd rowspan=\"2\" style='text-align:center;'\u003e\u003ca href=\"https://roniemartinez.github.io/dude/advanced/02_navigate.html\"\u003eNavigate\u003cbr\u003eHandler\u003c/a\u003e\u003c/td\u003e\n    \u003ctd rowspan=\"2\" style='text-align:center;'\u003eComments\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eCSS\u003c/td\u003e\n    \u003ctd\u003eXPath\u003c/td\u003e\n    \u003ctd\u003eText\u003c/td\u003e\n    \u003ctd\u003eRegex\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ePlaywright\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    
\u003ctd\u003eBeautifulSoup4\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e🚫\u003c/td\u003e\n    \u003ctd\u003e🚫\u003c/td\u003e\n    \u003ctd\u003e🚫\u003c/td\u003e\n    \u003ctd\u003e🚫\u003c/td\u003e\n    \u003ctd\u003e🚫\u003c/td\u003e\n    \u003ctd\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eParsel\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e🚫\u003c/td\u003e\n    \u003ctd\u003e🚫\u003c/td\u003e\n    \u003ctd\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003elxml\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e🚫\u003c/td\u003e\n    \u003ctd\u003e🚫\u003c/td\u003e\n    \u003ctd\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ePyppeteer\u003c/td\u003e\n    \u003ctd\u003e🚫\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e🚫\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003eNot supported from 0.23.0\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eSelenium\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e🚫\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    \u003ctd\u003e✅\u003c/td\u003e\n    
\u003ctd\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n## Using the Docker image\n\nPull the docker image using the following command.\n\n```console\ndocker pull roniemartinez/dude\n```\n\nAssuming that `script.py` exists in the current directory, run Dude using the following command.\n\n```console\ndocker run -it --rm -v \"$PWD\":/code roniemartinez/dude dude scrape --url \u003curl\u003e script.py\n```\n\n## Documentation\n\nRead the complete documentation at [https://roniemartinez.github.io/dude/](https://roniemartinez.github.io/dude/).\nAll the advanced and useful features are documented there.\n\n## Requirements\n\n- ✅ Any dude should know how to work with selectors (CSS or XPath).\n- ✅ Familiarity with any backends that you love (see [Supported Parser Backends](#supported-parser-backends))\n- ✅ Python decorators... you'll live, dude!\n\n## Why name this project \"dude\"?\n\n- ✅ A [Recursive acronym](https://en.wikipedia.org/wiki/Recursive_acronym) looks nice.\n- ✅ Adding \"uncomplicated\" (like [`ufw`](https://wiki.ubuntu.com/UncomplicatedFirewall)) into the name says it is a very simple framework. \n- ✅ Puns! I also think that if you want to do web scraping, there's probably some random dude around the corner who can make it very easy for you to start with it. 
😊\n\n## Author\n\n[Ronie Martinez](mailto:ronmarti18@gmail.com)\n\n## Contributors ✨\n\nThanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):\n\n\u003c!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section --\u003e\n\u003c!-- prettier-ignore-start --\u003e\n\u003c!-- markdownlint-disable --\u003e\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\u003ca href=\"https://ron.sh\"\u003e\u003cimg src=\"https://avatars.githubusercontent.com/u/2573537?v=4?s=100\" width=\"100px;\" alt=\"\"/\u003e\u003cbr /\u003e\u003csub\u003e\u003cb\u003eRonie Martinez\u003c/b\u003e\u003c/sub\u003e\u003c/a\u003e\u003cbr /\u003e\u003ca href=\"#maintenance-roniemartinez\" title=\"Maintenance\"\u003e🚧\u003c/a\u003e \u003ca href=\"https://github.com/roniemartinez/dude/commits?author=roniemartinez\" title=\"Code\"\u003e💻\u003c/a\u003e \u003ca href=\"https://github.com/roniemartinez/dude/commits?author=roniemartinez\" title=\"Documentation\"\u003e📖\u003c/a\u003e \u003ca href=\"#infra-roniemartinez\" title=\"Infrastructure (Hosting, Build-Tools, etc)\"\u003e🚇\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\u003c!-- markdownlint-restore --\u003e\n\u003c!-- prettier-ignore-end --\u003e\n\n\u003c!-- ALL-CONTRIBUTORS-LIST:END --\u003e\n\nThis project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froniemartinez%2Fdude","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Froniemartinez%2Fdude","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froniemartinez%2Fdude/lists"}