{"id":22944792,"url":"https://github.com/nuncjo/delver","last_synced_at":"2025-08-12T22:32:39.277Z","repository":{"id":29245149,"uuid":"87230066","full_name":"nuncjo/Delver","owner":"nuncjo","description":"Programmatic web browser/crawler in Python. Alternative to Mechanize, RoboBrowser, MechanicalSoup and others. Strict power of Request and Lxml. Some features and methods usefull in scraping \"out of the box\".","archived":false,"fork":false,"pushed_at":"2022-12-26T20:24:44.000Z","size":506,"stargazers_count":3,"open_issues_count":3,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-12-10T11:24:39.646Z","etag":null,"topics":["crawling","lxml","mechanize","python","scraper","scraping","web"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nuncjo.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.rst","contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-04-04T20:07:34.000Z","updated_at":"2020-05-01T10:35:02.000Z","dependencies_parsed_at":"2023-01-14T14:30:49.086Z","dependency_job_id":null,"html_url":"https://github.com/nuncjo/Delver","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuncjo%2FDelver","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuncjo%2FDelver/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuncjo%2FDelver/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuncjo%2FDelver/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/owners/nuncjo","download_url":"https://codeload.github.com/nuncjo/Delver/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229712166,"owners_count":18112265,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawling","lxml","mechanize","python","scraper","scraping","web"],"created_at":"2024-12-14T14:20:32.354Z","updated_at":"2024-12-14T14:20:32.986Z","avatar_url":"https://github.com/nuncjo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Delver\n========================\n\nProgrammatic web browser/crawler in Python. **Alternative to Mechanize, RoboBrowser, MechanicalSoup**\nand others. Built directly on the power of Requests and lxml. 
Some features and methods useful in scraping \"out of the box\".\n\n## Install\n\n```shell\n$ pip install delver\n```\n\n## Documentation\n\n[http://delver.readthedocs.io/en/latest/](http://delver.readthedocs.io/en/latest/)\n\n## Quick start - usage examples\n\n- [Basic examples](#basic-examples)\n    - [Form submit](#form-submit)\n    - [Find links narrowed by filters](#find-links-narrowed-by-filters)\n    - [Download file](#download-file)\n    - [Download files list in parallel](#download-files-list-in-parallel)\n    - [Xpath selectors](#xpath-selectors)\n    - [Css selectors](#css-selectors)\n    - [Xpath result with filters](#xpath-result-with-filters)\n    - [Using retries](#using-retries)\n- [Use examples](#use-examples)\n    - [Scraping Steam Specials using XPath](#scraping-steam-specials-using-xpath)\n    - [Simple tables scraping out of the box](#simple-tables-scraping-out-of-the-box)\n    - [User login](#user-login)\n    - [One Punch Man Downloader](#one-punch-man-downloader)\n\n- - -\n\n## Basic examples\n\n\n## Form submit\n\n```python\n\n        \u003e\u003e\u003e from delver import Crawler\n        \u003e\u003e\u003e c = Crawler()\n        \u003e\u003e\u003e response = c.open('https://httpbin.org/forms/post')\n        \u003e\u003e\u003e forms = c.forms()\n\n        # Filling in field values:\n        \u003e\u003e\u003e form = forms[0]\n        \u003e\u003e\u003e form.fields = {\n        ...    'custname': 'Ruben Rybnik',\n        ...    'custemail': 'ruben.rybnik@fakemail.com',\n        ...    'size': 'medium',\n        ...    'topping': ['bacon', 'cheese'],\n        ...    'custtel': '+48606505888'\n        ... }\n        \u003e\u003e\u003e submit_result = c.submit(form)\n        \u003e\u003e\u003e submit_result.status_code\n        200\n\n        # Checking whether the form post succeeded:\n        \u003e\u003e\u003e c.submit_check(\n        ...    form,\n        ...    phrase=\"Ruben Rybnik\",\n        ...    url='https://httpbin.org/forms/post',\n        ...    
status_codes=[200]\n        ... )\n        True\n```\n\n## Find links narrowed by filters\n\n```python\n\n        \u003e\u003e\u003e c = Crawler()\n        \u003e\u003e\u003e c.open('https://httpbin.org/links/10/0')\n        \u003cResponse [200]\u003e\n\n        # Links can be filtered by some html tags and filters\n        # like: id, text, title and class:\n        \u003e\u003e\u003e links = c.links(\n        ...     tags = ('style', 'link', 'script', 'a'),\n        ...     filters = {\n        ...         'text': '7'\n        ...     },\n        ...     match='NOT_EQUAL'\n        ... )\n        \u003e\u003e\u003e len(links)\n        8\n```\n\n## Download file\n\n```python\n\n        \u003e\u003e\u003e import os\n\n        \u003e\u003e\u003e c = Crawler()\n        \u003e\u003e\u003e local_file_path = c.download(\n        ...     local_path='test',\n        ...     url='https://httpbin.org/image/png',\n        ...     name='test.png'\n        ... )\n        \u003e\u003e\u003e os.path.isfile(local_file_path)\n        True\n```\n\n## Download files list in parallel\n\n```python\n\n        \u003e\u003e\u003e c = Crawler()\n        \u003e\u003e\u003e c.open('https://xkcd.com/')\n        \u003cResponse [200]\u003e\n        \u003e\u003e\u003e full_images_urls = [c.join_url(src) for src in c.images()]\n        \u003e\u003e\u003e downloaded_files = c.download_files('test', files=full_images_urls)\n        \u003e\u003e\u003e len(full_images_urls) == len(downloaded_files)\n        True\n```\n\n## Xpath selectors\n\n```python\n\n        c = Crawler()\n        c.open('https://httpbin.org/html')\n        p_text = c.xpath('//p/text()')\n```\n\n## Css selectors\n\n```python\n\n        c = Crawler()\n        c.open('https://httpbin.org/html')\n        p_text = c.css('div')\n```\n\n## Xpath result with filters\n\n```python\n\n        c = Crawler()\n        c.open('https://www.w3schools.com/')\n        filtered_results = c.xpath('//p').filter(filters={'class': 
'w3-xlarge'})\n```\n\n## Using retries\n\n```python\n\n        c = Crawler()\n        # Setting max_retries to 2 allows at most two retries when opening a URL:\n        # if the first attempt fails, wait 1 second and try again; if that attempt fails,\n        # wait 2 seconds and try again\n        c.max_retries = 2\n        c.open('http://www.delver.cg/404')\n```\n\n## Use examples\n\n\n## Scraping Steam Specials using XPath\n\n```python\n\n    from pprint import pprint\n    from delver import Crawler\n\n    c = Crawler(absolute_links=True)\n    c.logging = True\n    c.useragent = \"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\"\n    c.random_timeout = (0, 5)\n    c.open('http://store.steampowered.com/search/?specials=1')\n    titles, discounts, final_prices = [], [], []\n\n\n    while c.links(filters={\n        'class': 'pagebtn',\n        'text': '\u003e'\n    }):\n        c.open(c.current_results[0])\n        titles.extend(\n            c.xpath(\"//div/span[@class='title']/text()\")\n        )\n        discounts.extend(\n            c.xpath(\"//div[contains(@class, 'search_discount')]/span/text()\")\n        )\n        final_prices.extend(\n            c.xpath(\"//div[contains(@class, 'discounted')]//text()[2]\").strip()\n        )\n\n    all_results = {\n        row[0]: {\n            'discount': row[1],\n            'final_price': row[2]\n        } for row in zip(titles, discounts, final_prices)}\n    pprint(all_results)\n```\n\n## Simple tables scraping out of the box\n\n```python\n\n    from pprint import pprint\n    from delver import Crawler\n\n    c = Crawler(absolute_links=True)\n    c.logging = True\n    c.useragent = \"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\"\n    c.open(\"http://www.boxofficemojo.com/daily/\")\n    pprint(c.tables())\n```\n\n## User login\n\n```python\n\n\n    from delver import Crawler\n\n    c = Crawler()\n    c.useragent = (\n        \"Mozilla/5.0 (Windows NT 
10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) \"\n        \"Chrome/60.0.3112.90 Safari/537.36\"\n    )\n    c.random_timeout = (0, 5)\n    c.open('http://testing-ground.scraping.pro/login')\n    forms = c.forms()\n    if forms:\n        login_form = forms[0]\n        login_form.fields = {\n            'usr': 'admin',\n            'pwd': '12345'\n        }\n        c.submit(login_form)\n        success_check = c.submit_check(\n            login_form,\n            phrase='WELCOME :)',\n            status_codes=[200]\n        )\n        print(success_check)\n```\n\n## One Punch Man Downloader\n\n```python\n\n    import os\n    from delver import Crawler\n\n    class OnePunchManDownloader:\n        \"\"\"Downloads free One Punch Man manga chapters to local directories.\n        Uses one main thread for the scraper, with a random timeout.\n        Uses 20 threads just for image downloads.\n        \"\"\"\n        def __init__(self):\n            self._target_directory = 'one_punch_man'\n            self._start_url = \"http://m.mangafox.me/manga/onepunch_man_one/\"\n            self.crawler = Crawler()\n            self.crawler.random_timeout = (0, 5)\n            self.crawler.useragent = \"Googlebot-Image/1.0\"\n\n        def run(self):\n            self.crawler.open(self._start_url)\n            for link in self.crawler.links(filters={'text': 'Ch '}, match='IN'):\n                self.download_images(link)\n\n        def download_images(self, link):\n            target_path = '{}/{}'.format(self._target_directory, link.split('/')[-2])\n            full_chapter_url = link.replace('/manga/', '/roll_manga/')\n            self.crawler.open(full_chapter_url)\n            images = self.crawler.xpath(\"//img[@class='reader-page']/@data-original\")\n            os.makedirs(target_path, exist_ok=True)\n            self.crawler.download_files(target_path, files=images, workers=20)\n\n\n    downloader = OnePunchManDownloader()\n    
downloader.run()\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnuncjo%2Fdelver","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnuncjo%2Fdelver","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnuncjo%2Fdelver/lists"}