{"id":13936644,"url":"https://github.com/CompileInc/hodor","last_synced_at":"2025-07-19T22:31:18.722Z","repository":{"id":10923682,"uuid":"67508734","full_name":"CompileInc/hodor","owner":"CompileInc","description":"🕷Configuration based html scraper","archived":false,"fork":false,"pushed_at":"2024-05-21T22:55:23.000Z","size":55,"stargazers_count":23,"open_issues_count":7,"forks_count":3,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-05-21T23:54:57.858Z","etag":null,"topics":["cssselect","hodor","html-scraper","lxml","pagination","python","scraping"],"latest_commit_sha":null,"homepage":"http://hodor.live","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CompileInc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-09-06T13:03:11.000Z","updated_at":"2024-05-29T23:53:40.414Z","dependencies_parsed_at":"2024-04-28T00:04:06.496Z","dependency_job_id":"4808a86b-b6ee-431d-a199-a0c8a1e6f287","html_url":"https://github.com/CompileInc/hodor","commit_stats":{"total_commits":77,"total_committers":8,"mean_commits":9.625,"dds":0.4415584415584416,"last_synced_commit":"afe0a726dc07d08ac1e0cb34c2330d72ca52aaff"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CompileInc%2Fhodor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CompileInc%2Fhodor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CompileInc%2Fhodor/releases","manifests_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CompileInc%2Fhodor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CompileInc","download_url":"https://codeload.github.com/CompileInc/hodor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226686729,"owners_count":17666928,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cssselect","hodor","html-scraper","lxml","pagination","python","scraping"],"created_at":"2024-08-07T23:02:52.995Z","updated_at":"2024-11-27T04:31:17.111Z","avatar_url":"https://github.com/CompileInc.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\n\n# Hodor [![PyPI](https://img.shields.io/pypi/v/hodorlive.svg?maxAge=2592000?style=plastic)](https://pypi.python.org/pypi/hodorlive/)\n\nA simple html scraper with xpath or css.\n\n## Install\n\n```pip install hodorlive```\n\n## Usage\n\n### As python package\n\n***WARNING: This package by default doesn't verify ssl connections. 
Please check the [arguments](#arguments) to enable them.***\n\n#### Sample code\n```python\nfrom hodor import Hodor\nfrom dateutil.parser import parse\n\n\ndef date_convert(data):\n    return parse(data)\n\nurl = 'http://www.nasdaq.com/markets/stocks/symbol-change-history.aspx'\n\nCONFIG = {\n    'old_symbol': {\n        'css': '#SymbolChangeList_table tr td:nth-child(1)',\n        'many': True\n    },\n    'new_symbol': {\n        'css': '#SymbolChangeList_table tr td:nth-child(2)',\n        'many': True\n    },\n    'effective_date': {\n        'css': '#SymbolChangeList_table tr td:nth-child(3)',\n        'many': True,\n        'transform': date_convert\n    },\n    '_groups': {\n        'data': '__all__',\n        'ticker_changes': ['old_symbol', 'new_symbol']\n    },\n    '_paginate_by': {\n        'xpath': '//*[@id=\"two_column_main_content_lb_NextPage\"]/@href',\n        'many': False\n    }\n}\n\nh = Hodor(url=url, config=CONFIG, pagination_max_limit=5)\n\nh.data\n```\n#### Sample output\n```python\n{'data': [{'effective_date': datetime.datetime(2016, 11, 1, 0, 0),\n           'new_symbol': 'ARNC',\n           'old_symbol': 'AA'},\n          {'effective_date': datetime.datetime(2016, 11, 1, 0, 0),\n           'new_symbol': 'ARNC$',\n           'old_symbol': 'AA$'},\n          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),\n           'new_symbol': 'MALN8',\n           'old_symbol': 'AHUSDN2018'},\n          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),\n           'new_symbol': 'MALN9',\n           'old_symbol': 'AHUSDN2019'},\n          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),\n           'new_symbol': 'MALQ6',\n           'old_symbol': 'AHUSDQ2016'},\n          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),\n           'new_symbol': 'MALQ7',\n           'old_symbol': 'AHUSDQ2017'},\n          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),\n           'new_symbol': 'MALQ8',\n           'old_symbol': 
'AHUSDQ2018'}]}\n```\n\n#### Arguments\n\n- ```ua``` (User-Agent)\n- ```proxies``` (check requesocks)\n- ```auth```\n- ```crawl_delay``` (crawl delay in seconds across pagination - default: 3 seconds)\n- ```pagination_max_limit``` (max number of pages to crawl - default: 100)\n- ```ssl_verify``` (default: False)\n- ```robots``` (if set, respects robots.txt - default: True)\n- ```reppy_capacity``` (robots cache LRU capacity - default: 100)\n- ```trim_values``` (if set, trims leading and trailing whitespace from output values - default: True)\n\n\n#### Config parameters:\n- By default, any key in the config is a rule to parse.\n    - Each rule can be either an ```xpath``` or a ```css``` selector\n    - Each rule extracts ```many``` values by default unless explicitly set to ```False```\n    - Each rule can ```transform``` its result with a function, if one is provided\n- Extra parameters include grouping (```_groups```) and pagination (```_paginate_by```); the latter also uses the rule format.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCompileInc%2Fhodor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCompileInc%2Fhodor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCompileInc%2Fhodor/lists"}