{"id":17119573,"url":"https://github.com/fmilthaler/htmlparser","last_synced_at":"2025-03-24T02:21:10.324Z","repository":{"id":74419751,"uuid":"196575996","full_name":"fmilthaler/HTMLParser","owner":"fmilthaler","description":"Python class to scrap and parse a webpage (using requests, BeautifulSoup4), mainly for converting tables to pandas.DataFrame","archived":false,"fork":false,"pushed_at":"2019-07-16T13:13:14.000Z","size":5,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-29T08:27:11.238Z","etag":null,"topics":["html-parser","html-table-parser","scraping-websites"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fmilthaler.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-07-12T12:31:38.000Z","updated_at":"2019-07-16T13:13:16.000Z","dependencies_parsed_at":null,"dependency_job_id":"df02c234-b464-411d-a8d9-79bf87c02ea8","html_url":"https://github.com/fmilthaler/HTMLParser","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fmilthaler%2FHTMLParser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fmilthaler%2FHTMLParser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fmilthaler%2FHTMLParser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fmilthaler%2FHTMLParser/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fmilthaler","download_url":"https://codeload.github.com/fmilthaler/HTMLParser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245196086,"owners_count":20575961,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["html-parser","html-table-parser","scraping-websites"],"created_at":"2024-10-14T17:57:30.423Z","updated_at":"2025-03-24T02:21:10.295Z","avatar_url":"https://github.com/fmilthaler.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"left\"\u003e\n  \u003ca href=\"https://www.python.org/download/releases/3.0/\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/python-3+-brightgreen.svg?style=popout\" alt='python'\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/fmilthaler/HTMLParser/blob/master/LICENSE.txt\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/license/fmilthaler/HTMLParser.svg?style=popout\" alt=\"license\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n# HTMLParser\n*HTMLParser* is a class for scraping and parsing a webpage. 
It is especially useful for converting an HTML table to a `pandas.DataFrame`.\n\n## Example\nHere we scrape a page from Wikipedia, parse it for tables, and convert the first table found into a `pandas.DataFrame`.\n\n```\nfrom htmlparser import HTMLParser\nimport pandas\n\n# URL of the Wikipedia page to scrape\nurl = \"https://en.wikipedia.org/wiki/List_of_S%26P_500_companies\"\nhp = HTMLParser(url)\n# scrape the webpage\npage = hp.scrap_url()\n# extract only the tables from the webpage\nelement = 'table'\nparams = {'class': 'wikitable sortable'}\nelements = hp.get_page_elements(page, element=element, params=params)\n# get a pandas.DataFrame from the first HTML table\ndf = hp.parse_html_table(elements[0])\nprint(df.columns.values)\n```\n\nThis prints the column headers of the table:\n\n```\n['Symbol' 'Security' 'SEC filings' 'GICS Sector' 'GICS Sub Industry'\n 'Headquarters Location' 'Date first added' 'CIK' 'Founded']\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffmilthaler%2Fhtmlparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffmilthaler%2Fhtmlparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffmilthaler%2Fhtmlparser/lists"}