{"id":20710052,"url":"https://github.com/oxylabs/pandas-read-html-tables","last_synced_at":"2026-03-07T06:32:31.494Z","repository":{"id":134336665,"uuid":"469650772","full_name":"oxylabs/pandas-read-html-tables","owner":"oxylabs","description":"A tutorial on parsing HTML tables with pandas","archived":false,"fork":false,"pushed_at":"2025-06-26T08:23:37.000Z","size":70,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-26T09:31:05.850Z","etag":null,"topics":["github-python","pandas","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oxylabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-03-14T08:52:22.000Z","updated_at":"2025-06-26T08:23:40.000Z","dependencies_parsed_at":null,"dependency_job_id":"9f48a993-e7a2-4797-b459-1fe171492607","html_url":"https://github.com/oxylabs/pandas-read-html-tables","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/oxylabs/pandas-read-html-tables","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fpandas-read-html-tables","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fpandas-read-html-tables/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fpandas-read-html-tables/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fpa
ndas-read-html-tables/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oxylabs","download_url":"https://codeload.github.com/oxylabs/pandas-read-html-tables/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fpandas-read-html-tables/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30209086,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-07T05:23:27.321Z","status":"ssl_error","status_checked_at":"2026-03-07T05:00:17.256Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["github-python","pandas","python"],"created_at":"2024-11-17T02:09:40.419Z","updated_at":"2026-03-07T06:32:31.480Z","avatar_url":"https://github.com/oxylabs.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# How to Read HTML Tables with Pandas\n\n[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877\u0026utm_medium=affiliate\u0026groupid=877\u0026utm_content=pandas-read-html-tables-github\u0026transaction_id=102f49063ab94276ae8f116d224b67)\n\n\n[\u003cimg src=\"https://img.shields.io/static/v1?label=\u0026message=Pandas\u0026color=brightgreen\" /\u003e](https://github.com/topics/pandas) [\u003cimg 
src=\"https://img.shields.io/static/v1?label=\u0026message=Python\u0026color=important\" /\u003e](https://github.com/topics/python)\n\n\n[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge\u0026theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge\u0026logo=youtube\u0026logoColor=white)](https://www.youtube.com/@oxylabs)\n\n- [Getting Started](#getting-started)\n- [Cleanup and processing of HTML table data](#cleanup-and-processing-of-html-table-data)\n- [Invalid and imperfect HTML](#invalid-and-imperfect-html)\n- [Extracting HTML tables from files](#extracting-html-tables-from-files)\n- [Extracting HTML tables from URLs](#extracting-html-tables-from-urls)\n- [Analyzing and visualizing scraped data](#analyzing-and-visualizing-scraped-data)\n\n[Pandas](https://pandas.pydata.org/) is one of the most popular Python libraries for data analysis. Among its many useful functions is `read_html`, which converts HTML tables into pandas DataFrames. 
\n\nThis tutorial shows how useful `read_html` can be, especially when combined with other pandas functions.\n\nFor a detailed explanation, see our [blog post](https://oxy.yt/hrFW).\n\n## Getting Started\n\nPandas can be installed with the `pip` command, or with the `conda` command if you’re using Anaconda.\n\n```shell\npip3 install pandas\nconda install pandas\n```\n\nYou also need the `lxml`, `html5lib`, `BeautifulSoup4`, and `Matplotlib` libraries, which handle reading and parsing the HTML and plotting the results.\n\nHere are the `pip` commands to install them:\n\n```shell\npip3 install lxml\npip3 install html5lib\npip3 install BeautifulSoup4\npip3 install matplotlib\n```\n\nIf you are using the `conda` prompt, use the following commands:\n\n```shell\nconda install lxml\nconda install html5lib\nconda install BeautifulSoup4\nconda install matplotlib\n```\n\nThe following code stores an HTML table in a variable. Note the use of Python’s triple-quoted string, which makes it easy to store multiline text.\n\n```python\nhtml = '''\n\u003ctable\u003e\n    \u003cthead\u003e\n        \u003ctr\u003e\n            \u003cth\u003eSequence\u003c/th\u003e\n            \u003cth\u003eCountry\u003c/th\u003e\n            \u003cth\u003ePopulation\u003c/th\u003e\n            \u003cth\u003eUpdated\u003c/th\u003e\n        \u003c/tr\u003e\n    \u003c/thead\u003e\n    \u003ctbody\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e1\u003c/td\u003e\n            \u003ctd\u003eChina\u003c/td\u003e\n            \u003ctd\u003e1,439,323,776\u003c/td\u003e\n            \u003ctd\u003e1-Dec-2020\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e2\u003c/td\u003e\n            \u003ctd\u003eIndia\u003c/td\u003e\n            \u003ctd\u003e1,380,004,385\u003c/td\u003e\n            \u003ctd\u003e1-Dec-2020\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            
\u003ctd\u003e3\u003c/td\u003e\n            \u003ctd\u003eUnited States\u003c/td\u003e\n            \u003ctd\u003e331,002,651\u003c/td\u003e\n            \u003ctd\u003e1-Dec-2020\u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/tbody\u003e\n\u003c/table\u003e'''\n```\n\nThe next step is to import pandas and call `read_html`:\n\n```python\nfrom io import StringIO\nimport pandas as pd\n\n# Wrap the HTML in StringIO; recent pandas versions deprecate passing a literal HTML string\ndf_list = pd.read_html(StringIO(html))\n```\n\nNote that `read_html` returns a list of pandas `DataFrame` objects, one per table found. This can be verified by checking the length of the `df_list` variable:\n\n```python\nprint(len(df_list))\n# OUTPUT: 1\n```\n\nLet’s check the content of the DataFrame by printing it.\n\n```python\nprint(df_list[0])\n```\n\nWhen you run the script from the terminal, the table is extracted and displayed as follows:\n\n```shell\n$ python3 read_html.py\n   Sequence        Country  Population     Updated\n0         1          China  1439323776  1-Dec-2020\n1         2          India  1380004385  1-Dec-2020\n2         3  United States   331002651  1-Dec-2020\n```\n\nIn a Jupyter Notebook, the same DataFrame is rendered as a formatted table.\n\n![pandas DataFrame created from string](https://images.prismic.io/oxylabs-sm/MDAzMjY5ZTctMTJjMy00ZWQwLWE5YWQtMzQ5MzU0NTFhODQw_image1_df_from_string.png?auto=compress,format\u0026rect=0,0,1392,296\u0026w=1392\u0026h=296\u0026fm=webp\u0026q=75)\n\n## Cleanup and processing of HTML table data\n\nThe default numeric index can be replaced with the `Sequence` column by calling the DataFrame’s `set_index()` method:\n\n```python\npopulation = df_list[0].set_index('Sequence')\n```\n\nHere is how the new DataFrame looks in Jupyter Notebook:\n\n![pandas DataFrame after updating index column](https://images.prismic.io/oxylabs-sm/N2RkYjY1ZDAtMjBiNS00NTBjLWE5YjctNDg5Y2VmOWIyMThm_image2_fixed_index_column.png?auto=compress,format\u0026rect=0,0,1392,351\u0026w=1392\u0026h=351\u0026fm=webp\u0026q=75)\n\nThe data 
types can be checked by calling the `info()` method of the DataFrame as follows:\n\n```python\npopulation.info()\n```\n\nThe output will be as follows:\n\n```shell\n\u003cclass 'pandas.core.frame.DataFrame'\u003e\nInt64Index: 3 entries, 1 to 3\nData columns (total 3 columns):\n #   Column      Non-Null Count  Dtype \n---  ------      --------------  ----- \n 0   Country     3 non-null      object\n 1   Population  3 non-null      int64 \n 2   Updated     3 non-null      object\ndtypes: int64(1), object(2)\n```\n\nNote that the `Dtype` of the `Updated` column is `object`, which means `read_html` did not recognize this column as a date.\n\nThere are multiple ways to fix this. The easiest is to pass one more parameter to `read_html`, namely `parse_dates`:\n\n```python\nfrom io import StringIO\n# Select the date column by position...\npd.read_html(StringIO(html), parse_dates=[3])\n# ...or by name\npd.read_html(StringIO(html), parse_dates=['Updated'])\n```\n\nThis time, if the `.info()` method is called, the DataFrame will have the correct data types:\n\n![DataFrame with date-time data type](https://images.prismic.io/oxylabs-sm/OWFlMDIxNmYtMzRlYy00M2YyLWI5YzAtOTk4MjJlZTUyYTBl_image3_fixed_dates.png?auto=compress,format\u0026rect=0,0,1390,715\u0026w=1390\u0026h=715\u0026fm=webp\u0026q=75)\n\n## Invalid and imperfect HTML\n\nThe HTML that we used in the previous example is valid. 
In practice, tables are not always marked up this cleanly. If the headings are written with regular `\u003ctr\u003e` and `\u003ctd\u003e` tags, the DataFrame will be created with default numeric column names and the headings will appear as the first data row.\n\n![Column's headings as rows](https://images.prismic.io/oxylabs-sm/Y2FhNTNkMWQtMTJkYy00NWU4LTkwMGEtYTdiZjcwYmM2YmNm_image6_invalid_html.png?auto=compress,format\u0026rect=0,0,1386,327\u0026w=1386\u0026h=327\u0026fm=webp\u0026q=75)\n\nIn such cases, you can tell `read_html` which row holds the column names with the optional `header` parameter. Here, `html_no_head` is assumed to be a string containing such a table:\n\n```python\nfrom io import StringIO\npd.read_html(StringIO(html_no_head), header=0)\n```\n\n## Extracting HTML tables from files\n\nExtracting data from HTML tables stored in HTML files is almost the same as reading from strings.\n\nInstead of the HTML string, `read_html` takes the file path, relative or absolute. Assuming the population table is saved as **population.html** in the **/tmp** folder, we can read it as follows:\n\n```python\npopulation_file = pd.read_html(\"/tmp/population.html\", parse_dates=['Updated'], index_col=0)\npopulation_file[0]\n```\n\n![HTML file converted to DataFrame](https://images.prismic.io/oxylabs-sm/OTNjOGZjNWYtMDg2OS00NjdhLTllYTEtZTE1OTQyODYwMGZi_image5_reading_files.png?auto=compress,format\u0026rect=0,0,1373,309\u0026w=1373\u0026h=309\u0026fm=webp\u0026q=75)\n\n## Extracting HTML tables from URLs\n\n`read_html` can also fetch a web page by URL and read its HTML tables directly, which makes it useful for [Python web scraping](https://oxylabs.io/blog/python-web-scraping).\n\nThe first step is to extract the list of tables using `read_html`. Next, we’ll check how many tables were returned.\n\n```python\nimport pandas as pd\nlist_of_df = pd.read_html(\"https://en.wikipedia.org/w/index.php?title=Science_Fiction:_The_100_Best_Novels\u0026oldid=1091082777\")\nlen(list_of_df)\n# OUTPUT: 7\n```\n\nThere are several ways to get to the exact table you need. 
\n\nOne approach is to match text inside the table with a string or regular expression. First, identify a pattern inside the `\u003ctable\u003e` that you want to scrape: open the URL in a browser, right-click the table, and click Inspect.\n\n![HTML table markup](https://images.prismic.io/oxylabs-sm/YTU2NDg4ZDYtY2RlYS00M2I5LWI4MjMtNTU2MGRjYWY1ZjZj_image7_wikipedia.png?auto=compress,format\u0026rect=0,0,1300,466\u0026w=1300\u0026h=466\u0026fm=webp\u0026q=75)\n\nThe pattern used here is the text `The 100 Best Novels`. Pass it to the optional `match` parameter of `read_html`, which accepts a string or a regular expression and returns only the tables containing matching text.\n\n```python\nimport pandas as pd\nlist_of_df = pd.read_html(\"https://en.wikipedia.org/w/index.php?title=Science_Fiction:_The_100_Best_Novels\u0026oldid=1091082777\", match='The 100 Best Novels')\nlen(list_of_df)\n# OUTPUT: 1\n```\n\nAnother way to select the required table is to filter by its HTML attributes with the `attrs` parameter:\n\n```python\npd.read_html(\"https://en.wikipedia.org/w/index.php?title=Science_Fiction:_The_100_Best_Novels\u0026oldid=1091082777\", attrs={'class': 'wikitable'})\n```\n\n## Analyzing and visualizing scraped data\n\nLet’s find the author with the most books in this Top 100 list:\n\n```python\ndf = list_of_df[0]\ndf.value_counts(subset=['Author'])\n```\n\nThis will print the following pandas Series:\n\n```python\nAuthor               \nPhilip K. Dick           6\nJ. G. Ballard            4\nRobert A. Heinlein       3\nBrian Aldiss             3\nThomas M. Disch          3\n                        ..\n```\n\nThis shows that Philip K. Dick wrote 6 of these 100 best books. 
The same information can also be plotted as a chart. First, convert the counts into a DataFrame with a named `BookCount` column:\n\n```python\ndf = df.value_counts(subset=['Author']).reset_index(name='BookCount')\n```\n\nNext, keep only the authors with 3 or more books in the Top 100:\n\n```python\ntop_df = df[df['BookCount'] \u003e= 3]\nprint(top_df)\n```\n\nThe output will be the following DataFrame:\n\n```python\n               Author  BookCount\n0      Philip K. Dick          6\n1       J. G. Ballard          4\n2  Robert A. Heinlein          3\n3        Brian Aldiss          3\n4     Thomas M. Disch          3\n```\n\nFinally, this data can be plotted as a horizontal bar chart:\n\n```python\ntop_df.plot.barh(x='Author', y='BookCount', figsize=(12, 5))\n```\n\n![Authors with three or more books in Top 100 list](https://images.prismic.io/oxylabs-sm/NjRhOWRjMzctNjczOS00ZjlkLWJkOTAtODI0MGUyYWQxOWE4_image8_author_chart.png?auto=compress,format\u0026rect=0,0,1390,532\u0026w=1390\u0026h=532\u0026fm=webp\u0026q=75)\n\nTo find out more about reading HTML tables with pandas, see our [blog post](https://oxy.yt/hrFW).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fpandas-read-html-tables","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foxylabs%2Fpandas-read-html-tables","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fpandas-read-html-tables/lists"}