{"id":20710049,"url":"https://github.com/oxylabs/pandas-read-html-2","last_synced_at":"2026-05-08T05:50:03.619Z","repository":{"id":134336635,"uuid":"555288997","full_name":"oxylabs/pandas-read-html-2","owner":"oxylabs","description":"Learn how to use pandas to read HTMLs: Volume 2","archived":false,"fork":false,"pushed_at":"2025-02-11T12:57:27.000Z","size":260,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-11T13:44:12.822Z","etag":null,"topics":["github-python","pandas","pandas-python","pandas-read-html","python","python-library","read-html-directory"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oxylabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-21T09:38:39.000Z","updated_at":"2025-02-11T12:57:31.000Z","dependencies_parsed_at":null,"dependency_job_id":"a4ccc0c5-f0e7-4256-8014-2a31422d19d8","html_url":"https://github.com/oxylabs/pandas-read-html-2","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fpandas-read-html-2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fpandas-read-html-2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fpandas-read-html-2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fpandas-read-html-2/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oxylabs","download_url":"https://codeload.github.com/oxylabs/pandas-read-html-2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242980784,"owners_count":20216285,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["github-python","pandas","pandas-python","pandas-read-html","python","python-library","read-html-directory"],"created_at":"2024-11-17T02:09:40.207Z","updated_at":"2025-12-24T05:30:44.632Z","avatar_url":"https://github.com/oxylabs.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# How to Read HTML Tables With Pandas\n\n[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877\u0026utm_medium=affiliate\u0026groupid=877\u0026utm_content=pandas-read-html-2-github\u0026transaction_id=102f49063ab94276ae8f116d224b67)\n\n[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge\u0026theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge\u0026logo=youtube\u0026logoColor=white)](https://www.youtube.com/@oxylabs)\n\n\n\u003ca href=\"https://github.com/topics/pandas\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=\u0026amp;message=pandas\u0026amp;color=brightgreen\" style=\"max-width: 100%;\"\u003e\u003c/a\u003e \u003ca 
href=\"https://github.com/topics/web-scraping\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=\u0026amp;message=Web%20Scraping\u0026amp;color=important\" style=\"max-width: 100%;\"\u003e\u003c/a\u003e\n\n- [How to Read HTML Tables With Pandas](#how-to-read-html-tables-with-pandas)\n- [Install Pandas](#install-pandas)\n- [Web scraping with Pandas](#web-scraping-with-pandas)\n  - [Importing Pandas](#importing-pandas)\n  - [Reading Tables from Webpage](#reading-tables-from-webpage)\n  - [Preview Results](#preview-results)\n  - [Parsing Dates](#parsing-dates)\n  - [Locating Specific Tables](#locating-specific-tables)\n  - [Scraping a Specific Column](#scraping-a-specific-column)\n  - [Skipping a Row](#skipping-a-row)\n- [Saving Data to CSV](#saving-data-to-csv)\n\n\nThe pandas library is built for handling tabular data, that is, data organized in rows and columns. A common task is scraping HTML tables from web pages.\n\nThis guide demonstrates how to read HTML tables with pandas in a few simple steps.\n\n# Install Pandas\n\nTo install pandas, we recommend that you use Anaconda. Alternatively, you can install pandas and Jupyter Notebook with pip. Note that `read_html` also needs an HTML parser: `lxml` by default, or `beautifulsoup4` together with `html5lib` for the `bs4` flavor. Install everything as follows:\n\n```shell\npip install pandas\npip install notebook\npip install lxml beautifulsoup4 html5lib\n```\n\n# Web scraping with Pandas\n\n## Importing Pandas\n\n```python\nimport pandas as pd\n```\n\n## Reading Tables from Webpage\n\nUse the `read_html` function to parse tables from a webpage. 
This function returns a `list` of `DataFrame` objects, one per table found on the page.\n\n```python\nurl = 'https://en.wikipedia.org/wiki/List_of_wealthiest_Americans_by_net_worth'\ndfs = pd.read_html(url)\ndf = dfs[0]\n```\n\nYou can use a different parser, such as Beautiful Soup, by setting `flavor='bs4'`:\n\n```python\ndfs = pd.read_html(url, flavor='bs4')\n```\n\n## Preview Results\n\n```python\ndf.head()\n```\n\n![Dataframe](images/df_preview.png)\n\n## Parsing Dates\n\nIn this example, the date of birth column also contains the age in parentheses, which needs to be removed:\n\n```python\n# Strip the parenthesized part of each value\ndf['Date of birth(age)'] = df['Date of birth(age)'].str.replace(r'\\(.*\\)', '', regex=True)\n```\n\nNext, convert the column from the `object` datatype to `datetime64` as follows:\n\n```python\ndf['Date of birth(age)'] = pd.to_datetime(df['Date of birth(age)'])\n```\n\n## Locating Specific Tables\n\nYou can use the `match` parameter to return only the tables that contain the desired text. The value is treated as a regular expression, so literal parentheses must be escaped:\n\n```python\nurl = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'\ndfs = pd.read_html(url, flavor='bs4', match=r'Source\\(s\\) of wealth')\n```\n\n\n\n## Scraping a Specific Column\n\n`read_html` returns the entire table as a data frame. 
To get a specific column, select it by name. Double brackets return a one-column data frame:\n\n```python\ndf[['Name']]\n```\n\n![scraping one column](images/one_column.png)\n\n\n\n## Skipping a Row\n\nSee the following example:\n\n```python\nurl = 'https://en.wikipedia.org/wiki/Billionaire'\ndfs = pd.read_html(url, flavor='bs4', match='known billionaires')\ndf = dfs[0]\n```\n\n![skipping rows](images/skip_rows.png)\n\nUsually, if you want to skip rows, you can use the `skiprows` parameter:\n\n```python\ndfs = pd.read_html(url, skiprows=1)\n```\n\nIn this case, the extra row is part of a multi-level header, so we have to remove one header level instead:\n\n```python\n# droplevel returns a new data frame, so assign the result\ndf = df.droplevel(0, axis=1)\n```\n\n# Saving Data to CSV\n\nUse the `to_csv` method of the data frame object:\n\n```python\ndf.to_csv('file_name.csv', index=False)\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fpandas-read-html-2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foxylabs%2Fpandas-read-html-2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fpandas-read-html-2/lists"}