https://github.com/oxylabs/pandas-read-html-2

Learn how to use pandas to read HTMLs: Volume 2
Host: GitHub
URL: https://github.com/oxylabs/pandas-read-html-2
Owner: oxylabs
Created: 2022-10-21T09:38:39.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2025-02-11T12:57:27.000Z (10 months ago)
Last Synced: 2025-02-11T13:44:12.822Z (10 months ago)
Topics: github-python, pandas, pandas-python, pandas-read-html, python, python-library, read-html-directory
Language: Jupyter Notebook
Homepage:
Size: 254 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# How to Read HTML Tables With Pandas

- [How to Read HTML Tables With Pandas](#how-to-read-html-tables-with-pandas)

- [Install Pandas](#install-pandas)

- [Web scraping with Pandas](#web-scraping-with-pandas)

  - [Importing Pandas](#importing-pandas)

  - [Reading Tables from Webpage](#reading-tables-from-webpage)

  - [Preview Results](#preview-results)

  - [Parsing Dates](#parsing-dates)

  - [Locating Specific Tables](#locating-specific-tables)

  - [Scraping a Specific Column](#scraping-a-specific-column)

  - [Skipping a Row](#skipping-a-row)

- [Saving Data to CSV](#saving-data-to-csv)

The pandas library is built for tabular data, that is, data arranged in rows and columns. HTML tables on web pages are a common source of such data.

This guide demonstrates how to read HTML tables with pandas in a few simple steps.

# Install Pandas

To install pandas, we recommend that you use Anaconda. Alternatively, you can install pandas and Jupyter Notebook with pip as follows:

```shell

pip install pandas

pip install notebook

```

# Web scraping with Pandas

## Importing Pandas

```python

import pandas as pd

```



## Reading Tables from Webpage

Use the `read_html` function to parse tables from a webpage. This function returns a `list` of `DataFrame` objects, one per table found on the page:

```python

url = 'https://en.wikipedia.org/wiki/List_of_wealthiest_Americans_by_net_worth'

dfs = pd.read_html(url)

df = dfs[0]

```
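To see what the returned list looks like without depending on a live page, here is a minimal sketch that parses an inline HTML snippet. The HTML content, the placeholder names, and the numbers are made up for illustration; the snippet is wrapped in `StringIO` because recent pandas versions expect a URL, path, or file-like object rather than a raw HTML string.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content with two tables (illustrative data only).
html = """
<table>
  <tr><th>Name</th><th>Net worth</th></tr>
  <tr><td>Person A</td><td>100</td></tr>
  <tr><td>Person B</td><td>90</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Count</th></tr>
  <tr><td>2020</td><td>5</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element it finds.
dfs = pd.read_html(StringIO(html))

# Inspect every table before picking one by index.
for i, table in enumerate(dfs):
    print(i, table.shape, list(table.columns))

df = dfs[0]
```

Listing the shape and column names of each table first makes it easier to pick the right index on pages that contain many tables.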

You can use a different parser, such as BeautifulSoup, by setting `flavor='bs4'`. This flavor requires the `beautifulsoup4` and `html5lib` packages to be installed:

```python

dfs = pd.read_html(url, flavor='bs4')

```
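If you are unsure which parser packages are available in your environment, a small check like the following may help. This is only a sketch: `pick_flavor` is a hypothetical helper name, not part of pandas, and it only looks for the packages that the `lxml` and `bs4` flavors rely on.

```python
from importlib.util import find_spec


def pick_flavor():
    """Return a flavor string usable with read_html, or None if no parser is installed."""
    if find_spec("lxml") is not None:
        return "lxml"
    # The bs4 flavor needs both beautifulsoup4 (imported as bs4) and html5lib.
    if find_spec("bs4") is not None and find_spec("html5lib") is not None:
        return "bs4"
    return None


flavor = pick_flavor()
if flavor is None:
    print("Install lxml, or beautifulsoup4 together with html5lib")
else:
    print(f"read_html can use flavor={flavor!r}")
```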

## Preview Results

```python

df.head()

```

![Dataframe](images/df_preview.png)

## Parsing Dates

In this example, the `Date of birth(age)` column also contains the age in parentheses, which needs to be removed before the dates can be parsed:

```python

df['Date of birth(age)'] = df['Date of birth(age)'].str.replace(r'\(.*\)', '', regex=True)

```

Next, convert this `object` datatype to a `datetime64` datatype as follows:

```python

df['Date of birth(age)'] = pd.to_datetime(df['Date of birth(age)'])

```
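The two steps above can be tried on a small hand-built data frame. The rows below are made-up placeholders that imitate the `Month day, year (age N)` layout; the explicit `format` argument and the `.str.strip()` call are additions in this sketch, on the assumption that every value follows that layout.

```python
import pandas as pd

# Illustrative data only; the real table comes from read_html.
people = pd.DataFrame({
    "Name": ["Person A", "Person B"],
    "Date of birth(age)": ["January 12, 1964 (age 61)", "June 28, 1971 (age 54)"],
})

# Step 1: drop the parenthesised age and any leftover whitespace.
cleaned = (
    people["Date of birth(age)"]
    .str.replace(r"\(.*\)", "", regex=True)
    .str.strip()
)

# Step 2: convert the cleaned strings to datetime64 values.
people["Date of birth(age)"] = pd.to_datetime(cleaned, format="%B %d, %Y")

print(people.dtypes)
```

If some values might not follow the expected layout, passing `errors="coerce"` to `pd.to_datetime` turns them into `NaT` instead of raising an error.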

## Locating Specific Tables

You can use the `match` parameter to find only the tables that contain the desired text. The value is treated as a regular expression, so literal parentheses must be escaped:

```python

url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'

# Raw string so that \( and \) reach the regex engine without invalid-escape warnings
dfs = pd.read_html(url, flavor='bs4', match=r'Source\(s\) of wealth')

```
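As a self-contained illustration of `match`, the sketch below parses a made-up HTML snippet with two tables and keeps only the one whose text matches the pattern. The table contents are placeholders, and the default parser is used instead of `flavor='bs4'`.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content; only the first table mentions "Source(s) of wealth".
html = """
<table>
  <tr><th>Name</th><th>Source(s) of wealth</th></tr>
  <tr><td>Person A</td><td>Software</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Count</th></tr>
  <tr><td>2020</td><td>5</td></tr>
</table>
"""

# Only tables containing text that matches the regex are returned.
matched = pd.read_html(StringIO(html), match=r"Source\(s\) of wealth")

print(len(matched))
print(list(matched[0].columns))
```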

## Scraping a Specific Column

`read_html` returns the entire table as a data frame. To get a specific column, select it by name as follows:

```python

df[['Name']]

```
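The double brackets matter here. The following sketch, using a small made-up data frame, shows the difference between selecting with a list of names and selecting with a single name:

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "Name": ["Person A", "Person B"],
    "Net worth": [100, 90],
})

name_frame = df[["Name"]]   # list of column names -> DataFrame with one column
name_series = df["Name"]    # single column name   -> Series

print(type(name_frame).__name__, name_frame.shape)
print(type(name_series).__name__, name_series.shape)
```

Use the list form when you want to keep a tabular result, for example to write the single column to CSV with its header.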

![scraping one column](images/one_column.png)

## Skipping a Row

Some tables carry an extra header row above the column names, as in the following example:

```python

url = 'https://en.wikipedia.org/wiki/Billionaire'

dfs = pd.read_html(url, flavor='bs4', match='known billionaires')

```

![skipping rows](images/skip_rows.png)

Usually, if you want to skip rows, you can use the `skiprows` parameter:

```python

dfs = pd.read_html(url, skiprows=1)

```

In this case, however, the extra row is part of the header, so pandas parses the columns as a `MultiIndex`. Remove the top header level as follows:

```python

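df = dfs[0]  # assumes the first matching table is the one you want
df = df.droplevel(0, axis=1)  # droplevel returns a new DataFrame, so reassign it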

```
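To see what `droplevel` does to a two-level header, here is a minimal sketch with a hand-built data frame. The header labels and values are placeholders standing in for what `read_html` would produce from a table with two header rows.

```python
import pandas as pd

# Two header levels, as produced when a table has two header rows (illustrative labels).
columns = pd.MultiIndex.from_tuples([
    ("Group header", "Name"),
    ("Group header", "Net worth"),
])
df = pd.DataFrame([["Person A", 100], ["Person B", 90]], columns=columns)

# Drop the outer (level 0) header and keep the inner column names.
flat = df.droplevel(0, axis=1)

print(df.columns.nlevels)
print(list(flat.columns))
```

Because `droplevel` does not modify the data frame in place, the original `df` still has both header levels after the call.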

# Saving Data to CSV

Use the `to_csv` method of the data frame object:

```python

df.to_csv('file_name.csv', index=False)

```
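A quick way to check the result is to read the file back. The sketch below writes a made-up data frame to a temporary directory (an assumption made here so that the example leaves no files behind) and loads it again with `read_csv`:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "Name": ["Person A", "Person B"],
    "Net worth": [100, 90],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "file_name.csv"

    # index=False leaves the row index out of the file.
    df.to_csv(path, index=False)

    header_line = path.read_text().splitlines()[0]
    restored = pd.read_csv(path)

print(header_line)
print(restored)
```

Without `index=False`, the file would gain an extra unnamed first column holding the row index.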
