https://github.com/fmilthaler/htmlparser
Python class to scrape and parse a webpage (using requests, BeautifulSoup4), mainly for converting tables to pandas.DataFrame
- Host: GitHub
- URL: https://github.com/fmilthaler/htmlparser
- Owner: fmilthaler
- License: MIT
- Created: 2019-07-12T12:31:38.000Z
- Default Branch: master
- Last Pushed: 2019-07-16T13:13:14.000Z
- Last Synced: 2025-01-29T08:27:11.238Z
- Topics: html-parser, html-table-parser, scraping-websites
- Language: Python
- Size: 4.88 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# HTMLParser
*HTMLParser* is a class for scraping and parsing a webpage. It is especially useful for converting a table in HTML syntax to a `pandas.DataFrame`.
## Example
Here we scrape a page from Wikipedia, parse it for tables, and convert the first table found into a `pandas.DataFrame`.
```
from htmlparser import HTMLParser
import pandas

url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
hp = HTMLParser(url)
# scrape the webpage
page = hp.scrap_url()
# extract only the matching tables from the webpage
element = 'table'
params = {'class': 'wikitable sortable'}
elements = hp.get_page_elements(page, element=element, params=params)
# convert the first HTML table into a pandas.DataFrame
df = hp.parse_html_table(elements[0])
print(df.columns.values)
```
This results in the following output (column headers):
```
['Symbol' 'Security' 'SEC filings' 'GICS Sector' 'GICS Sub Industry'
'Headquarters Location' 'Date first added' 'CIK' 'Founded']
```