Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/yuanxu-li/html-table-extractor
extract data from html table
beautifulsoup crawler extract-data html html-table scraping table
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/yuanxu-li/html-table-extractor
- Owner: yuanxu-li
- License: mit
- Created: 2017-04-10T22:04:42.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2020-05-01T18:40:12.000Z (over 4 years ago)
- Last Synced: 2024-04-24T01:20:25.318Z (9 months ago)
- Topics: beautifulsoup, crawler, extract-data, html, html-table, scraping, table
- Language: Python
- Size: 31.3 KB
- Stars: 84
- Watchers: 3
- Forks: 23
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
# HTML Table Extractor
[![Build Status](https://travis-ci.org/yuanxu-li/html-table-extractor.svg?branch=master)](https://travis-ci.org/yuanxu-li/html-table-extractor)

_HTML Table Extractor is a Python library that uses [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) to extract data from complicated and messy HTML tables._
## Important links
* Repository: https://github.com/yuanxu-li/html-table-extractor
* Issues: https://github.com/yuanxu-li/html-table-extractor/issues
## Installation
```bash
pip install 'beautifulsoup4==4.5.3'
pip install html-table-extractor
```
## Usage
### Example 1 - Simple
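```python
from html_table_extractor.extractor import Extractor
# A plain 2x2 table passed as an HTML string
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
```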
`return_list()` returns the table as a list of rows (the `u` prefix is the Python 2 repr of a text string):
```python
[[u'1', u'2'], [u'3', u'4']]
```
### Example 2 - Transformer
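```python
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
# transformer is applied to the text of every cell
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()
```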
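
Any callable that accepts the cell text should work as a transformer. The sketch below shows one possible custom transformer; `to_number` is a hypothetical helper, not part of the library, and stripping commas assumes they are thousands separators.

```python
def to_number(text):
    """Convert cell text to int or float when possible, else return the stripped text."""
    # Assumption: commas are thousands separators, so "1,200" means 1200.
    cleaned = text.strip().replace(",", "")
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    return text.strip()

# Made-up cell texts of the kind a messy table might contain.
samples = [" 42 ", "3.5", "1,200", "n/a"]
converted = [to_number(s) for s in samples]
print(converted)  # [42, 3.5, 1200, 'n/a']
```

It could then be passed in the same way as `int`, for example `Extractor(table_doc, transformer=to_number)`.
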
With `transformer=int`, each cell is converted to an integer, so `return_list()` returns:
```python
[[1, 2], [3, 4]]
```
### Example 3 - Pass BS4 Tag
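```python
from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup
table_doc = """
<html><table id='wanted'><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table><tr><td>not wanted</td></tr></table></html>
"""
soup = BeautifulSoup(table_doc, 'html.parser')
# id_ selects the table to extract from the parsed document
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()
```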
`return_list()` returns only the rows of the `wanted` table:
```python
[[u'1', u'2'], [u'3', u'4']]
```
### Example 4 - Complex
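```python
from html_table_extractor.extractor import Extractor
# Cell 1 spans two rows, cell 4 two columns, cell 5 three columns
table_doc = """
<table>
  <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <td colspan=2>4</td>
  </tr>
  <tr>
    <td colspan=3>5</td>
  </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
```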
`return_list()` returns a grid in which each spanned cell is repeated across the rows and columns it covers:
```python
[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]
```
### Example 5 - Conflicted
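```python
from html_table_extractor.extractor import Extractor
# Cell 3 spans three rows and overlaps the column spans of cells 4 and 5
table_doc = """
<table>
  <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td rowspan=3>3</td>
  </tr>
  <tr>
    <td colspan=2>4</td>
  </tr>
  <tr>
    <td colspan=3>5</td>
  </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
```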
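
How overlapping spans can be flattened into a grid may be easier to see in a standalone sketch that uses only the standard library. This is not the library's code: `SpanTableParser` is a hypothetical name, it advances one grid row per `<tr>`, and it assumes the rule suggested by the outputs in Examples 4 and 5, namely that a position already claimed by an earlier cell is never overwritten. The table markup is an assumed example with a `rowspan` that collides with two `colspan` cells.

```python
from html.parser import HTMLParser

class SpanTableParser(HTMLParser):
    """Flatten a table with rowspan/colspan into a rectangular grid (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.grid = {}    # (row, col) -> cell text
        self.row = -1     # index of the current <tr>
        self.col = 0      # next candidate column in the current row
        self.cell = None  # (rowspan, colspan, text parts) while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("td", "th"):
            a = dict(attrs)
            self.cell = (int(a.get("rowspan") or 1), int(a.get("colspan") or 1), [])

    def handle_data(self, data):
        if self.cell is not None:
            self.cell[2].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            rowspan, colspan, parts = self.cell
            self.cell = None
            # Skip positions already filled by a rowspan from an earlier row.
            while (self.row, self.col) in self.grid:
                self.col += 1
            # Fill the spanned area; setdefault keeps whatever was placed first.
            for r in range(self.row, self.row + rowspan):
                for c in range(self.col, self.col + colspan):
                    self.grid.setdefault((r, c), "".join(parts))
            self.col += colspan

    def rows(self):
        if not self.grid:
            return []
        n_rows = 1 + max(r for r, _ in self.grid)
        n_cols = 1 + max(c for _, c in self.grid)
        return [[self.grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]

conflicted = """
<table>
  <tr><td rowspan=2>1</td><td>2</td><td rowspan=3>3</td></tr>
  <tr><td colspan=2>4</td></tr>
  <tr><td colspan=3>5</td></tr>
</table>
"""
parser = SpanTableParser()
parser.feed(conflicted)
result = parser.rows()
print(result)  # [['1', '2', '3'], ['1', '4', '3'], ['5', '5', '3']]
```
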
Where spans overlap, the cell placed first keeps its position, so `return_list()` returns:
```python
[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]
```
### Example 6 - Write to file
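```python
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
# parse() returns the extractor, so the calls can be chained
extractor = Extractor(table_doc).parse()
extractor.write_to_csv(path='.')
```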
This creates a new CSV file named `output.csv` in the given directory:
```
1,2
3,4
```
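
To check the result, the file can be read back with Python's standard `csv` module. The following is a minimal sketch: it writes a stand-in `output.csv` with the contents shown above into a temporary directory, so that it runs without the library installed, and then reads it back. Note that `csv.reader` yields every field as a string.

```python
import csv
import os
import tempfile

# Stand-in for the file written in Example 6; created by hand here so the
# snippet is self-contained.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "output.csv")
with open(csv_path, "w", newline="") as handle:
    csv.writer(handle).writerows([["1", "2"], ["3", "4"]])

# Read the rows back; every field comes back as a string.
with open(csv_path, newline="") as handle:
    rows = [row for row in csv.reader(handle)]

# Convert to integers if numeric values are needed downstream.
int_rows = [[int(cell) for cell in row] for row in rows]
print(rows)      # [['1', '2'], ['3', '4']]
print(int_rows)  # [[1, 2], [3, 4]]
```
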
## Team
* [@yuanxu-li](https://github.com/yuanxu-li)
## Errors/Bugs
If something is not working correctly, or if you have any suggestions for improvement, [report it here](https://github.com/yuanxu-li/html-table-extractor/issues).
## Copyright
Copyright (c) 2017 Justin Li. Released under the [MIT License](https://github.com/yuanxu-li/html-table-extractor/blob/master/LICENSE).
Third-party copyright in this distribution is noted where applicable.
## Misc
How to upload the package to PyPI (for the owner's reference):
- `python setup.py bdist_wheel --universal`
- `twine upload dist/* --verbose`