https://github.com/yuanxu-li/html-table-extractor

extract data from html table
Host: GitHub
URL: https://github.com/yuanxu-li/html-table-extractor
Owner: yuanxu-li
License: mit
Created: 2017-04-10T22:04:42.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2020-05-01T18:40:12.000Z (about 6 years ago)
Last Synced: 2024-04-24T01:20:25.318Z (over 2 years ago)
Topics: beautifulsoup, crawler, extract-data, html, html-table, scraping, table
Language: Python
Size: 31.3 KB
Stars: 84
Watchers: 3
Forks: 23
Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE

# HTML Table Extractor

[![Build Status](https://travis-ci.org/yuanxu-li/html-table-extractor.svg?branch=master)](https://travis-ci.org/yuanxu-li/html-table-extractor)

_HTML Table Extractor is a Python library that uses [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) to extract data from complicated and messy HTML tables._

## Important links

* Repository: https://github.com/yuanxu-li/html-table-extractor

* Issues: https://github.com/yuanxu-li/html-table-extractor/issues

## Installation

```bash
pip install 'beautifulsoup4==4.5.3'
pip install html-table-extractor
```

## Usage

### Example 1 - Simple

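<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>

```python
from html_table_extractor.extractor import Extractor

table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""

extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
```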

`return_list()` returns the following (on Python 3 the same strings display without the `u` prefix):

```python
[[u'1', u'2'], [u'3', u'4']]
```

### Example 2 - Transformer

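<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>

```python
from html_table_extractor.extractor import Extractor

table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""

# The transformer converts each cell's text, here to an integer.
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()
```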

`return_list()` returns:

```python
[[1, 2], [3, 4]]
```
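
The `transformer` argument is shown above with the built-in `int`. The following is a minimal sketch of a custom callable, assuming (as `transformer=int` suggests) that the extractor calls the transformer once per cell with that cell's text. The name `to_number` is hypothetical and not part of the library.

```python
def to_number(text):
    """Convert cell text to int or float when possible, else return the stripped text.

    Hypothetical helper for illustration; not part of html-table-extractor.
    """
    cleaned = text.strip().replace(",", "")
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    return text.strip()


# Apply the helper to a few sample cell strings.
samples = [" 42 ", "1,234", "3.5", "n/a"]
converted = [to_number(s) for s in samples]
print(converted)
```

Under that assumption it could be passed as `Extractor(table_doc, transformer=to_number)`, so that non-numeric cells are kept as text instead of raising an error.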

### Example 3 - Pass BS4 Tag

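<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>

```python
from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup

table_doc = """
<html><table id="wanted"><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table><tr><td>not wanted</td></tr></table></html>
"""

soup = BeautifulSoup(table_doc, 'html.parser')

# id_ selects the table to extract when the document contains more than one.
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()
```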

`return_list()` returns:

```python
[[u'1', u'2'], [u'3', u'4']]
```

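### Example 4 - Complex

<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=3>5</td>
    </tr>
</table>

```python
from html_table_extractor.extractor import Extractor

table_doc = """
<table>
  <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <td colspan=2>4</td>
  </tr>
  <tr>
    <td colspan=3>5</td>
  </tr>
</table>
"""

extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
```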

Cells that span several rows or columns are repeated in every position they cover, so `return_list()` returns:

```python
[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]
```

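### Example 5 - Conflicted

<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td rowspan=3>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=2>5</td>
    </tr>
</table>

```python
from html_table_extractor.extractor import Extractor

table_doc = """
<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td rowspan=3>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=2>5</td>
    </tr>
</table>
"""

extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()
```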

Where two spans claim the same position, the value placed first is kept, so `return_list()` returns:

```python
[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]
```
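
The outputs of Examples 4 and 5 can be read as filling a grid: each cell is copied into every position its row and column spans cover, and a position that is already filled keeps its earlier value. The sketch below reproduces both outputs under that reading. It is not the library's implementation; `expand_spans` and the `(text, rowspan, colspan)` tuple format are hypothetical, and the span values are assumptions chosen to match the two outputs shown above.

```python
def expand_spans(rows):
    """Expand (text, rowspan, colspan) cells into a rectangular grid.

    Hypothetical helper for illustration; not part of html-table-extractor.
    A position that is already filled keeps its earlier value.
    """
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a span from an earlier cell.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    # setdefault leaves an existing value untouched.
                    grid.setdefault((r + dr, c + dc), text)
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# Assumed spans for Example 4: "1" covers two rows, "4" two columns, "5" three.
complex_rows = [
    [("1", 2, 1), ("2", 1, 1), ("3", 1, 1)],
    [("4", 1, 2)],
    [("5", 1, 3)],
]
# Assumed spans for Example 5: "3" covers three rows, so the colspan of "4"
# runs into a position that "3" already occupies.
conflicted_rows = [
    [("1", 2, 1), ("2", 1, 1), ("3", 3, 1)],
    [("4", 1, 2)],
    [("5", 1, 2)],
]
print(expand_spans(complex_rows))
print(expand_spans(conflicted_rows))
```

A dictionary keyed by `(row, column)` keeps the sketch short: positions can be claimed ahead of the current row without pre-sizing a list of lists.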

### Example 6 - Write to file

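<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>

```python
from html_table_extractor.extractor import Extractor

table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""

extractor = Extractor(table_doc).parse()
extractor.write_to_csv(path='.')
```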

This creates a new CSV file named `output.csv` in the given directory:

```
1,2
3,4
```
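
When a different delimiter or an in-memory result is needed, the list from `return_list()` can also be serialized with Python's standard `csv` module. The following is a minimal sketch that uses a literal list in place of the extractor's result; `rows_to_csv` is a hypothetical helper, not part of the library.

```python
import csv
import io


def rows_to_csv(rows, delimiter=","):
    """Serialize a list of rows to CSV text.

    Hypothetical helper for illustration; not part of html-table-extractor.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [["1", "2"], ["3", "4"]]  # stands in for extractor.return_list()
csv_text = rows_to_csv(rows)
print(csv_text, end="")
```

To write to disk instead, the same `csv.writer` can wrap a file opened with `open(path, "w", newline="")`.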

## Team

* [@yuanxu-li](https://github.com/yuanxu-li)

## Errors/Bugs

If something is not working correctly, or if you have suggestions for improvement, [report it here](https://github.com/yuanxu-li/html-table-extractor/issues).

## Copyright

Copyright (c) 2017 Justin Li. Released under the [MIT License](https://github.com/yuanxu-li/html-table-extractor/blob/master/LICENSE).

Third-party copyright in this distribution is noted where applicable.

## Misc

How to upload the package to PyPI (for the owner's reference):

- `python setup.py bdist_wheel --universal`

- `twine upload dist/* --verbose`
