https://github.com/datalib/libextract

Extract data from websites using basic statistical magic
https://github.com/datalib/libextract

Last synced: 3 months ago
JSON representation

Extract data from websites using basic statistical magic

Host: GitHub
URL: https://github.com/datalib/libextract
Owner: datalib
License: mit
Created: 2015-04-06T08:27:32.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2020-10-02T11:31:24.000Z (almost 5 years ago)
Last Synced: 2025-03-02T08:50:12.840Z (4 months ago)
Language: Python
Homepage:
Size: 1.06 MB
Stars: 505
Watchers: 24
Forks: 45
Open Issues: 2
Metadata Files:
- Readme: README.rst
- License: LICENSE

Awesome Lists containing this project

starred-awesome - libextract - Extract data from websites using basic statistical magic (Python)

README

        Libextract: extract data from websites

======================================

.. image:: https://travis-ci.org/datalib/libextract.svg?branch=master

    :target: https://travis-ci.org/datalib/libextract

::

        ___ __              __                  __

       / (_) /_  ___  _  __/ /__________ ______/ /_

      / / / __ \/ _ \| |/_/ __/ ___/ __ `/ ___/ __/

     / / / /_/ /  __/>   /_/ /  / /_/ / /__/ /_

    /_/_/_.___/\___/_/|_|\__/_/   \__,_/\___/\__/

Libextract is a `statistics-enabled `_

data extraction library that works on HTML and XML documents and written in 

Python. Originating from `eatiht `_, the

extraction algorithm works by making one simple assumption: *data appear as 

collections of repetitive elements*. You can read about the reasoning 

`here `_. 

Overview

--------

`libextract.api.extract(document, encoding='utf-8', count=5)` 

    Given an html *document*, and optionally the *encoding*, return

    a list of nodes likely containing data (5 by default).

Installation

------------

::

    pip install libextract

Usage

-----

Due to our simple definition of "data", we open up a single

interfaceable method. Post-processing is up to you. 

.. code-block:: python

    from requests import get

    from libextract.api import extract

    r = get('http://en.wikipedia.org/wiki/Information_extraction')

    textnodes = list(extract(r.content))

Using lxml's built-in methods for post-processing:

.. code-block:: python

    >> print(textnodes[0].text_content())

    Information extraction (IE) is the task of automatically extracting structured information...

The extraction algo is agnostic to article text as it is with

tabular data:

.. code-block:: python

    height_data = get("http://en.wikipedia.org/wiki/Human_height")

    tabs = list(extract(height_data.content))

    

.. code-block:: python

    >> [elem.text_content() for elem in tabs[0].iter('th')]

    ['Country/Region',

     'Average male height',

     'Average female height',

     ...]

Dependencies

~~~~~~~~~~~~

::

    lxml

    statscounter

Disclaimer

~~~~~~~~~~

This project is still in its infancy; and advice and suggestions as

to what this library could and should be would be greatly appreciated

:)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/datalib/libextract

Awesome Lists containing this project

README