https://github.com/scrapinghub/mdr

A Python library to detect and extract listing data from HTML pages.



Host: GitHub
URL: https://github.com/scrapinghub/mdr
Owner: scrapinghub
Created: 2014-06-13T12:50:22.000Z (about 12 years ago)
Default Branch: master
Last Pushed: 2017-05-05T18:38:48.000Z (about 9 years ago)
Last Synced: 2025-04-03T15:02:16.039Z (over 1 year ago)
Topics: data-science
Language: C
Size: 434 KB
Stars: 108
Watchers: 24
Forks: 29
Open Issues: 6
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.txt


===
MDR
===

.. image:: https://travis-ci.org/scrapinghub/mdr.svg?branch=master
    :target: https://travis-ci.org/scrapinghub/mdr

MDR is a library for detecting and extracting listing data from HTML pages. It is based on *Finding and Extracting Data Records from Web Pages*, but replaces the similarity measure with the tree alignment proposed in *Web Data Extraction Based on Partial Tree Alignment* and *Automatic Wrapper Adaptation by Tree Edit Distance Matching*.

Requires
========

``numpy`` and ``scipy`` must be installed to build this package.

Usage
=====

Detect listing data
~~~~~~~~~~~~~~~~~~~

MDR assumes that the data records lie close to the elements that contain the most text nodes::

    [1]: import requests
    [2]: from mdr import MDR
    [3]: mdr = MDR()
    [4]: r = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london')
    [5]: candidates, doc = mdr.list_candidates(r.text.encode('utf8'))
    ...
    [8]: [doc.getpath(c) for c in candidates[:10]]
    ['/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[2]/ul',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]',
     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[1]/div/div[2]/div[1]/div[1]/div',
     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[2]/div/div[3]',
     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[2]/div/div/ul',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]/div[1]/table/tbody',
     '/html/body/div[2]',
     '/html/body/div[2]/div[4]/div/div[1]']

Extract data record
~~~~~~~~~~~~~~~~~~~

MDR finds repetition patterns by applying tree matching within a given candidate DOM tree, then builds a mapping from HTML elements to the other matched elements of the DOM tree.
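
To give a feel for what tree matching means here, the following is a minimal sketch of Simple Tree Matching, a common building block of tree alignment approaches. It is not MDR's own code, and whether MDR scores matches exactly this way is an assumption. The function names (``simple_tree_matching``, ``tree_size``, ``similarity``) and the toy records are hypothetical, and it uses the standard library's ``xml.etree.ElementTree`` rather than MDR's parser.

```python
import xml.etree.ElementTree as ET


def simple_tree_matching(a, b):
    """Return the number of matched node pairs between trees a and b.

    Two nodes can only match if their tags are equal, and the order
    of children must be preserved (no crossing matches).
    """
    if a.tag != b.tag:
        return 0
    rows, cols = len(a), len(b)  # number of direct children
    # table[i][j] = best score using the first i children of a
    # and the first j children of b.
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1] + simple_tree_matching(a[i - 1], b[j - 1]),
            )
    # +1 accounts for the matched roots themselves.
    return table[rows][cols] + 1


def tree_size(node):
    """Count the nodes in the subtree rooted at node."""
    return sum(1 for _ in node.iter())


def similarity(a, b):
    """Normalize the match score by the average size of both trees."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2.0)


# Toy records: two share the same structure, the third differs in one child.
first = ET.fromstring("<li><a>Item A</a><span>text</span></li>")
second = ET.fromstring("<li><a>Item B</a><span>text</span></li>")
third = ET.fromstring("<li><a>Item C</a><p>text</p></li>")

print(similarity(first, second))
print(similarity(first, third))
```

Sibling subtrees whose similarity exceeds some threshold can be treated as repeated data records. The threshold, and how matched nodes are turned into an element mapping, are left out of this sketch.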

Used with annotation (optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can annotate the seed elements with any tool you like (e.g. scrapely_); MDR can then find the other matching elements on the page.

For example, see the demo page here_: the colored data in the first row was annotated manually, and the rest was extracted by MDR.

Author
======

Terry Peng 

License
=======

MIT

.. _scrapely: https://github.com/scrapy/scrapely

.. _here: http://ibc.scrapinghub.com/tmp/h.html
