https://github.com/scrapinghub/mdr
A Python library to detect and extract listing data from HTML pages.
- Host: GitHub
- URL: https://github.com/scrapinghub/mdr
- Owner: scrapinghub
- Created: 2014-06-13T12:50:22.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2017-05-05T18:38:48.000Z (over 7 years ago)
- Last Synced: 2024-09-22T22:18:34.816Z (3 months ago)
- Topics: data-science
- Language: C
- Size: 434 KB
- Stars: 109
- Watchers: 25
- Forks: 29
- Open Issues: 6
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.txt
===
MDR
===

.. image:: https://travis-ci.org/scrapinghub/mdr.svg?branch=master
   :target: https://travis-ci.org/scrapinghub/mdr

MDR is a library to detect and extract listing data from HTML pages. It is based on *Finding and Extracting Data Records from Web Pages*, but replaces the similarity measure with the tree alignment proposed in *Web Data Extraction Based on Partial Tree Alignment* and *Automatic Wrapper Adaptation by Tree Edit Distance Matching*.

Requires
========

``numpy`` and ``scipy`` must be installed to build this package.

Usage
=====

Detect listing data
~~~~~~~~~~~~~~~~~~~

MDR assumes that the data records are located near the elements that contain the most text nodes::

    import requests
    from mdr import MDR

    mdr = MDR()
    r = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london')

    # list_candidates() returns the candidate elements and the parsed document
    candidates, doc = mdr.list_candidates(r.text.encode('utf8'))

    ...

    # XPaths of the first ten candidates
    [doc.getpath(c) for c in candidates[:10]]

    # Result:
    ['/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[2]/ul',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]',
     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[1]/div/div[2]/div[1]/div[1]/div',
     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[2]/div/div[3]',
     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[2]/div/div/ul',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]/div[1]/table/tbody',
     '/html/body/div[2]',
     '/html/body/div[2]/div[4]/div/div[1]']

Extract data record
~~~~~~~~~~~~~~~~~~~

MDR finds the repeated patterns under a given candidate DOM subtree by tree matching, then builds a mapping from each HTML element to the other matched elements of that subtree.

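To make the idea of tree matching concrete, here is a minimal sketch of Simple Tree Matching, the dynamic-programming comparison used in the partial tree alignment literature. It is not MDR's own code (which is implemented in C). The function names ``tree_size``, ``simple_tree_matching`` and ``similarity`` are made up for this illustration, and the sketch uses the standard-library ``xml.etree.ElementTree`` in place of a real HTML parser.

```python
import xml.etree.ElementTree as ET


def tree_size(node):
    """Number of elements in the subtree rooted at node."""
    return 1 + sum(tree_size(child) for child in node)


def simple_tree_matching(a, b):
    """Size of the largest order-preserving match between two trees."""
    if a.tag != b.tag:
        return 0
    kids_a, kids_b = list(a), list(b)
    # m[i][j]: best match between the first i children of a
    # and the first j children of b
    m = [[0] * (len(kids_b) + 1) for _ in range(len(kids_a) + 1)]
    for i, child_a in enumerate(kids_a, 1):
        for j, child_b in enumerate(kids_b, 1):
            m[i][j] = max(
                m[i - 1][j],
                m[i][j - 1],
                m[i - 1][j - 1] + simple_tree_matching(child_a, child_b),
            )
    return m[-1][-1] + 1  # +1 for the matched roots


def similarity(a, b):
    """Match size normalised by the average size of the two trees."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2.0)


# Two made-up list items that differ by one extra child element
row1 = ET.fromstring("<li><a>Name</a><span>Price</span></li>")
row2 = ET.fromstring("<li><a>Other</a><span>9.99</span><em>new</em></li>")
print(similarity(row1, row1))
print(similarity(row1, row2))
```

Sibling subtrees whose similarity exceeds some threshold can then be treated as instances of the same repeated record. The threshold, and the exact normalisation, are choices of this sketch and not values taken from MDR.
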
Used with annotation (optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can annotate the seed elements with any tool you like (e.g. scrapely_); MDR will then find the other matched elements on the page.

For example, see the demo page here_: the colored data in the first row was annotated manually, and the rest was extracted by MDR.

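The sketch below illustrates how annotations on one seed record can be carried over to the other records. MDR derives this correspondence from its tree-alignment mapping; the sketch substitutes a much simpler positional lookup (same child-index path in every record) so that it stays short. All names (``index_path``, ``follow``, ``propagate``) and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def index_path(root, target):
    """Child-index path from root to target, or None if target is absent."""
    if root is target:
        return []
    for i, child in enumerate(root):
        sub = index_path(child, target)
        if sub is not None:
            return [i] + sub
    return None


def follow(root, path):
    """Walk a child-index path; return None when the structure differs."""
    node = root
    for i in path:
        children = list(node)
        if i >= len(children):
            return None
        node = children[i]
    return node


def propagate(seed_record, annotated, records):
    """Apply the seed annotations to every other record by position."""
    rows = []
    for record in records:
        row = {}
        for field, element in annotated.items():
            match = follow(record, index_path(seed_record, element))
            row[field] = match.text if match is not None else None
        rows.append(row)
    return rows


# Made-up listing: the first item plays the role of the annotated seed
listing = ET.fromstring(
    "<ul>"
    "<li><a>Item A</a><span>4.5</span></li>"
    "<li><a>Item B</a><span>4.0</span></li>"
    "</ul>"
)
records = list(listing)
seed = records[0]
annotated = {"name": seed[0], "rating": seed[1]}  # manual annotation
print(propagate(seed, annotated, records[1:]))
```

A positional lookup breaks as soon as a record has an extra or missing element, which is the case an alignment-based mapping is meant to handle.
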
Author
======

Terry Peng

License
=======

MIT

.. _scrapely: https://github.com/scrapy/scrapely
.. _here: http://ibc.scrapinghub.com/tmp/h.html