{"id":19448057,"url":"https://github.com/scrapinghub/mdr","last_synced_at":"2025-12-14T14:36:53.008Z","repository":{"id":57440475,"uuid":"20803934","full_name":"scrapinghub/mdr","owner":"scrapinghub","description":"A python library detect and extract listing data from HTML page.","archived":false,"fork":false,"pushed_at":"2017-05-05T18:38:48.000Z","size":444,"stargazers_count":108,"open_issues_count":6,"forks_count":29,"subscribers_count":24,"default_branch":"master","last_synced_at":"2025-04-03T15:02:16.039Z","etag":null,"topics":["data-science"],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scrapinghub.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGES.txt","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-06-13T12:50:22.000Z","updated_at":"2025-02-16T12:52:05.000Z","dependencies_parsed_at":"2022-09-26T17:20:38.722Z","dependency_job_id":null,"html_url":"https://github.com/scrapinghub/mdr","commit_stats":null,"previous_names":["tpeng/mdr"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Fmdr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Fmdr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Fmdr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Fmdr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scrapinghub","download_url":"https://codeload.github.com/scrapinghub/mdr/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"githu
b","repositories_count":250741872,"owners_count":21479682,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science"],"created_at":"2024-11-10T16:23:33.718Z","updated_at":"2025-12-14T14:36:47.925Z","avatar_url":"https://github.com/scrapinghub.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"===\nMDR\n===\n\n.. image:: https://travis-ci.org/scrapinghub/mdr.svg?branch=master\n    :target: https://travis-ci.org/scrapinghub/mdr\n\nMDR is a library to detect and extract listing data from HTML pages. It is based on `Finding and Extracting Data Records from Web Pages \u003chttp://dl.acm.org/citation.cfm?id=1743635\u003e`_ but\nchanges the similarity measure to the tree alignment proposed by `Web Data Extraction Based on Partial Tree Alignment \u003chttp://doi.acm.org/10.1145/1060745.1060761\u003e`_ and `Automatic Wrapper Adaptation by Tree Edit Distance Matching \u003chttp://arxiv.org/pdf/1103.1252.pdf\u003e`_.\n\n\nRequires\n========\n\n``numpy`` and ``scipy`` must be installed to build this package.\n\nUsage\n=====\n\nDetect listing data\n~~~~~~~~~~~~~~~~~~~\n\nMDR assumes that the data records are close to the elements with the most text nodes::\n\n    [1]: import requests\n    [2]: from mdr import MDR\n    [3]: mdr = MDR()\n    [4]: r = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london')\n    [5]: candidates, doc = mdr.list_candidates(r.text.encode('utf8'))\n    ...\n\n    [8]: [doc.getpath(c) for c in candidates[:10]]\n     ['/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[2]/ul',\n     
'/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]',\n     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]',\n     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[1]/div/div[2]/div[1]/div[1]/div',\n     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[2]/div/div[3]',\n     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[2]/div/div/ul',\n     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]',\n     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]/div[1]/table/tbody',\n     '/html/body/div[2]',\n     '/html/body/div[2]/div[4]/div/div[1]']\n\nExtract data record\n~~~~~~~~~~~~~~~~~~~\n\nMDR can find the repetition patterns under a given candidate DOM tree by using tree matching, then it builds a mapping from each HTML element to the other matched elements of the DOM tree.\n\nUsed with annotation (optional)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nYou can annotate the seed elements with any tool you like (e.g. scrapely_), then MDR will be able to find the other matched elements on the page.\n\nFor example, see the demo page here_. The colored data in the first row are annotated manually; the rest are extracted by MDR.\n\nAuthor\n======\n\nTerry Peng \u003cpengtaoo@gmail.com\u003e\n\nLicense\n=======\n\nMIT\n\n.. _scrapely: https://github.com/scrapy/scrapely\n.. _here: http://ibc.scrapinghub.com/tmp/h.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapinghub%2Fmdr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrapinghub%2Fmdr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapinghub%2Fmdr/lists"}