{"id":13585669,"url":"https://github.com/lorey/mlscraper","last_synced_at":"2025-05-15T11:07:41.885Z","repository":{"id":37461537,"uuid":"283772187","full_name":"lorey/mlscraper","owner":"lorey","description":"🤖 Scrape data from HTML websites automatically by just providing examples","archived":false,"fork":false,"pushed_at":"2024-03-17T08:12:02.000Z","size":463,"stargazers_count":1355,"open_issues_count":16,"forks_count":90,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-05-14T11:11:04.552Z","etag":null,"topics":["crawler","crawler-python","crawling","extraction-engine","html","machine-learning","scraper","scraping"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/mlscraper/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lorey.png","metadata":{"files":{"readme":"README.rst","changelog":"HISTORY.rst","contributing":"CONTRIBUTING.rst","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.rst","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-07-30T12:45:54.000Z","updated_at":"2025-05-06T13:12:04.000Z","dependencies_parsed_at":"2023-01-31T02:01:05.277Z","dependency_job_id":"686920da-9b4b-4aae-8f58-69faaf3bed07","html_url":"https://github.com/lorey/mlscraper","commit_stats":{"total_commits":123,"total_committers":2,"mean_commits":61.5,"dds":0.008130081300813052,"last_synced_commit":"90551c0b5ec9099e217e62e0b732120db102c029"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lorey%2Fmlscraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lorey%2Fmlscraper/tags","releases_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lorey%2Fmlscraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lorey%2Fmlscraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lorey","download_url":"https://codeload.github.com/lorey/mlscraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254328384,"owners_count":22052632,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawler-python","crawling","extraction-engine","html","machine-learning","scraper","scraping"],"created_at":"2024-08-01T15:05:04.342Z","updated_at":"2025-05-15T11:07:41.868Z","avatar_url":"https://github.com/lorey.png","language":"Python","funding_links":[],"categories":["Python","scraping"],"sub_categories":[],"readme":"==================================================================================\nmlscraper: Scrape data from HTML pages automatically\n==================================================================================\n\n.. image:: https://img.shields.io/github/actions/workflow/status/lorey/mlscraper/tests.yml\n   :alt: CI status\n   :target: https://github.com/lorey/mlscraper/actions\n\n.. image:: https://img.shields.io/pypi/v/mlscraper\n   :alt: PyPI version\n   :target: https://pypi.org/project/mlscraper/\n\n.. 
image:: https://img.shields.io/pypi/pyversions/mlscraper\n   :alt: PyPI Python version\n   :target: https://pypi.org/project/mlscraper/\n\n`mlscraper` allows you to extract structured data from HTML automatically\ninstead of manually specifying nodes or CSS selectors.\nYou train it by providing a few examples of your desired output.\nIt will then figure out the extraction rules for you automatically\nand afterwards you'll be able to extract data from any new page you provide.\n\n.. image:: .github/how-it-works.png\n   :alt: Image showing how mlscraper turns HTML into data objects\n\n----------------\nBackground Story\n----------------\n\nMany services for crawling and scraping automation allow you to select data in a browser and get JSON results in return.\nNo need to specify CSS selectors or anything else.\n\nI've been wondering for a long time why there's no open-source solution that does something like this.\nSo here's my attempt at creating a Python library to enable automatic scraping.\n\nAll you have to do is define some examples of scraped data.\n`mlscraper` will figure out everything else and return clean data.\n\n------------\nHow it works\n------------\n\nAfter you've defined the data you want to scrape, mlscraper will:\n\n- find your samples inside the HTML DOM\n- determine which rules/methods to apply for extraction\n- extract the data for you and return it in a dictionary\n\n---------------\nGetting started\n---------------\n\nmlscraper is close to its 1.0 release.\nIf you want to try the new release, use :code:`pip install --pre mlscraper` to install the release candidate.\nYou can also install the latest (unstable) development version of mlscraper\nvia :code:`pip install git+https://github.com/lorey/mlscraper#egg=mlscraper`,\ne.g. to check new features or to see if a bug has been fixed already.\nPlease note that until the 1.0 release :code:`pip install mlscraper` will install an outdated 0.* version.\n\n.. 
_examples: examples/\n\nTo get started with a simple scraper, check out the basic example below.\n\n.. code-block:: python\n\n    import requests\n    from mlscraper.html import Page\n    from mlscraper.samples import Sample, TrainingSet\n    from mlscraper.training import train_scraper\n\n    # fetch the page to train on\n    einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'\n    resp = requests.get(einstein_url)\n    assert resp.status_code == 200\n\n    # create a sample for Albert Einstein\n    # please add at least two samples in practice to get meaningful rules!\n    training_set = TrainingSet()\n    page = Page(resp.content)\n    sample = Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'})\n    training_set.add_sample(sample)\n\n    # train the scraper with the created training set\n    scraper = train_scraper(training_set)\n\n    # scrape another page\n    resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')\n    result = scraper.get(Page(resp.content))\n    print(result)\n    # returns {'name': 'J.K. Rowling', 'born': 'July 31, 1965'}\n\nCheck the examples_ directory for usage examples until further documentation arrives.\n\n-----------\nDevelopment\n-----------\n\nSee CONTRIBUTING.rst_\n\n.. _CONTRIBUTING.rst: /CONTRIBUTING.rst\n\n------------\nRelated work\n------------\n\nI originally called this autoscraper, but while I was working on it someone else released a library with exactly the same name.\nCheck it out here: autoscraper_.\nAlso, while initially driven by machine learning, using statistics to search for heuristics turned out to be faster and to require less training data.\nBut since the name is memorable, I'll keep it.\n\n.. 
_autoscraper: https://github.com/alirezamika/autoscraper\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Florey%2Fmlscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Florey%2Fmlscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Florey%2Fmlscraper/lists"}