{"id":13414819,"url":"https://github.com/scrapy/scrapely","last_synced_at":"2025-04-13T10:04:58.292Z","repository":{"id":45430730,"uuid":"1632998","full_name":"scrapy/scrapely","owner":"scrapy","description":"A pure-python HTML screen-scraping library","archived":false,"fork":false,"pushed_at":"2022-04-04T10:53:21.000Z","size":572,"stargazers_count":1870,"open_issues_count":32,"forks_count":273,"subscribers_count":122,"default_branch":"master","last_synced_at":"2025-04-06T06:05:46.971Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scrapy.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2011-04-18T22:38:56.000Z","updated_at":"2025-04-04T13:19:56.000Z","dependencies_parsed_at":"2022-07-19T21:59:18.849Z","dependency_job_id":null,"html_url":"https://github.com/scrapy/scrapely","commit_stats":null,"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy%2Fscrapely","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy%2Fscrapely/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy%2Fscrapely/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy%2Fscrapely/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scrapy","download_url":"https://codeload.github.com/scrapy/scrapely/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248695438,"owners_count":21146954,"icon_ur
l":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-30T21:00:37.269Z","updated_at":"2025-04-13T10:04:58.270Z","avatar_url":"https://github.com/scrapy.png","language":"HTML","funding_links":[],"categories":["All","HTML","Python","Frameworks","HTML parser"],"sub_categories":[],"readme":"========\nScrapely\n========\n\n.. image:: https://api.travis-ci.org/scrapy/scrapely.svg?branch=master\n    :target: https://travis-ci.org/scrapy/scrapely\n\nScrapely is a library for extracting structured data from HTML pages. Given\nsome example web pages and the data to be extracted, scrapely constructs a\nparser for all similar pages.\n\nOverview\n========\n\nScrapinghub wrote a nice `blog post`_ explaining how scrapely works and how it's used in Portia_.\n\n.. _blog post: https://blog.scrapinghub.com/2016/07/07/scrapely-the-brains-behind-portia-spiders/\n.. _Portia: http://portia.readthedocs.io/\n\nInstallation\n============\n\nScrapely works in Python 2.7 or 3.3+.\nIt requires numpy and w3lib Python packages.\n\nTo install scrapely on any platform use::\n\n    pip install scrapely\n\nIf you're using Ubuntu (9.10 or above), you can install scrapely from the\nScrapy Ubuntu repos. 
Just add the Ubuntu repos as described here:\nhttp://doc.scrapy.org/en/latest/topics/ubuntu.html\n\nAnd then install scrapely with::\n\n    aptitude install python-scrapely\n\nUsage (API)\n===========\n\nScrapely has a powerful API, including a template format that can be edited\nexternally, that you can use to build very capable scrapers.\n\nWhat follows is a quick example of the simplest possible usage, which you can\nrun in a Python shell.\n\nStart by importing and instantiating the Scraper class::\n\n    \u003e\u003e\u003e from scrapely import Scraper\n    \u003e\u003e\u003e s = Scraper()\n\nThen train the scraper by giving it an example page and the data you expect\nto scrape from it (note that all keys and values in the data you pass must\nbe strings)::\n\n    \u003e\u003e\u003e url1 = 'http://pypi.python.org/pypi/w3lib/1.1'\n    \u003e\u003e\u003e data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}\n    \u003e\u003e\u003e s.train(url1, data)\n\nFinally, tell the scraper to scrape any other similar page and it will return\nthe results::\n\n    \u003e\u003e\u003e url2 = 'http://pypi.python.org/pypi/Django/1.3'\n    \u003e\u003e\u003e s.scrape(url2)\n    [{u'author': [u'Django Software Foundation \u0026lt;foundation at djangoproject com\u0026gt;'],\n      u'description': [u'A high-level Python Web framework that encourages rapid development and clean, pragmatic design.'],\n      u'name': [u'Django 1.3']}]\n\nThat's it! No XPaths, regular expressions, or hacky Python code.\n\nUsage (command line tool)\n=========================\n\nThere is also a simple script to create and manage Scrapely scrapers.\n\nIt supports a command-line interface and an interactive prompt. 
All commands\nsupported on the interactive prompt are also supported in the command-line\ninterface.\n\nTo enter the interactive prompt, run the tool with the scraper file name and no command::\n\n    python -m scrapely.tool myscraper.json\n\nExample::\n\n    $ python -m scrapely.tool myscraper.json\n    scrapely\u003e help\n\n    Documented commands (type help \u003ctopic\u003e):\n    ========================================\n    a  al  s  ta  td  tl\n\n    scrapely\u003e\n\nTo create a scraper and add a template::\n\n    scrapely\u003e ta http://pypi.python.org/pypi/w3lib/1.1\n    [0] http://pypi.python.org/pypi/w3lib/1.1\n\nThis is equivalent to running the following as a single command::\n\n    python -m scrapely.tool myscraper.json ta http://pypi.python.org/pypi/w3lib/1.1\n\nTo list the templates available in a scraper::\n\n    scrapely\u003e tl\n    [0] http://pypi.python.org/pypi/w3lib/1.1\n\nTo add a new annotation, you usually test the selection criteria first::\n\n    scrapely\u003e t 0 w3lib 1.1\n    [0] u'\u003ch1\u003ew3lib 1.1\u003c/h1\u003e'\n    [1] u'\u003ctitle\u003ePython Package Index : w3lib 1.1\u003c/title\u003e'\n\nYou can also quote the text, if you need to specify an arbitrary number of\nspaces, for example::\n\n    scrapely\u003e t 0 \"w3lib 1.1\"\n\nYou can refine by position. 
To take the one in position [0]::\n\n    scrapely\u003e a 0 w3lib 1.1 -n 0\n    [0] u'\u003ch1\u003ew3lib 1.1\u003c/h1\u003e'\n\nTo annotate some fields on the template::\n\n    scrapely\u003e a 0 w3lib 1.1 -n 0 -f name\n    [new] (name) u'\u003ch1\u003ew3lib 1.1\u003c/h1\u003e'\n    scrapely\u003e a 0 Scrapy project -n 0 -f author\n    [new] u'\u003cspan\u003eScrapy project\u003c/span\u003e'\n\nTo list annotations on a template::\n\n    scrapely\u003e al 0\n    [0-0] (name) u'\u003ch1\u003ew3lib 1.1\u003c/h1\u003e'\n    [0-1] (author) u'\u003cspan\u003eScrapy project\u003c/span\u003e'\n\nTo scrape another similar page with the already added templates::\n\n    scrapely\u003e s http://pypi.python.org/pypi/Django/1.3\n    [{u'author': [u'Django Software Foundation'], u'name': [u'Django 1.3']}]\n\n\nTests\n=====\n\n`tox`_ is the preferred way to run tests. Just run: ``tox`` from the root\ndirectory.\n\nSupport\n=======\n\n* Mailing list: https://groups.google.com/forum/#!forum/scrapely\n* IRC: `scrapy@freenode`_\n\nScrapely is created and maintained by the Scrapy group, so you can get help\nthrough the usual support channels described in the `Scrapy community`_ page.\n\nArchitecture\n============\n\nUnlike most scraping libraries, Scrapely doesn't work with DOM trees or xpaths\nso it doesn't depend on libraries such as lxml or libxml2. Instead, it uses\nan internal pure-python parser, which can accept poorly formed HTML. The HTML is\nconverted into an array of token ids, which is used for matching the items to\nbe extracted.\n\nScrapely extraction is based upon the Instance Based Learning algorithm [1]_\nand the matched items are combined into complex objects (it supports nested and\nrepeated objects), using a tree of parsers, inspired by A Hierarchical\nApproach to Wrapper Induction [2]_.\n\n.. 
[1] `Yanhong Zhai, Bing Liu, Extracting Web Data Using Instance-Based Learning, World Wide Web, v.10 n.2, p.113-132, June 2007 \u003chttp://portal.acm.org/citation.cfm?id=1265174\u003e`_\n\n.. [2] `Ion Muslea, Steve Minton, Craig Knoblock, A hierarchical approach to wrapper induction, Proceedings of the third annual conference on Autonomous Agents, p.190-197, April 1999, Seattle, Washington, United States \u003chttp://portal.acm.org/citation.cfm?id=301191\u003e`_\n\nKnown Issues\n============\n\nThe training implementation is currently very simple and is only provided for\nreference purposes, to make it easier to test Scrapely and play with it. On\nthe other hand, the extraction code is reliable and production-ready. So, if\nyou want to use Scrapely in production, you should use train() with caution and\nmake sure it annotates the area of the page you intended.\n\nAlternatively, you can use the Scrapely command line tool to annotate pages,\nwhich provides more manual control for higher accuracy.\n\nHow does Scrapely relate to `Scrapy`_?\n======================================\n\nDespite the similarity in their names, Scrapely and `Scrapy`_ are quite\ndifferent things. The only similarity they share is that they both depend on\n`w3lib`_, and they are both maintained by the same group of developers (which\nis why both are hosted on the `same Github account`_).\n\nScrapy is an application framework for building web crawlers, while Scrapely is\na library for extracting structured data from HTML pages. If anything, Scrapely\nis more similar to `BeautifulSoup`_ or `lxml`_ than to Scrapy.\n\nScrapely doesn't depend on Scrapy nor the other way around. 
In fact, it is\nquite common to use Scrapy without Scrapely, and vice versa.\n\nIf you are looking for a complete crawler-scraper solution, there is (at least)\none project called `Slybot`_ that integrates both, but you can definitely use\nScrapely on other web crawlers since it's just a library.\n\nScrapy has a built-in extraction mechanism called `selectors`_ which (unlike\nScrapely) is based on XPaths.\n\n\nLicense\n=======\n\nThe Scrapely library is licensed under the BSD license.\n\n.. _Scrapy: http://scrapy.org/\n.. _w3lib: https://github.com/scrapy/w3lib\n.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/\n.. _lxml: http://lxml.de/\n.. _same Github account: https://github.com/scrapy\n.. _slybot: https://github.com/scrapy/slybot\n.. _selectors: http://doc.scrapy.org/en/latest/topics/selectors.html\n.. _nose: http://readthedocs.org/docs/nose/en/latest/\n.. _scrapy@freenode: http://webchat.freenode.net/?channels=scrapy\n.. _Scrapy community: http://scrapy.org/community/\n.. _tox: https://pypi.python.org/pypi/tox\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapy%2Fscrapely","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrapy%2Fscrapely","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapy%2Fscrapely/lists"}