{"id":19665488,"url":"https://github.com/zytedata/zyte-autoextract","last_synced_at":"2025-04-28T22:31:11.423Z","repository":{"id":44938538,"uuid":"182260561","full_name":"zytedata/zyte-autoextract","owner":"zytedata","description":"Python clients for Zyte AutoExtract API","archived":false,"fork":false,"pushed_at":"2022-01-17T12:40:56.000Z","size":281,"stargazers_count":40,"open_issues_count":14,"forks_count":7,"subscribers_count":64,"default_branch":"master","last_synced_at":"2025-04-18T18:59:18.428Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zytedata.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGES.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-19T12:21:08.000Z","updated_at":"2025-01-05T17:53:23.000Z","dependencies_parsed_at":"2022-09-04T19:11:12.459Z","dependency_job_id":null,"html_url":"https://github.com/zytedata/zyte-autoextract","commit_stats":null,"previous_names":["scrapinghub/scrapinghub-autoextract"],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zytedata%2Fzyte-autoextract","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zytedata%2Fzyte-autoextract/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zytedata%2Fzyte-autoextract/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zytedata%2Fzyte-autoextract/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zytedata","download_url":"https://codeload.github.com/zytedata/zyte-autoextract/tar.gz/refs/h
eads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251397577,"owners_count":21583034,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T16:23:05.891Z","updated_at":"2025-04-28T22:31:10.932Z","avatar_url":"https://github.com/zytedata.png","language":"Python","readme":"================\nzyte-autoextract\n================\n\n.. image:: https://img.shields.io/pypi/v/zyte-autoextract.svg\n   :target: https://pypi.python.org/pypi/zyte-autoextract\n   :alt: PyPI Version\n\n.. image:: https://img.shields.io/pypi/pyversions/zyte-autoextract.svg\n   :target: https://pypi.python.org/pypi/zyte-autoextract\n   :alt: Supported Python Versions\n\n.. image:: https://github.com/zytedata/zyte-autoextract/workflows/tox/badge.svg\n   :target: https://github.com/zytedata/zyte-autoextract/actions\n   :alt: Build Status\n\n.. image:: https://codecov.io/github/zytedata/zyte-autoextract/coverage.svg?branch=master\n   :target: https://codecov.io/gh/zytedata/zyte-autoextract\n   :alt: Coverage report\n\nPython client libraries for `Zyte Automatic Extraction API`_.\nThey let you extract product, article, job posting and other\ninformation from any website - whatever the API supports.\n\nA command-line utility, an asyncio-based library and a simple synchronous wrapper\nare provided by this package.\n\nThe license is BSD 3-clause.\n\n.. 
_Zyte Automatic Extraction API: https://www.zyte.com/data-extraction/\n\n\nInstallation\n============\n\n::\n\n    pip install zyte-autoextract\n\nzyte-autoextract requires Python 3.6+ for the CLI tool and for\nthe asyncio API; the basic, synchronous API works with Python 3.5.\n\nUsage\n=====\n\nFirst, make sure you have an API key. To avoid passing it in the ``api_key``\nargument with every call, you can set the ``ZYTE_AUTOEXTRACT_KEY``\nenvironment variable to the key.\n\nCommand-line interface\n----------------------\n\nThe most basic way to use the client is from the command line.\nFirst, create a file with URLs, one URL per line (e.g. ``urls.txt``).\nSecond, set the ``ZYTE_AUTOEXTRACT_KEY`` env variable to your\nZyte Automatic Extraction API key (you can also pass the API key as the ``--api-key`` script\nargument).\n\nThen run the script to get the results::\n\n    python -m autoextract urls.txt --page-type article --output res.jl\n\n.. note::\n    The results can be stored in an order which is different from the input\n    order. If you need to match the output results to the input URLs, the\n    best way is to use the ``meta`` field (see below); it is passed through,\n    and returned as-is in ``row[\"query\"][\"userQuery\"][\"meta\"]``.\n\nIf you need more flexibility, you can customize the requests by creating\na JsonLines file with queries: a JSON object per line. You can pass any\nZyte Automatic Extraction options there. Example - store it in a ``queries.jl`` file::\n\n    {\"url\": \"http://example.com\", \"meta\": \"id0\", \"articleBodyRaw\": false}\n    {\"url\": \"http://example.com/foo\", \"meta\": \"id1\", \"articleBodyRaw\": false}\n    {\"url\": \"http://example.com/bar\", \"meta\": \"id2\", \"articleBodyRaw\": false}\n\nSee `API docs`_ for a description of all supported parameters in these query\ndicts. The API docs mention batch requests and their limitation\n(no more than 100 queries at a time); these limits don't apply to the queries.jl\nfile (i.e. 
it may have millions of rows), as the command-line script does\nits own batching.\n\n.. _API docs: https://docs.zyte.com/automatic-extraction.html\n\nNote that in the example the ``pageType`` argument is omitted; ``pageType``\nvalues are filled automatically from the ``--page-type`` command line argument\nvalue. You can also set a different ``pageType`` for a row in the ``queries.jl``\nfile; it takes priority over ``--page-type`` passed on the command line.\n\nTo get results for this ``queries.jl`` file, run::\n\n    python -m autoextract --intype jl queries.jl --page-type article --output res.jl\n\nProcessing speed\n~~~~~~~~~~~~~~~~\n\nEach API key has a limit on requests per second (RPS). To get your URLs\nprocessed faster you can tune the concurrency options: batch size and\nnumber of connections.\n\nThe best options depend on the RPS limit and on the websites you're extracting\ndata from. For example, if your API key has a limit of 3 RPS, and the average\nresponse time you observe for your websites is 10s, then reaching these\n3 RPS takes about 3 * 10 = 30 requests in flight; you may set e.g.\nbatch size = 2, number of connections = 15 - this allows processing\n2 * 15 = 30 requests in parallel.\n\nTo set these options in the CLI, use the ``--n-conn`` and ``--batch-size``\narguments::\n\n    python -m autoextract urls.txt --page-type article --n-conn 15 --batch-size 2 --output res.jl\n\nIf too many requests are being processed in parallel, you'll be getting\nthrottling errors. They are handled by the CLI automatically, but they make\nextraction less efficient; please tune the concurrency options so that you\ndon't hit the throttling errors (HTTP 429) often.\n\nYou may also be limited by the website speed. Zyte Automatic Extraction tries not to hit\nany individual website too hard, but it could be better to limit this on\nthe client side as well. 
If you're extracting data from a single website,\nit could make sense to decrease the number of parallel requests; this can give\na higher success ratio overall.\n\nIf you're extracting data from multiple websites, it makes sense to spread the\nload across time: if you have websites A, B and C, don't send requests in\nAAAABBBBCCCC order, send them in ABCABCABCABC order instead.\n\nTo do so, you can change the order of the queries in your input file.\nAlternatively, you can pass the ``--shuffle`` option; it randomly shuffles\ninput queries before sending them to the API::\n\n    python -m autoextract urls.txt --shuffle --page-type article --output res.jl\n\nRun ``python -m autoextract --help`` to get a description of all supported\noptions.\n\nErrors\n~~~~~~\n\nThe following errors could happen while making requests:\n\n- Network errors\n- `Request-level errors`_\n    - Authentication failure\n    - Malformed request\n    - Too many queries in request\n    - Request payload size is too large\n- `Query-level errors`_\n    - Downloader errors\n    - Proxy errors\n    - ...\n\nSome errors can be retried while others can't.\n\nFor example,\nyou can retry a query with a Proxy Timeout error\nbecause this is a temporary error\nand there is a chance that the response will be different\non one of the next retries.\n\nOn the other hand,\nit makes no sense to retry queries that return a 404 Not Found error\nbecause the response is not supposed to change if retried.\n\n.. _Request-level errors: https://docs.zyte.com/automatic-extraction.html#request-level\n.. 
_Query-level errors: https://docs.zyte.com/automatic-extraction.html#query-level\n\nRetries\n~~~~~~~\n\nBy default, Network and Request-level errors are retried automatically.\nYou can also enable retries for Query-level errors\nby specifying the ``--max-query-error-retries`` argument.\n\nEnabling Query-level retries increases the success rate\nat the cost of more requests being performed.\n\n.. code-block::\n\n    python -m autoextract urls.txt --page-type article --max-query-error-retries 3 --output res.jl\n\nFailing queries are retried\nuntil the max number of retries or a timeout is reached.\nIf it's still not possible to fetch all queries without errors,\nthe last available result is written to the output,\nincluding both the successful queries and the ones with errors.\n\nSynchronous API\n---------------\n\nThe synchronous API provides an easy way to try Zyte Automatic Extraction.\nFor production usage the asyncio API is strongly recommended. Currently the\nsynchronous API doesn't handle throttling errors, and has other limitations;\nit is most suited for quickly checking extraction results for a few URLs.\n\nTo send a request, use the ``request_raw`` function; consult the\n`API docs`_ to understand how to populate the query::\n\n    from autoextract.sync import request_raw\n    query = [{'url': 'http://example.com/foo', 'pageType': 'article'}]\n    results = request_raw(query)\n\nNote that if there are several URLs in the query, results can be returned in\narbitrary order.\n\nThere is also an ``autoextract.sync.request_batch`` helper, which accepts URLs\nand a page type, and ensures results are in the same order as the requested URLs::\n\n    from autoextract.sync import request_batch\n    urls = ['http://example.com/foo', 'http://example.com/bar']\n    results = request_batch(urls, page_type='article')\n\n.. 
note::\n    Currently ``request_batch`` is limited to 100 URLs at a time.\n\nasyncio API\n-----------\n\nBasic usage is similar to the sync API (``request_raw``),\nbut an asyncio event loop is used::\n\n    from autoextract.aio import request_raw\n\n    async def foo():\n        query = [{'url': 'http://example.com/foo', 'pageType': 'article'}]\n        results1 = await request_raw(query)\n        # ...\n\nThere is also a ``request_parallel_as_completed`` function, which\nprocesses many URLs in parallel, using both batching and multiple\nconnections::\n\n    import json\n    import sys\n    from autoextract.aio import request_parallel_as_completed, create_session\n    from autoextract.aio.errors import RequestError\n    from autoextract import ArticleRequest\n\n    async def extract_from(urls):\n        requests = [ArticleRequest(url) for url in urls]\n        async with create_session() as session:\n            res_iter = request_parallel_as_completed(requests,\n                                        n_conn=15, batch_size=2,\n                                        session=session)\n            for fut in res_iter:\n                try:\n                    batch_result = await fut\n                    for res in batch_result:\n                        # do something with a result, e.g.\n                        print(json.dumps(res))\n                except RequestError as e:\n                    print(e, file=sys.stderr)\n                    raise\n\n``request_parallel_as_completed`` is modelled after ``asyncio.as_completed``\n(see https://docs.python.org/3/library/asyncio-task.html#asyncio.as_completed),\nand actually uses it under the hood.\n\nNote ``from autoextract import ArticleRequest`` and its usage in the\nexample above. 
There are several Request helper classes,\nwhich simplify building queries.\n\nThe ``request_parallel_as_completed`` and ``request_raw`` functions handle\nthrottling (HTTP 429 errors) and network errors, retrying a request in\nthese cases.\n\nThe CLI implementation (``autoextract/__main__.py``) can serve\nas a usage example.\n\nRequest helpers\n---------------\n\nTo query Zyte Automatic Extraction you need to create a dict with request parameters, e.g.::\n\n    {'url': 'http://example.com/foo', 'pageType': 'article'}\n\nTo simplify the library usage and avoid typos, zyte-autoextract\nprovides helper classes for constructing these dicts:\n\n* autoextract.Request\n* autoextract.ArticleRequest\n* autoextract.ProductRequest\n* autoextract.JobPostingRequest\n\nYou can pass instances of these classes instead of dicts wherever\nrequest dicts are accepted. So e.g. instead of writing this::\n\n    query = [{\"url\": url, \"pageType\": \"article\"} for url in urls]\n\nYou can write this::\n\n    query = [Request(url, pageType=\"article\") for url in urls]\n\nor this::\n\n    query = [ArticleRequest(url) for url in urls]\n\nThere is one difference: the ``articleBodyRaw`` parameter is set to ``False``\nby default when Request or its variants are used, while it is ``True``\nby default in the API.\n\nYou can override API params by passing a dictionary with extra data using the\n``extra`` argument. 
Note that it will overwrite any previous configuration\nmade using standard attributes like ``articleBodyRaw`` and ``fullHtml``.\n\nExtra parameters example::\n\n    request = ArticleRequest(\n        url=url,\n        fullHtml=True,\n        extra={\n            \"customField\": \"custom value\",\n            \"fullHtml\": False\n        }\n    )\n\nThis will generate a query that looks like this::\n\n    {\n        \"url\": url,\n        \"pageType\": \"article\",\n        \"fullHtml\": False,  # our extra parameter overrides the previous value\n        \"customField\": \"custom value\"  # not a default param, but included anyway\n    }\n\n\nContributing\n============\n\n* Source code: https://github.com/zytedata/zyte-autoextract\n* Issue tracker: https://github.com/zytedata/zyte-autoextract/issues\n\nUse tox_ to run tests with different Python versions::\n\n    tox\n\nThe command above also runs type checks; we use mypy.\n\n.. _tox: https://tox.readthedocs.io\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzytedata%2Fzyte-autoextract","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzytedata%2Fzyte-autoextract","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzytedata%2Fzyte-autoextract/lists"}