{"id":19900000,"url":"https://github.com/scrapy-plugins/scrapy-jsonschema","last_synced_at":"2025-05-02T22:32:06.725Z","repository":{"id":53591498,"uuid":"79554983","full_name":"scrapy-plugins/scrapy-jsonschema","owner":"scrapy-plugins","description":"Scrapy schema validation pipeline and Item builder using JSON Schema","archived":false,"fork":false,"pushed_at":"2021-03-26T15:32:38.000Z","size":72,"stargazers_count":44,"open_issues_count":1,"forks_count":12,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-04-19T01:32:33.564Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scrapy-plugins.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGES.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-01-20T11:51:06.000Z","updated_at":"2025-04-18T19:27:50.000Z","dependencies_parsed_at":"2022-08-31T02:00:15.984Z","dependency_job_id":null,"html_url":"https://github.com/scrapy-plugins/scrapy-jsonschema","commit_stats":null,"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-jsonschema","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-jsonschema/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-jsonschema/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapy-plugins%2Fscrapy-jsonschema/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scrapy-plugins","download_url":"https://
codeload.github.com/scrapy-plugins/scrapy-jsonschema/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252116446,"owners_count":21697380,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T20:10:46.825Z","updated_at":"2025-05-02T22:32:05.188Z","avatar_url":"https://github.com/scrapy-plugins.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"=================\nscrapy-jsonschema\n=================\n\n.. image:: https://img.shields.io/pypi/pyversions/scrapy-jsonschema.svg\n       :target: https://pypi.python.org/pypi/scrapy-jsonschema\n\n.. image:: https://img.shields.io/pypi/v/scrapy-jsonschema.svg\n    :target: https://pypi.python.org/pypi/scrapy-jsonschema\n\n.. image:: https://travis-ci.org/scrapy-plugins/scrapy-jsonschema.svg?branch=master\n    :target: https://travis-ci.org/scrapy-plugins/scrapy-jsonschema\n\n.. image:: https://codecov.io/gh/scrapy-plugins/scrapy-jsonschema/branch/master/graph/badge.svg\n  :target: https://codecov.io/gh/scrapy-plugins/scrapy-jsonschema\n\nThis plugin provides two features based on `JSON Schema`_ and the\n`jsonschema`_ Python library:\n\n* a `Scrapy Item`_ definition builder from a JSON Schema definition\n* a `Scrapy item pipeline`_ to validate items against a JSON Schema definition\n\n.. _jsonschema: https://pypi.python.org/pypi/jsonschema\n.. _Scrapy Item: https://docs.scrapy.org/en/latest/topics/items.html\n.. 
_Scrapy item pipeline: https://docs.scrapy.org/en/latest/topics/item-pipeline.html\n\n\nInstallation\n============\n\nInstall scrapy-jsonschema using ``pip``::\n\n    $ pip install scrapy-jsonschema\n\n\nConfiguration\n=============\n\nAdd ``JsonSchemaValidatePipeline`` by including it in ``ITEM_PIPELINES``\nin your ``settings.py`` file::\n\n   ITEM_PIPELINES = {\n       ...\n       'scrapy_jsonschema.JsonSchemaValidatePipeline': 100,\n   }\n\nHere, priority ``100`` is just an example.\nSet its value depending on other pipelines you may have enabled already.\n\n\nUsage\n=====\n\nLet's assume that you are working with the JSON schema below,\nrepresenting products each requiring a numeric ID, a name, and a strictly positive price\n(this example is taken from the `JSON Schema`_ website)::\n\n    {\n        \"$schema\": \"http://json-schema.org/draft-04/schema#\",\n        \"title\": \"Product\",\n        \"description\": \"A product from Acme's catalog\",\n        \"type\": \"object\",\n        \"properties\": {\n            \"id\": {\n                \"description\": \"The unique identifier for a product\",\n                \"type\": \"integer\"\n            },\n            \"name\": {\n                \"description\": \"Name of the product\",\n                \"type\": \"string\"\n            },\n            \"price\": {\n                \"type\": \"number\",\n                \"minimum\": 0,\n                \"exclusiveMinimum\": true\n            }\n        },\n        \"required\": [\"id\", \"name\", \"price\"]\n    }\n\nYou can define a ``scrapy.Item`` from this schema by subclassing\n``scrapy_jsonschema.item.JsonSchemaItem``, and setting a ``jsonschema``\nclass attribute to the schema.\nThis attribute should be a Python ``dict`` (note that JSON's \"true\" became ``True`` below;\nyou can use Python's ``json`` module to load a JSON Schema from a string)::\n\n    from scrapy_jsonschema.item import JsonSchemaItem\n\n\n    class ProductItem(JsonSchemaItem):\n        
jsonschema =     {\n            \"$schema\": \"http://json-schema.org/draft-04/schema#\",\n            \"title\": \"Product\",\n            \"description\": \"A product from Acme's catalog\",\n            \"type\": \"object\",\n            \"properties\": {\n                \"id\": {\n                    \"description\": \"The unique identifier for a product\",\n                    \"type\": \"integer\"\n                },\n                \"name\": {\n                    \"description\": \"Name of the product\",\n                    \"type\": \"string\"\n                },\n                \"price\": {\n                    \"type\": \"number\",\n                    \"minimum\": 0,\n                    \"exclusiveMinimum\": True\n                }\n            },\n            \"required\": [\"id\", \"name\", \"price\"]\n        }\n\nYou can then use this item class like any regular Scrapy item\n(notice how fields that are not in the schema raise errors when assigned)::\n\n    \u003e\u003e\u003e item = ProductItem()\n    \u003e\u003e\u003e item['foo'] = 3\n    (...)\n    KeyError: 'ProductItem does not support field: foo'\n\n    \u003e\u003e\u003e item['name'] = 'Some name'\n    \u003e\u003e\u003e item['name']\n    'Some name'\n\nIf you use this item definition in a spider and if the pipeline is enabled,\ngenerated items that do not follow the schema will be dropped.\nIn the (unrealistic) example spider below, one of the items only contains the \"name\",\nand \"id\" and \"price\" are missing::\n\n    class ExampleSpider(scrapy.Spider):\n        name = \"example\"\n        allowed_domains = [\"example.com\"]\n        start_urls = ['http://example.com/']\n\n        def parse(self, response):\n            yield ProductItem({\n                \"name\": response.css('title::text').extract_first()\n            })\n\n            yield ProductItem({\n                \"id\": 1,\n                \"name\": response.css('title::text').extract_first(),\n                \"price\": 
9.99\n            })\n\nWhen you run this spider and the item with missing fields is output,\nyou should see these lines appear in the logs::\n\n    2017-01-20 12:34:23 [scrapy.core.scraper] WARNING: Dropped: schema validation failed:\n     id: 'id' is a required property\n    price: 'price' is a required property\n\n    {'name': u'Example Domain'}\n\nThe second item conforms to the schema, so it is logged as a regular scraped item::\n\n    2017-01-20 12:34:23 [scrapy.core.scraper] DEBUG: Scraped from \u003c200 http://example.com/\u003e\n    {'id': 1, 'name': u'Example Domain', 'price': 9.99}\n\n\nThe item pipeline also updates Scrapy stats with a few counters, under\nthe ``jsonschema/`` namespace::\n\n    2017-01-20 12:34:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:\n    {...\n     'item_dropped_count': 1,\n     'item_dropped_reasons_count/DropItem': 1,\n     'item_scraped_count': 1,\n     'jsonschema/errors/id': 1,\n     'jsonschema/errors/price': 1,\n     ...}\n    2017-01-20 12:34:23 [scrapy.core.engine] INFO: Spider closed (finished)\n\n\n.. _JSON Schema: http://json-schema.org/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapy-plugins%2Fscrapy-jsonschema","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrapy-plugins%2Fscrapy-jsonschema","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapy-plugins%2Fscrapy-jsonschema/lists"}