{"id":19448054,"url":"https://github.com/scrapinghub/arche","last_synced_at":"2025-08-02T07:41:41.073Z","repository":{"id":57411296,"uuid":"175686790","full_name":"scrapinghub/arche","owner":"scrapinghub","description":" Analyze scraped data","archived":false,"fork":false,"pushed_at":"2019-12-09T18:01:19.000Z","size":29225,"stargazers_count":46,"open_issues_count":27,"forks_count":17,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-04-25T02:43:29.474Z","etag":null,"topics":["data","data-analysis","data-visualization","jupyter","pandas","python3","scrapy"],"latest_commit_sha":null,"homepage":"https://arche.readthedocs.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scrapinghub.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-03-14T19:31:24.000Z","updated_at":"2025-02-09T18:21:31.000Z","dependencies_parsed_at":"2022-08-27T17:11:03.531Z","dependency_job_id":null,"html_url":"https://github.com/scrapinghub/arche","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/scrapinghub/arche","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Farche","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Farche/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Farche/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Farche/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scrapinghub","download_url":"https://codeload.github.com/sc
rapinghub/arche/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Farche/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268348786,"owners_count":24236306,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-02T02:00:12.353Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-analysis","data-visualization","jupyter","pandas","python3","scrapy"],"created_at":"2024-11-10T16:23:32.859Z","updated_at":"2025-08-02T07:41:40.986Z","avatar_url":"https://github.com/scrapinghub.png","language":"Python","readme":"# Arche\n\n[![PyPI](https://img.shields.io/pypi/v/arche.svg)](https://pypi.org/project/arche)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/arche.svg)](https://pypi.org/project/arche)\n![GitHub](https://img.shields.io/github/license/scrapinghub/arche.svg)\n[![Build Status](https://travis-ci.com/scrapinghub/arche.svg?branch=master)](https://travis-ci.com/scrapinghub/arche)\n[![Codecov](https://img.shields.io/codecov/c/github/scrapinghub/arche.svg)](https://codecov.io/gh/scrapinghub/arche)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)\n[![GitHub commit 
activity](https://img.shields.io/github/commit-activity/m/scrapinghub/arche.svg)](https://github.com/scrapinghub/arche/commits/master)\n\n    pip install arche\n\nArche (pronounced *Arkey*) helps to verify scraped data using a set of defined rules, for example:\n  * Validation with [JSON schema](https://json-schema.org/)\n  * Coverage (items, fields, categorical data, including booleans and enums)\n  * Duplicates\n  * Garbage symbols\n  * Comparison of two jobs\n\n_We use it at Scrapinghub, among other tools, to ensure the quality of scraped data._\n\n## Installation\n\nArche requires a [Jupyter](https://jupyter.org/install) environment and supports both the [JupyterLab](https://github.com/jupyterlab/jupyterlab#installation) and [Notebook](https://github.com/jupyter/notebook) UIs.\n\nFor JupyterLab, you will need to properly install the [plotly extensions](https://github.com/plotly/plotly.py#jupyterlab-support-python-35).\n\nThen just `pip install arche`.\n\n## Why\nTo check the quality of scraped data continuously. For example, after scraping a website, a typical approach is to validate the data with Arche. You can also create a schema and then set up [Spidermon](https://spidermon.readthedocs.io/en/latest/item-validation.html#with-json-schema).\n\n## Developer Setup\n\n\tpipenv install --dev\n\tpipenv shell\n\ttox\n\n## Contribution\nContributions are welcome! See https://github.com/scrapinghub/arche/issues if you want to take on something, suggest an improvement, or report a bug.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapinghub%2Farche","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrapinghub%2Farche","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapinghub%2Farche/lists"}