https://github.com/scrapinghub/arche
Analyze scraped data
data data-analysis data-visualization jupyter pandas python3 scrapy
- Host: GitHub
- URL: https://github.com/scrapinghub/arche
- Owner: scrapinghub
- License: MIT
- Created: 2019-03-14T19:31:24.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2019-12-09T18:01:19.000Z (over 5 years ago)
- Last Synced: 2025-04-03T15:01:46.261Z (about 1 month ago)
- Topics: data, data-analysis, data-visualization, jupyter, pandas, python3, scrapy
- Language: Python
- Homepage: https://arche.readthedocs.io/
- Size: 27.9 MB
- Stars: 46
- Watchers: 15
- Forks: 18
- Open Issues: 27
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- License: LICENSE
README
# Arche
    pip install arche
Arche (pronounced *Arkey*) helps verify scraped data using a set of defined rules, for example:
* Validation with [JSON schema](https://json-schema.org/)
* Coverage (items, fields, categorical data, including booleans and enums)
* Duplicates
* Garbage symbols
* Comparison of two jobs
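To make two of the rules above concrete, here is a minimal standard-library sketch of what a coverage check and a duplicates check compute. It is not Arche's API: the sample items, the field names, and the helpers `field_coverage` and `duplicate_values` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical scraped items; the field names are made up for illustration.
items = [
    {"name": "Widget", "price": "9.99", "url": "https://example.com/1"},
    {"name": "Gadget", "price": None, "url": "https://example.com/2"},
    {"name": "Widget", "price": "9.99", "url": "https://example.com/1"},
]


def field_coverage(items):
    """Return the fraction of items with a non-empty value for each field."""
    fields = sorted({key for item in items for key in item})
    return {
        field: sum(1 for item in items if item.get(field) not in (None, ""))
        / len(items)
        for field in fields
    }


def duplicate_values(items, field):
    """Return the values of `field` that appear in more than one item."""
    counts = Counter(
        item.get(field) for item in items if item.get(field) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


coverage = field_coverage(items)
duplicates = duplicate_values(items, "url")
print(coverage)
print(duplicates)
```

In this sample, `price` is missing from one of the three items and the same `url` appears twice. These are the kinds of findings that coverage and duplicate rules are meant to surface.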
_We use it at Scrapinghub, among other tools, to ensure the quality of scraped data._
## Installation
Arche requires a [Jupyter](https://jupyter.org/install) environment and supports both the [JupyterLab](https://github.com/jupyterlab/jupyterlab#installation) and [Notebook](https://github.com/jupyter/notebook) UIs.
For JupyterLab, you will also need to install the [plotly extensions](https://github.com/plotly/plotly.py#jupyterlab-support-python-35).
Then run `pip install arche`.
## Why
Arche exists to check the quality of scraped data continuously. For example, after scraping a website, a typical approach is to validate the data with Arche. You can also create a schema and then set up [Spidermon](https://spidermon.readthedocs.io/en/latest/item-validation.html#with-json-schema).
## Developer Setup
    pipenv install --dev
    pipenv shell
    tox
## Contribution
Any contributions are welcome! See https://github.com/scrapinghub/arche/issues if you want to take on something or suggest an improvement/report a bug.