https://github.com/istresearch/memex-cdr
This repository hosts code and schema information related to the Memex Crawl Data Repository (CDR)
- Host: GitHub
- URL: https://github.com/istresearch/memex-cdr
- Owner: istresearch
- Created: 2016-05-23T21:23:45.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2018-10-30T14:28:38.000Z (over 6 years ago)
- Last Synced: 2024-08-14T07:09:20.032Z (8 months ago)
- Language: Python
- Size: 16.6 KB
- Stars: 5
- Watchers: 6
- Forks: 8
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# CDR Validation Script
The `cdr_validation.py` script processes a `gzip`-compressed file of JSON lines. It expects each line of the file to contain one CDR-formatted object serialized as a string, so that the line can be loaded with `json.loads()`.

## Crawl Documents
For crawl pages, it checks that each object contains the following keys:

* `_id`
* `timestamp`
* `content_type`
* `crawler`
* `extracted_metadata`
* `extracted_text`
* `raw_content`
* `team`
* `url`
* `version`

These keys are defined as required fields per the [CDR Schema wiki page](https://memexproxy.com/wiki/display/MPM/CDR+Schema).
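
The required-key check for crawl documents can be sketched as follows. This is a minimal Python 3 illustration under stated assumptions, not the code in `cdr_validation.py`: the helper names `iter_documents` and `missing_fields` are hypothetical, and a key is assumed to count as present whenever it appears in the object, whatever its value.

```python
import gzip
import json

# Required keys for crawl documents, in the order listed above.
CRAWL_REQUIRED = [
    "_id", "timestamp", "content_type", "crawler", "extracted_metadata",
    "extracted_text", "raw_content", "team", "url", "version",
]


def iter_documents(path):
    """Yield one parsed object per non-empty line of a gzip'd JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skipping blank lines is an assumption
                yield json.loads(line)


def missing_fields(doc, required):
    """Return the required keys absent from doc, preserving the listed order."""
    return [key for key in required if key not in doc]
```

For each object from `iter_documents("input.gz")`, calling `missing_fields(doc, CRAWL_REQUIRED)` gives an empty list when the document has every required key.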
## Media Documents
For media documents, it checks that each object's `obj_parent` exists within the given dataset. For example, if the media's `obj_parent` is `A12DVKD12478Z`, then `A12DVKD12478Z` must exist as the `_id` of a crawl document contained within the same dataset. Additionally, it verifies that each object contains the following keys:

* `_id`
* `timestamp`
* `content_type`
* `obj_original_url`
* `obj_parent`
* `obj_stored_url`
* `team`
* `version`

These keys are defined as required fields per the [CDR Schema wiki page](https://memexproxy.com/wiki/display/MPM/CDR+Schema).
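
The two media checks can be sketched as follows. This is a hedged Python 3 illustration, not the script's actual implementation: the function name `validate_media` is hypothetical, the reason strings mirror those described under "Interpreting the `result_file`", and checking required fields before the parent lookup is an assumption about ordering.

```python
# Required keys for media documents, in the order listed above.
MEDIA_REQUIRED = [
    "_id", "timestamp", "content_type", "obj_original_url",
    "obj_parent", "obj_stored_url", "team", "version",
]


def validate_media(media_docs, crawl_docs):
    """Return (_id, status, reason) tuples for each media document."""
    # Collect the _id of every crawl document in the same dataset.
    crawl_ids = {doc["_id"] for doc in crawl_docs if "_id" in doc}
    results = []
    for doc in media_docs:
        doc_id = doc.get("_id")
        missing = [key for key in MEDIA_REQUIRED if key not in doc]
        if missing:
            reason = "Missing required fields: " + " ".join(missing)
            results.append((doc_id, "Failed", reason))
        elif doc["obj_parent"] not in crawl_ids:
            reason = "Missing parent document (field: obj_parent)"
            results.append((doc_id, "Failed", reason))
        else:
            results.append((doc_id, "Passed", ""))
    return results
```

Building the set of crawl `_id` values once keeps each parent lookup constant-time, which matters when the dataset holds many media documents.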
## Executing `cdr_validation.py`
This script requires Python 2.7. To execute it, you must provide the path to the input file (`input_file`) and the desired path of the output (`result_file`).
```
python cdr_validation.py --input_file=input.gz --result_file=output
```
The script reports the number of documents that passed, the number that failed, and the time taken to execute. For example:
```
1006 documents passed.
994 documents failed.
Took 0:00:02.337122
```
## Interpreting the `result_file`
The script also writes an output file, a CSV in which the first column is the `_id` and the second column is either `Passed` or `Failed`. For `Failed` rows, a third column gives a rationale, which is one of:

1. `Missing parent document (field: obj_parent)`
2. `Missing required fields: content_type version` (in this example, the missing fields are `content_type` and `version`)
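
A result file in this layout can be tallied with a short script. The following is a minimal Python 3 sketch, assuming the CSV is comma-delimited and has no header row (neither is stated above); `summarize_results` is a hypothetical helper, not part of the repository.

```python
import csv
from collections import Counter


def summarize_results(path):
    """Count rows by status and failed rows by rationale."""
    status_counts = Counter()
    reason_counts = Counter()
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            status_counts[row[1]] += 1
            # Only Failed rows are expected to carry a third column.
            if row[1] == "Failed" and len(row) > 2:
                reason_counts[row[2]] += 1
    return status_counts, reason_counts
```

Calling `summarize_results("output")` returns two counters: one keyed by `Passed` or `Failed`, and one keyed by the rationale text, which makes it easy to see which failure dominates a dataset.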