https://github.com/datacite/pidcheck
A generic PID Health Checker
https://github.com/datacite/pidcheck
Last synced: 5 months ago
JSON representation
A generic PID Health Checker
- Host: GitHub
- URL: https://github.com/datacite/pidcheck
- Owner: datacite
- License: mit
- Created: 2018-02-04T18:34:42.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2023-08-14T16:36:29.000Z (almost 3 years ago)
- Last Synced: 2025-09-11T10:23:57.307Z (10 months ago)
- Language: Python
- Homepage:
- Size: 51.8 KB
- Stars: 2
- Watchers: 8
- Forks: 2
- Open Issues: 2
-
Metadata Files:
- Readme: readme.md
- License: LICENSE
Awesome Lists containing this project
README
PidCheck
--------
PidCheck is a generic crawler for extracting data about PiD's from landing pages and doing some calculation on the health of the link.
It is based upon the [Scrapy](https://scrapy.org/) framework for doing most of the hard work.
It is configured for broad crawling to hit multiple domains and does this in a polite way by default.
While the project actually includes a basic non redis backed version, the architecture
is designed to have a redis store for both feeding urls to check and storing the data for further
processing.
# Getting started with docker
For starting a version of the crawler and a redis, you can just do regular
`docker-compose up`
For debugging purposes you can use the seperate debug compose file
`docker-compose -f docker-compose.debug.yml up`
With this running you can push data into redis using redis-cli:
`src/redis-cli -p 6379 lpush pidcheck:start_urls '{ "pid": "msk0-7250", "url": "https://blog.datacite.org/datacite-hiring-another-application-developer/" }'`
## Settings
The following are important settings that you can override with environment variables.
It is possible to use a .env file for changing these settings as well.
* USER_AGENT - Specify a user agent so sites can identify your bot. default: pidcheck
* LOG_LEVEL - Standard python logging levels can be set, default: INFO
* REDIS_HOST - Host for specifying a different redis* default: redis
* REDIS_PORT - Port for specifiying a different redis* default: 6379
*Note specifying a different redis, you will want to use only the crawler docker image and
not the redis one.*
# Usage
## Seeding
The redis has a SEED_URL key in the format of: "pidcheck:start_urls".
You can push directly using the redis-cli:
```src/redis-cli -p 6379 lpush pidcheck:start_urls '{ "pid": "msk0-7250", "url": "https://blog.datacite.org/datacite-hiring-another-application-developer/" }'```
For conveniance there is also a scripts/seed.py that can take either a json lines format with each line being a json object:
```'{ "pid": "msk0-7250", "url": "https://blog.datacite.org/datacite-hiring-another-application-developer/" }'```
or accepts a CSV file with the columns being: pid, url
Example:
```python scripts/seed.py myurls.csv```
## Data Dump
To retrieve the results from the scraping you can use the dump.py script to output the data:
```python scripts/dump.py mydata.csv```
# Development
## Requirements
* Python 3
## Install python libraries
`pip install -r requirements.txt`
## Scrapy
It is a scrapy project so regular scrapy crawl commands should work.