{"id":22147254,"url":"https://github.com/centerforopenscience/scrapi","last_synced_at":"2025-10-26T21:43:52.958Z","repository":{"id":18130078,"uuid":"21210102","full_name":"CenterForOpenScience/scrapi","owner":"CenterForOpenScience","description":"A data processing pipeline that schedules and runs content harvesters, normalizes their data, and outputs that normalized data to a variety of output streams. This is part of the SHARE project, and will be used to create a free and open dataset of research (meta)data. Data collected can be explored at https://osf.io/share/, and viewed at https://osf.io/api/v1/share/search/. Developer docs can be viewed at https://osf.io/wur56/wiki","archived":false,"fork":false,"pushed_at":"2016-06-22T03:05:12.000Z","size":71588,"stargazers_count":41,"open_issues_count":23,"forks_count":45,"subscribers_count":9,"default_branch":"develop","last_synced_at":"2024-04-14T05:19:04.390Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CenterForOpenScience.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-06-25T16:21:08.000Z","updated_at":"2023-07-27T05:58:48.000Z","dependencies_parsed_at":"2022-08-26T18:22:52.006Z","dependency_job_id":null,"html_url":"https://github.com/CenterForOpenScience/scrapi","commit_stats":null,"previous_names":[],"tags_count":80,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CenterForOpenScience%2Fscrapi","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CenterForOpenScience%2Fscrapi/tags
","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CenterForOpenScience%2Fscrapi/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CenterForOpenScience%2Fscrapi/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CenterForOpenScience","download_url":"https://codeload.github.com/CenterForOpenScience/scrapi/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227642219,"owners_count":17797850,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-01T23:15:36.732Z","updated_at":"2025-10-26T21:43:47.924Z","avatar_url":"https://github.com/CenterForOpenScience.png","language":"Python","readme":"scrapi\n======\n\n```master``` build status: [![Build Status](https://travis-ci.org/CenterForOpenScience/scrapi.svg?branch=master)](https://travis-ci.org/CenterForOpenScience/scrapi)\n\n\n```develop``` build status: [![Build Status](https://travis-ci.org/CenterForOpenScience/scrapi.svg?branch=develop)](https://travis-ci.org/CenterForOpenScience/scrapi)\n\n\n[![Coverage Status](https://coveralls.io/repos/CenterForOpenScience/scrapi/badge.svg?branch=develop)](https://coveralls.io/r/CenterForOpenScience/scrapi?branch=develop)\n[![Code Climate](https://codeclimate.com/github/CenterForOpenScience/scrapi/badges/gpa.svg)](https://codeclimate.com/github/CenterForOpenScience/scrapi)\n\n## Getting started\n\n- To run absolutely everything, you will need to:\n    - Install Python\n      - To check what version you have: $python 
--version\n    - Install pip to download Python packages\n    - Install Cassandra, or Postgres, or both (optional)\n    - Install requirements\n    - Install Elasticsearch  \n    - Install RabbitMQ (optional)\n- You do not have to install RabbitMQ if you're only running the harvesters locally.\n- Both Cassandra and Postgres aren't really necessary, you can choose which one you'd like, or use both. If you install neither, you can use local storage instead. In your settings, you'll specify a CANONICAL_PROCESSOR, just make sure that one is installed.\n\n### Installing virtualenv and virtualenvwrapper\n\n####  Mac OSX\n\n```bash\n$pip install virtualenv\n$pip install virtualenvwrapper\n```\n\nFor further information on installing virtualenv and virtualenvwrapper:\n[http://docs.python-guide.org/en/latest/dev/virtualenvs/]\n\n\n#### Ubuntu\n\n```bash\n$ sudo apt-get install python-pip python-dev build-essential libxml2-dev libxslt1-dev\n$ pip install virtualenv\n$ sudo pip install virtualenv virtualenvwrapper\n$ sudo pip install --upgrade pip\n```\nCreate a backup of your .bashrc file\n```bash\n$ cp ~/.bashrc ~/.bashrc-org Create a backup of\n$ printf '\\n%s\\n%s\\n%s' '# virtualenv' 'export WORKON_HOME=~/virtualenvs' 'source /usr/local/bin/virtualenvwrapper.sh' \u003e\u003e ~/.bashrc\n```\nEnable the virtual environment\n```bash\n$ source ~/.bashrc\n$ mkdir -p $WORKON_HOME\n$ mkvirtualenv scrapi\n```\nTo exit the virtual environment\n```bash\n$ deactivate\n```\nTo enter the virtual environment\n```bash\n$ workon scrapi\n```\n\n### Forking and cloning scrapi materials from Github\n\n\nCreate a Github account\nFork the scrapi repository to your account\n\nInstall Git\n```bash\n$ sudo apt-get update\n$ sudo apt-get install git\n$ git clone https://github.com/your-username/scrapi\n```\n\n### Installing Postgres\n\nPostgres is required only if \"postgres\" is specified in your settings, or if RECORD_HTTP_TRANSACTIONS is set to ```True```.\n\n#### Mac OSX\n\nBy far, the 
simplest option is to install the postgres Mac OSX app:\n- http://postgresapp.com/\n\nTo instead install via command line, run:\n\n```bash\n$ brew install postgresql\n$ ln -sfv /usr/local/homebrew/opt/postgresql/*.plist ~/Library/LaunchAgents\n$ launchctl load ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist\n```\n\n#### Ubuntu\nInside your scrapi checkout:\n```bash\n$ sudo apt-get update\n$ sudo apt-get install postgresql\n$ sudo service postgresql start\n```\n\n#### Running on Ubuntu\nInside your scrapi checkout:\n```bash\n$ sudo -u postgres createuser your-username\n$ sudo -u postgres createdb -O your-username scrapi\n```\n\n#### Running on Mac OSX\n\nInside your scrapi checkout:\n```bash\n$ createdb scrapi\n$ invoke apidb\n```\n\n\n### Installing Cassandra\n\nCassandra is required only if \"cassandra\" is specified in your settings, or if RECORD_HTTP_TRANSACTIONS is set to ```True```.\n\n_Note: Cassandra requires JDK 7._\n\n#### Mac OSX\n\n```bash\n$ brew install cassandra\n```\n\n#### Ubuntu\n\n1. Check which version of Java is installed by running the following command:\n   ```bash\n   $ java -version\n   ```\n   Use the latest version of Oracle Java 7 on all nodes.\n\n2. Add the DataStax Community repository to the /etc/apt/sources.list.d/cassandra.sources.list\n   ```bash\n   $ echo \"deb http://debian.datastax.com/community stable main\" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list\n   ```\n\n3.  Add the DataStax repository key to your aptitude trusted keys.\n    ```bash\n    $ curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -\n    ```\n\n4. 
Install the package.\n   ```bash\n   $ sudo apt-get update\n   $ sudo apt-get install cassandra\n   ```\n\n#### Running\n\n```bash\n$ cassandra\n```\n\n\nOr, if you'd like your cassandra session to be bound to your current session, run:\n```bash\n$ cassandra -f\n```\n\nand you should be good to go.\n\n\n### Requirements\n\n- Create and enter virtual environment for scrapi, and go to the top level project directory. From there, run\n\n\n#### Ubuntu\n```bash\n$ sudo apt-get install libpq-dev python-dev\n$ pip install -r requirements.txt\n$ pip install -r dev-requirements.txt\n```\n\n#### Mac OSX\n```bash\n$ pip install -r requirements.txt\n```\nOr, if you'd like some nicer testing and debugging utilities in addition to the core requirements, run\n```bash\n$ pip install -r dev-requirements.txt\n```\n\nThis will also install the core requirements like normal.\n\n### Installing Elasticsearch\n\n_Note: Elasticsearch requires JDK 7._\n\n#### Mac OSX\n\n```bash\n$ brew install homebrew/versions/elasticsearch17\n```\n\n#### Ubuntu\n\n1. Install Java\n   ```bash\n   $ sudo apt-get install openjdk-7-jdk \n   ```\n\n2. Download and install the Public Signing Key.\n   ```bash\n   $ wget -qO - https://packages.elasticsearch.org/GPG-KEY-elasticsearch | sudo apt-key add -\n   ```\n\n3. Add the ElasticSearch repository to your /etc/apt/sources.list.\n   ```bash\n   $ sudo add-apt-repository \"deb http://packages.elasticsearch.org/elasticsearch/1.4/debian stable main\"\n   ```\n\n4. 
Install the package\n   ```bash\n   $ sudo apt-get update\n   $ sudo apt-get install elasticsearch\n```\n\n#### Running on Ubuntu\n```bash\n$ sudo service elasticsearch start\n\n```\n\n#### Running on Mac OSX\n\n```bash\n$ elasticsearch\n```\n\n### RabbitMQ (optional)\n\n_Note, if you're developing locally, you do not have to run RabbitMQ!_\n\n#### Mac OSX\n\n```bash\n$ brew install rabbitmq\n```\n\n#### Ubuntu\n\n```bash\n$ sudo apt-get install rabbitmq-server\n```\n\n### Create Databases\nCreate databases for Postgres and Elasticsearch - only for local development!\n\n```bash\n$ invoke reset_all\n```\n\n\n### Settings\n\nYou will need to have a local copy of the settings. Copy local-dist.py into your own version of local.py:\n\n```bash\ncp scrapi/settings/local-dist.py scrapi/settings/local.py\n```\n\nCopy over the api settings:\n\n```bash\ncp api/api/settings/local-dist.py api/api/settings/local.py\n```\n\nIf you installed Cassandra, Postgres, and Elasticsearch earlier, you will want add something like the following configuration to your local.py, based on the databases you have:\n\n```python\nRECORD_HTTP_TRANSACTIONS = True  # Only if cassandra or postgres are installed\n\nRAW_PROCESSING = ['cassandra', 'postgres']\nNORMALIZED_PROCESSING = ['cassandra', 'postgres', 'elasticsearch']\nCANONICAL_PROCESSOR = 'postgres'\nRESPONSE_PROCESSOR = 'postgres'\n```\n\nFor raw and normalized processing, add the databases you have installed. Only add elasticsearch to normalized processing, as it does not have a raw processing module.\n\n```RAW_PROCESSING``` and ```NORMALIZED_PROCESSING``` are both lists, so you can add as many processors as you wish. ```CANONICAL_PROCESSOR``` and ```RESPONSE_PROCESSOR``` both are single processors only.\n\n_note: Cassandra processing will soon be phased out, so we recommend using Postgres for your processing needs. 
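Since a mismatch between these settings is easy to miss, the constraints above (elasticsearch is normalized-only; the canonical processor must actually be configured; recording transactions needs cassandra or postgres) can be sketched as a small sanity check. This is illustrative only, not scrapi's actual validation logic; the `check_settings` helper and the sample values are hypothetical.

```python
# Illustrative local.py-style values (not scrapi's real defaults).
RECORD_HTTP_TRANSACTIONS = True
RAW_PROCESSING = ['postgres']
NORMALIZED_PROCESSING = ['postgres', 'elasticsearch']
CANONICAL_PROCESSOR = 'postgres'
RESPONSE_PROCESSOR = 'postgres'


def check_settings():
    """Return a list of configuration problems (empty means OK)."""
    problems = []
    # Elasticsearch has no raw processing module, so it may only be normalized.
    if 'elasticsearch' in RAW_PROCESSING:
        problems.append("'elasticsearch' cannot be a raw processor")
    # The canonical processor must be among the processors you configured.
    if CANONICAL_PROCESSOR not in set(RAW_PROCESSING) | set(NORMALIZED_PROCESSING):
        problems.append('CANONICAL_PROCESSOR is not among the configured processors')
    # Recording HTTP transactions requires cassandra or postgres.
    if RECORD_HTTP_TRANSACTIONS and not {'cassandra', 'postgres'} & set(RAW_PROCESSING + NORMALIZED_PROCESSING):
        problems.append('RECORD_HTTP_TRANSACTIONS needs cassandra or postgres')
    return problems


print(check_settings())  # → []
```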
If you'd like to use local storage, make sure your local.py has the following configuration:

```python
RECORD_HTTP_TRANSACTIONS = False

NORMALIZED_PROCESSING = ['storage']
RAW_PROCESSING = ['storage']
```

This saves all harvested/normalized files to the directory `archive/<source>/<document identifier>`.

_Note: Be careful with this; if you harvest too many documents with the storage module enabled, you could start experiencing inode errors._

If you'd like to be able to run all harvesters, you'll need to [register for a PLOS API key](http://api.plos.org/registration/), a [Harvard Dataverse API Key](https://dataverse.harvard.edu/dataverseuser.xhtml?editMode=CREATE&redirectPage=%2Fdataverse.xhtml), and a [Springer API Key](https://dev.springer.com/signup).

Add your API keys to your local.py file:

```python
PLOS_API_KEY = 'your-api-key-here'
HARVARD_DATAVERSE_API_KEY = 'your-api-key-here'
SPRINGER_API_KEY = 'your-api-key-here'
```

### Running the scheduler (optional)

From the top-level project directory, run:

```bash
$ invoke beat
```

to start the scheduler, and

```bash
$ invoke worker
```

to start the worker.

### Harvesters

Run all harvesters with:

```bash
$ invoke harvesters
```

or just one with:

```bash
$ invoke harvester harvester-name
```

For testing local development, running the `mit` harvester is recommended.

Note: harvester-name is the same as the harvester's defined "short name".

Invoke a harvester for a certain start date with the `--start` or `-s` argument, and for a certain end date with the `--end` or `-e` argument.

For example, to run a harvester between the dates of March 14th and March 16th, 2015, run:

```bash
$ invoke harvester harvester-name --start 2015-03-14 --end 2015-03-16
```

Either --start or --end can also be used on its own. Supplying neither argument defaults to starting the number of days specified in `settings.DAYS_BACK` before today and ending on the current date. If --end is given with no --start, start defaults to the number of days specified in `settings.DAYS_BACK` before the given end date.
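The date-defaulting rules above can be sketched as a small function. This is a sketch of the described behavior, not scrapi's actual code; `DAYS_BACK` here is a stand-in for `settings.DAYS_BACK`.

```python
from datetime import date, timedelta

DAYS_BACK = 1  # stand-in for settings.DAYS_BACK


def resolve_harvest_dates(start=None, end=None, today=None):
    """Apply the defaulting rules described above.

    - neither given: harvest the last DAYS_BACK days, ending today
    - only --end given: start DAYS_BACK days before the given end
    - only --start given: end defaults to today
    """
    today = today or date.today()
    if end is None:
        end = today
    if start is None:
        start = end - timedelta(days=DAYS_BACK)
    return start, end


# e.g. invoking with only --end 2015-03-16:
print(resolve_harvest_dates(end=date(2015, 3, 16)))
# → (datetime.date(2015, 3, 15), datetime.date(2015, 3, 16))
```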
### Automated OAI-PMH Harvester Creation

Writing a harvester for inclusion with scrAPI? If the provider makes their metadata available using the OAI-PMH standard, then [autooai](https://github.com/erinspace/autooai) is a utility that will do most of the work for you.
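Under the hood, an OAI-PMH harvester mostly issues `ListRecords` requests with a date range. As a minimal sketch of what such a request looks like (the base URL below is a placeholder, not a real provider endpoint; the verb and argument names come from the OAI-PMH protocol itself):

```python
from urllib.parse import urlencode


def build_listrecords_url(base_url, from_date, until_date, prefix='oai_dc'):
    """Build an OAI-PMH ListRecords request URL for the given date range."""
    params = {
        'verb': 'ListRecords',       # the OAI-PMH verb for bulk record harvest
        'metadataPrefix': prefix,    # oai_dc (Dublin Core) is the required baseline format
        'from': from_date,
        'until': until_date,
    }
    return base_url + '?' + urlencode(params)


print(build_listrecords_url('http://example.org/oai', '2015-03-14', '2015-03-16'))
# → http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2015-03-14&until=2015-03-16
```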
### Working with the OSF

To configure scrapi to work in a local OSF dev environment:

1. Ensure `'elasticsearch'` is in the `NORMALIZED_PROCESSING` list in `scrapi/settings/local.py`
2. Run at least one harvester
3. Configure the `share_v2` alias
4. Generate the provider map

#### Aliases

Multiple SHARE indices may be used by the OSF. By default, the OSF uses the `share_v2` index. Activate this alias by running:

```bash
$ inv alias share share_v2
```

Note that aliases must be activated before the provider map is generated.

#### Provider Map

```bash
$ inv alias share share_v2
$ inv provider_map
```

#### Delete the Elasticsearch index

To remove both the `share` and `share_v2` indices from Elasticsearch:

```bash
$ curl -XDELETE 'localhost:9200/share*'
```

### Testing

To run the tests for the project, just type:

```bash
$ invoke test
```

and all of the tests in the `tests/` directory will be run.

To run a test on a single harvester, just type:

```bash
$ invoke one_test shortname
```

### Pitfalls

#### Installing with anaconda

If you're using anaconda on your system, using pip to install all requirements from scratch from requirements.txt and dev-requirements.txt can result in an ImportError when invoking tests or harvesters. For example:

```
ImportError: dlopen(/Users/username/.virtualenvs/scrapi2/lib/python2.7/site-packages/lxml/etree.so, 2): Library not loaded: libxml2.2.dylib
  Referenced from: /Users/username/.virtualenvs/scrapi2/lib/python2.7/site-packages/lxml/etree.so
  Reason: Incompatible library version: etree.so requires version 12.0.0 or later, but libxml2.2.dylib provides version 10.0.0
```

To fix:

- run `pip uninstall lxml`
- remove anaconda/bin from your system path in your bash_profile
- reinstall the requirements as usual

Answer found in [this Stack Overflow question and answer](http://stackoverflow.com/questions/23172384/lxml-runtime-error-reason-incompatible-library-version-etree-so-requires-vers).
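A quick way to see whether you are hitting this class of problem is to compare the libxml2 version lxml was compiled against with the one it loaded at runtime; lxml exposes both as constants. A small diagnostic sketch (guarded so it also reports when lxml itself fails to import):

```python
def report_lxml_libxml2():
    """Report lxml's compiled-against vs. runtime libxml2 versions.

    A mismatch like the error above shows up as differing version tuples.
    """
    try:
        from lxml import etree
    except ImportError as exc:
        return 'lxml not importable: %s' % exc
    return 'compiled against libxml2 %s, running with %s' % (
        etree.LIBXML_COMPILED_VERSION, etree.LIBXML_VERSION)


print(report_lxml_libxml2())
```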
### Institutions!

Scrapi supports the addition of institutions in a separate index (`institutions`). Unlike data stored in the `share` indices, institution metadata is updated much less frequently, meaning that simple parsers can be used to manually load data from providers instead of using scheduled harvesters.

Currently, data from [GRID](https://grid.ac/) and [IPEDS](https://nces.ed.gov/ipeds/) is supported:

- GRID: Provides data on international research facilities. The currently used dataset is `grid_2015_11_05.json`, which can be found [here](https://grid.ac/downloads) or, for the full dataset, [here](http://files.figshare.com/2409936/grid_2015_11_05.json). To use this dataset, move the file to `/institutions/`, or override the file path and/or name in `tasks.py`. It can be loaded individually using the function `grid()` in `tasks.py`.
- IPEDS: Provides data on postsecondary education institutions in the US. The currently used dataset is `hd2014.csv`, which can be found [here](https://nces.ed.gov/ipeds/Home/UseTheData) by clicking on Survey Data -> Complete data files -> 2014 -> Institutional Characteristics -> Directory information, or can be downloaded directly [here](https://nces.ed.gov/ipeds/datacenter/data/HD2014.zip). This will give you a file named `HD2014.zip`, which can be unzipped into `hd2014.csv` by running `unzip HD2014.zip`. To use this dataset, move the file to `/institutions/`, or override the file path and/or name in `tasks.py`. It can be loaded individually using the function `ipeds()` in `tasks.py`.

Running `invoke institutions` will properly load the institution data into Elasticsearch, provided the datasets are in place.

### COS is Hiring!

Want to help save science? Want to get paid to develop free, open source software? [Check out our openings!](http://cos.io/jobs)
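As an aside on the Institutions workflow above: the IPEDS directory file is a plain CSV, so the "simple parser" it mentions can be as small as a `csv.DictReader` loop. A sketch over an inline two-row stand-in for `hd2014.csv` (the `UNITID`/`INSTNM`/`STABBR` column names are what this sketch assumes; check the real file's header before relying on them):

```python
import csv
import io

# Two-row stand-in for hd2014.csv; assumed column names, not the full header.
SAMPLE = io.StringIO(
    'UNITID,INSTNM,STABBR\n'
    '100654,Alabama A & M University,AL\n'
    '100663,University of Alabama at Birmingham,AL\n'
)


def load_institutions(fobj):
    """Parse an IPEDS-style directory file into simple dicts,
    illustrating the kind of manual parser the institutions index uses."""
    return [
        {'id': row['UNITID'], 'name': row['INSTNM'], 'state': row['STABBR']}
        for row in csv.DictReader(fobj)
    ]


print(load_institutions(SAMPLE)[0]['name'])  # → Alabama A & M University
```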