[![Maintainability](https://api.codeclimate.com/v1/badges/74b6e379ab6fc90ab149/maintainability)](https://codeclimate.com/github/psu-stewardship/researcher-metadata/maintainability) [![Test Coverage](https://api.codeclimate.com/v1/badges/74b6e379ab6fc90ab149/test_coverage)](https://codeclimate.com/github/psu-stewardship/researcher-metadata/test_coverage)

![Penn State Libraries Logo](https://metadata.libraries.psu.edu/psu_libraries.png)

# Researcher Metadata Database (RMD)

This is the repository for a Ruby on Rails application built for Penn State University Libraries to
gather metadata about Penn State faculty and the research that they conduct and publish. The application
provides a means of semi-automated data importing from several different sources, an administrative
interface for Penn State Libraries admins to manage and curate the data, and an API for end-users and
other applications to access the data. One specific use case for the API is to provide all of the data
needed to produce the kind of profile web page for each faculty member that might be found in the faculty
directory on a department website. The application also provides an interface for faculty users to see
an example of what a profile page about them would look like, and a management interface for them to
adjust some preferences with regard to how data appears in their profile. This interface includes features
that allow ORCiD users to automatically send metadata from RMD to populate their ORCiD records. It also
includes features that support Penn State's open access policy/initiative by allowing faculty users to
submit metadata about the open access status of each of their publications.

## Data Importing and Updating

For this application to be relevant and useful, it is important for the data in the production database
to be kept relatively "clean" and current. For some of the external data sources, the process of extracting
data from the source and importing it into RMD has been fully automated, and the process is performed on a
regular basis. For other sources, the process involves several manual steps, and we try to update data from
these sources at least a couple of times per year.

The process for data that must be imported manually generally consists of three steps (sketched below):
1. Gather new data in the form of correctly-formatted file(s) from the source
1. Place the new data file(s) in the conventional location on the production server
1. Run the Rake task to automatically import all of the data from the new file(s)
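
A hypothetical sketch of steps 2 and 3, where the server paths and the rake task name are placeholders (the real file names and task names are defined in `lib/tasks/imports.rake` and in the source-specific notes below):

```
# Step 2: place the correctly-formatted file in the conventional location on the production server
# (host and paths are placeholders for this sketch)
scp etds.csv deploy@rmd-production:/path/to/app/shared/db/data/etds.csv

# Step 3: run the corresponding import task on the server
# (the task name here is a placeholder; see lib/tasks/imports.rake for the real task names)
ssh deploy@rmd-production 'cd /path/to/app/current && RAILS_ENV=production bundle exec rake import:etds'
```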

### Data Sources

We import data from a number of other web applications and databases that contain data about Penn State
faculty and research, and we're continuing to add new data sources as we find or gain access to them. Some
of the types of records in our database are currently imported from a single source, and others may be
imported from multiple sources.

Below is a list of the data sources along with the types of records that are imported from each:

1. **Activity Insight** - This is a web application/database made by Digital Measures where faculty enter a
wide variety of data about themselves mainly for the purpose of job/performance review, attaining tenure,
etc. We use this application's [REST API](https://webservices.digitalmeasures.com/login/service/v4) (API key
required) for directly importing data. A cron job automatically runs this import in production once per weekday
beginning at around 10:15 PM, and the import usually takes around 8 hours to finish. We import the
following types of records from Activity Insight:
- authorships
- contributor_names
- education_history_items
- performances
- performance_screenings
- presentations
- presentation_contributions
- publications
- activity_insight_oa_files
- users
- user_performances

1. **Pure** - This is a web application/database made by Elsevier. It contains data about Penn State
researchers, their published research, and the organizations within Penn State to which they belong.
This data mostly has to do with scientific research that is published in peer-reviewed journals, so unlike
Activity Insight, which provides data about faculty across the whole university, this source gives us relatively
little data about faculty in the arts and humanities. This application also has a
well-documented [REST API](https://pure.psu.edu/ws/api/524/api-docs/index.html) (API key
required) to which we have access. We use this API for directly importing data. A cron job automatically
runs this import in production once per week beginning at around 8:00 AM on Saturday, and the import
may take multiple hours to finish. We import the following types of records from Pure:
- authorships
- contributor_names
- journals
- organizations
- publications
- publication_taggings
- publishers
- tags
- user_organization_memberships
- users

1. **eTD** - This is a web application/database developed by the Penn State Libraries that facilitates
the submission and archival of PhD dissertations and master's theses in digital format by graduate students.
Our main reason for importing metadata from this source is to be able to show the graduate student advising
that each faculty member has done. Because the application currently has no API, we don't have a way to
automate the importing of data. Currently, we obtain an SQL dump of the eTD database, load this dump into a
local MySQL database and export several .csv files which we then import into our database. We import the
following types of records from eTD:
- committee_memberships
- etds

1. **Penn State News RSS feeds** - The Penn State News website publishes many
[RSS feeds](https://news.psu.edu/rss-feeds), and we import news story metadata directly from several of
them whenever a story involves a specific Penn State faculty member. A cron job automatically runs this
import in production once per hour. We import the following types of records
from news.psu.edu:
- news_feed_items

1. **Penn State LDAP** - We import some data for existing user records from Penn State's directory of people.
A cron job runs this import in production once per hour. ORCIDs are obtained from this import.

1. **Web of Science** - We performed a one-time import of a static set of publication and grant data from
Web of Science for publications that were published from 2013 to 2018. We obtained a copy of this data on
a physical disk. We import the following types of records from Web of Science:
- authorships
- contributor_names
- grants
- publications
- research_funds

1. **National Science Foundation** - We import grant data that we download from the National Science
Foundation [website](https://nsf.gov/awardsearch/download.jsp) in the form of XML files. We import the
following types of records from NSF:
- grants
- researcher_funds

1. **Open Access Button** - We import information about open access copies of publications that is provided by
Open Access Button via their web [API](https://openaccessbutton.org/api). There are three different importers
configured to run with cron. The first imports Open Access Button metadata for publications with DOIs using Open Access
Button's DOI search endpoint. The second gathers metadata using Open Access Button's title search endpoint. Searching by DOI is
faster, so the DOI-based import takes only several days to complete. This process runs every Sunday at 8:00 AM. Searching by title
can be slow, so that process runs on the 1st and 15th day of every month at 8:00 AM and sometimes takes more than a week to complete.
The third importer only imports Open Access Button info for new publications - that is, any publication without an `open_access_button_last_checked_at`
timestamp. This import is fairly quick and runs at 10:00 PM every Sunday. It is meant to run after all of the latest Activity Insight and Pure data has been imported,
and before the weekly open access emails go out. We import the following types of records from Open Access Button:
- open_access_locations

1. **Unpaywall** - Very similarly to Open Access Button, we import metadata about open access copies of
publications from Unpaywall's web [API](https://unpaywall.org/products/api). Like the Open Access Button
import, we search Unpaywall by DOI or title (if the publication has no DOI). However, this is not
split into two imports like the Open Access Button import, since the title search for Unpaywall is much
faster. Much of the data imported from Unpaywall overlaps with data imported from Open Access Button, but each
source may provide some metadata that is not provided by the other. In general, Unpaywall provides richer
metadata than Open Access Button. In addition to importing metadata about any open access copies of a
publication, this import also updates the open access status on publication records in RMD. In production,
a cron job automatically runs this import once per week beginning at 8:00 PM on Tuesdays. It can take
multiple days to finish due to API rate limiting. Just like the Open Access Button import, there is a second weekly
Unpaywall import on Sunday at 10:00 PM that only imports Unpaywall data for new publications.
We import the following types of records from Unpaywall:
- open_access_locations

1. **Penn State Law School repositories** - We import publication metadata from the repositories maintained
by the Penn State Law School at University Park and the Dickinson School of Law at Carlisle via the Open
Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Dickinson maintains the
[IDEAS repository](https://ideas.dickinsonlaw.psu.edu/), and Penn State Law maintains the
[Penn State Law eLibrary](https://elibrary.law.psu.edu/). Part of our reason for importing metadata from
these sources is to facilitate onboarding the law school faculty into Activity Insight. In the future, we
may not need to import data from these sources since new data may eventually be available via Activity Insight.
This import is not currently scheduled to run automatically. We import the following types of records from these
repositories:
- authorships
- contributor_names
- publications
- open_access_locations
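
In production, the automated imports described above are scheduled with cron. The real schedule lives in the deployment configuration, but as a rough illustration (the task names here are placeholders), a few of the entries look something like:

```
# Illustrative crontab sketch only -- the actual entries and task names are managed by the deployment
# Activity Insight: once per weekday, starting around 10:15 PM
15 22 * * 1-5  cd /path/to/app/current && RAILS_ENV=production bundle exec rake import:activity_insight
# Pure: once per week, Saturday around 8:00 AM
0 8 * * 6      cd /path/to/app/current && RAILS_ENV=production bundle exec rake import:pure
# Penn State News RSS feeds and Penn State LDAP: once per hour
0 * * * *      cd /path/to/app/current && RAILS_ENV=production bundle exec rake import:news
30 * * * *     cd /path/to/app/current && RAILS_ENV=production bundle exec rake import:ldap
```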

### Obtaining New Data
While much of our data importing is fully automated, some of it involves parsing files that are manually
exported/downloaded from a data source. By convention, we place those files in the `db/data/` directory
within the application and give them the names that are defined in `lib/tasks/imports.rake`. This directory
in the repository is ignored by revision control, and on the application servers it is shared between
application releases. The data sources that involve a manual export/import process are described below.

#### eTD
The process for obtaining and preparing eTD data for import involves several steps.
1. Obtain an SQL dump of the production database for the graduate school instance of the eTD app.
1. Create a MySQL database locally and load in the eTD production dump.
1. The production database dump contains a view called `student_submissions` that presents most of the eTD data
that we need to import, but it's missing one column that we need, `submission_public_id`. So, in our local database
we create an additional view that includes that column:

```sql
CREATE ALGORITHM=UNDEFINED DEFINER=`root`@`localhost` SQL SECURITY DEFINER VIEW `metadata_import_view`
AS SELECT
`s`.`id` AS `submission_id`,
`s`.`semester` AS `submission_semester`,
`s`.`year` AS `submission_year`,
`s`.`created_at` AS `submission_created_at`,
`s`.`status` AS `submission_status`,
`s`.`access_level` AS `submission_acccess_level`,
`s`.`title` AS `submission_title`,
`s`.`abstract` AS `submission_abstract`,
`s`.`public_id` AS `submission_public_id`,
`authors`.`access_id` AS `access_id`,
`authors`.`first_name` AS `first_name`,
`authors`.`middle_name` AS `middle_name`,
`authors`.`last_name` AS `last_name`,
`authors`.`alternate_email_address` AS `alternate_email_address`,
`authors`.`psu_email_address` AS `psu_email_address`,
`p`.`name` AS `program_name`,
`d`.`name` AS `degree_name`,
`d`.`description` AS `degree_description`,
`id`.`id_number` AS `inv_disclosure_num`,
group_concat(concat(`cm`.`name`,_utf8'|',nullif(`cm`.`email`,_utf8'|'),_utf8' ',`cr`.`name`) order by `cr`.`id` ASC separator ' || ') AS `committee_members`
FROM ((((((`submissions` `s`
  left join `invention_disclosures` `id` on((`s`.`id` = `id`.`submission_id`)))
  left join `authors` on((`s`.`author_id` = `authors`.`id`)))
  left join `programs` `p` on((`s`.`program_id` = `p`.`id`)))
  left join `degrees` `d` on((`s`.`degree_id` = `d`.`id`)))
  left join `committee_members` `cm` on((`s`.`id` = `cm`.`submission_id`)))
  left join `committee_roles` `cr` on((`cm`.`committee_role_id` = `cr`.`id`)))
group by `s`.`id`;
```
NOTE: The default SQL mode in MySQL 5.7 and later enables several modes including `ONLY_FULL_GROUP_BY` that were turned off by default in older versions. `ONLY_FULL_GROUP_BY` causes the `student_submissions` view to be rejected. Here's some background on this issue:

https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sql-mode-changes
https://www.percona.com/blog/2019/05/13/solve-query-failures-regarding-only_full_group_by-sql-mode/

To restore operation for the `student_submissions` view, we've been turning off all SQL modes when starting a new MySQL
session like so: `mysql.server start --sql-mode=''`. To verify that no modes are active, go into MySQL and run:
```
mysql> SELECT @@sql_mode;
```
Note: This approach is somewhat heavy-handed, but the results from the `student_submissions` view seem okay. A more precise
approach would be to leave all of the other modes activated and omit only `ONLY_FULL_GROUP_BY` when starting a new MySQL session.
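For example, assuming the MySQL 5.7 defaults, that means starting the server with every default mode except `ONLY_FULL_GROUP_BY`:

```
# Start the local MySQL server with the MySQL 5.7 default modes minus ONLY_FULL_GROUP_BY
mysql.server start --sql-mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION'

# Confirm which modes are active
mysql -u root -e 'SELECT @@sql_mode;'
```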

1. After starting up MySQL with SQL modes turned off, dump the new view of the eTD data to a tab-separated file with `echo 'SELECT submission_id, submission_semester, submission_year, submission_status, submission_acccess_level, submission_title, access_id, first_name, middle_name, last_name, degree_name, submission_public_id FROM metadata_import_view' | mysql -B -u root etdgradrailprod > etds.tsv`,
substituting your local database name if you called it something different.
1. Open the TSV file in vim. You'll probably see that the file is littered with `^M` control characters that we
need to remove. Enter the command `:%s/^M//g` to remove all of them (NOTE: to produce the `^M` in that vim command,
you'll need to literally type Ctrl-V followed by Ctrl-M, not a caret followed by M). Write the file.
1. Open the TSV file in Excel or Numbers and export it as a UTF-8 encoded CSV file. Save it in the data import
directory in this project (`db/data/`) as `etds.csv`.
1. Dump the committee data with `echo 'SELECT submission_id, email, committee_role_id FROM committee_members' | mysql -B -u root etdgradrailprod > etd_committees.tsv`
again substituting the name of your local database if necessary.
1. Again, open `etd_committees.tsv` in Excel or Numbers and export it as a UTF-8 encoded CSV file, `db/data/committees.csv`.

#### National Science Foundation
In the `lib/utilities/` directory in this repository, there is a utility script for automatically downloading grant
data from the National Science Foundation website and preparing it for import by decompressing the files and placing
them in the correct location.

#### Web of Science
We may be able to acquire more Web of Science data on another physical disk again in the future. In the past,
the individual data files have been so large that disk space on the production server is a limiting factor for how
many files we can prepare for import at one time. The general procedure has been to upload one large file to
the server, import the data from that file, delete that file from the server, and then upload the next file
and repeat until all of the files have been imported.

### Importing New Data
Once updated data files have been obtained (if applicable) and the necessary API keys have been installed,
importing new data is just a matter of running the appropriate rake task. These tasks are all defined in
`lib/tasks/imports.rake`. An individual task is defined for importing each type of data from each source. We also
define a single task that imports all types of data from all sources, `rake import:all`. In practice there's no
need to run this task in production since much of the importing is scheduled to run automatically. This task serves
mostly to help you import data in development, and to give you an idea of the order in which the individual imports
should generally be run so as to get the most complete set of data in one pass. All of these tasks are designed
to be idempotent given the same source data and database state. For the data sources that require manual export/import,
the individual import tasks will occasionally need to be run in production. While it's sometimes necessary to import
data directly from the sources in development for QA/testing, it's much faster to establish a useful development
database by loading a dump of the production data.
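
For example, a typical development setup might look like the following sketch (the database name and dump file are placeholders, and the dump is assumed to be in pg_dump's custom format):

```
# Run every import task in order -- mostly useful in development
bundle exec rake import:all

# Or, usually faster: restore a dump of the production database into your development database
# (names and paths are placeholders)
pg_restore --clean --no-owner --dbname=rmd_development production.dump
```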

### Identifying Duplicate Publication Data
Because we import metadata about research publications from more than one source, and because duplicate entries
sometimes exist even within the same data source, we need a means of finding multiple records in our database that
represent the same publication after we have finished importing data. This helps to ensure that users don't receive
duplicate data when they query our own API. There is logic built into the processes that import publication metadata
that will search for possible duplicate publications that already exist and group them as each new publication
record is imported. This allows admin users to later review the groups and pick which record in each group to keep
while discarding the rest. Subsequent imports of the same data will then not recreate the discarded duplicates.

Whenever suspected duplicate publication records are grouped as they're being imported, those records are also
sometimes automatically hidden so that they do not appear as duplicates in API responses, user profiles, etc. until
the duplication can be resolved. Because the publication data that we import from Pure is relatively clean, reliable,
and free of duplication compared to some of the other data sources, we _don't_ automatically hide publications
that have been imported from Pure when they're added to a duplicate group, but we _do_ automatically hide such
publications imported from less reliable sources. This allows the data from the Pure import to be available to our
users even while possible duplication remains to be resolved. Whenever a duplicate publication group is merged,
the resulting publication is once again automatically made visible if it had been previously hidden.

### Merging Duplicate Publication Records
Whenever we import a new publication, we create a record in two different tables in the database. We create a record
for the publication itself which contains the publication's metadata, and we create a record of the import which
contains information about the source of the import (the name of the source, and the unique identifier for the
publication within that source). The record of the import has a foreign key to the record of the publication.

Then, whenever we run a publication import, for each publication in the input data, the first thing that we do is
look for an import record that has the same source name and unique identifier. If one exists, then we know that we
already imported the publication from this source, and we may or may not update the existing publication record
depending on whether or not the record has been updated by an administrator. If no import record exists, we create
new records as described above.

When we discover two publication records that are duplicates, each of those will have its own import record as well.
For example, Publication Record 1 has an associated import record with the import_source being "Activity Insight"
and the source_identifier being "123". Publication Record 2 is a duplicate of publication record 1, and it has an
associated import record with the import_source being "Pure" and the source_identifier being "789". Whenever an admin
user merges these two publication records, the following happens:
1. The admin user picks one publication record to keep (probably the one with the most complete/accurate metadata).
1. We take the import record from the publication record that we're not keeping and reassign it to the publication record that we are keeping.
1. We delete the publication record that we're not keeping.
1. We flag the publication record that we kept as having been manually modified by an admin user because we assume that the admin has picked the record with the best data (and possibly also made corrections to the chosen data after the merge), and we don't want future imports to overwrite this manually curated data.

So in the example, if we merge the duplicate publications and decide to keep Publication Record 2 (originally
imported from Pure), then Publication Record 2 will now have both import records - the one from Pure with ID
"789" and the one from Activity Insight with ID "123" - attached to it, and Publication Record 1 (originally
imported from Activity Insight) will be deleted. Then when we reimport publications from Activity Insight and
we come to this publication in the source data, we'll find the import record for Activity Insight attached to
Publication Record 2, and we won't create a new record even though Publication Record 1 has been deleted from RMD.

Occasionally publication records that are not actually duplicates appear similar enough that they are mistakenly
grouped as potential duplicates by our duplicate identification process. In this situation, it's possible for admin
users to select and group publications as a way of indicating that they have been reviewed and have been determined
to not be duplicates even though they look similar. This will prevent the same publications from automatically being
grouped as potential duplicates again in the future.

#### Auto-merging
The task of manually inspecting possible duplicate publication records and merging them is somewhat tedious. To help
reduce the amount of labor necessary to curate the publication metadata, we have decided that suspected duplicates can be
automatically merged under some circumstances.

Often, when a duplicate group containing one publication imported from Pure and one imported from another source is
merged, the Pure record is the one that is chosen to be kept, and the data in that record needs little or no manual
curation since the data from Pure is generally accurate and complete.
Since a large proportion of duplicate groups end up containing exactly one publication imported from Pure and exactly
one publication imported from Activity Insight, we've created a process by which all such groups can be automatically
merged at once. This process is run as a rake task, `rake auto_merge:duplicate_pubs`. We know that a very small
percentage of publications that are automatically grouped as suspected duplicates are not actually duplicate records.
This means that whenever we perform auto-merging, we're accepting that a small number of false-positive publication
matches are actually being merged when they shouldn't be. We deemed the amount of labor saved by this automation
to be worth the small amount of data that we'll lose from occasionally merging non-duplicate publications by accident.

If two publications have the same DOI, and are grouped as duplicates, they will likely be merged by an admin.
To save some manual merging, we've created a process that will merge grouped publications if they have the same DOI
and some of their metadata either matches exactly or matches very closely. We also have a process that will make an attempt
at merging grouped publications where one or both of the publications do _not_ have a DOI. The merging criteria
are stricter in this case, requiring most of the publications' metadata to match. Both of these processes can be
run at the same time with the rake task `rake auto_merge:duplicate_pubs_on_matching`. The logic
should be altered by a programmer if users feel the matching should be less strict. When merging, this process tries
to determine which data is best to keep between the two records. This is done, in some cases, with somewhat complicated logic.

Both of the above rake tasks are run every Monday at 2:00 AM and 2:30 AM respectively. They are meant to be run
just before the weekly open access emails go out.
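
As a sketch, those crontab entries look roughly like this (the real entries are managed by the deployment):

```
# Auto-merge Pure/Activity Insight duplicate groups -- Mondays at 2:00 AM
0 2 * * 1   cd /path/to/app/current && RAILS_ENV=production bundle exec rake auto_merge:duplicate_pubs
# Auto-merge grouped publications with matching DOIs/metadata -- Mondays at 2:30 AM
30 2 * * 1  cd /path/to/app/current && RAILS_ENV=production bundle exec rake auto_merge:duplicate_pubs_on_matching
```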

### Import Logic
In general, data imports create a new record if no matching record already exists. If a matching record does already
exist, then the import may update the attributes of that record. However, for records that can be fully or partially
updated by admin users of this app, we keep track of whether or not a record has been modified in any way by an admin.
If a record has been modified by an admin user, then subsequent data imports no longer update that record because we assume
that any changes that were made manually by an admin user are corrections or additions that we don't want to overwrite.
In a few circumstances, some metadata fields will still get updated for some of the data models even if the record
has been updated by an admin user. This is for data that is generally not altered by admins and should reflect its source value.
Data imports never delete whole records from our database - even if formerly present records have been removed from the
data source.

Importing `user` records involves some additional logic since we import user data from two sources (Pure and Activity
Insight). We take Activity Insight to be the authority on user data when it is present, so an import of a user's data
from Activity Insight will overwrite existing data for that user that has already been imported from Pure, but not
vice versa.

What constitutes a "match" for the sake of finding and updating existing records depends on the type of record. For
example, we uniquely identify users by their Penn State WebAccess ID. For most other records, we record their unique
ID within the data source.

### Open Access Publications vs. Non-open Access Publications
All publication types are imported during the publication import from Activity Insight and Pure. Journal articles,
conference proceedings, books, editorials, abstracts, and many others are imported as publications. An open access
publication is defined as any publication that has a publication_type to which Open Access policy applies. Non-open access
publications (which are sometimes referred to as "Other Works" in the UI and other_publications in the code) are defined
as any publication that does not have a publication_type to which Open Access policy applies.
Open Access and Non-open Access publications are displayed separately in the user profile interface, public
profile interface, and profile API endpoint. All publications are displayed together in the publication API endpoint.
Non-open access publications are not included in the open access workflow. However, they can be exported to ORCiD if
they have a URL. All publications behave the same in the Admin interface.

### Activity Insight Publication Export
Admins can export publications to Activity Insight through the "Organizations" resource. To do this, admins must go to the organizations
index page and view the details of a specific organization by clicking the "i" on the right side of the datatable. Then, view the
organization's publications by clicking the "View Publications" link in the details. From this screen, admins can filter the publications they
would like to export. Once filtered down to the desired publications, they should click the "Export to Activity Insight" button.
This button takes admins to a page that tells them how many publications they are exporting and from which
organization. From here, publications can be exported either to Activity Insight's beta environment for testing or to the production environment.

The export job is handled in its own process via delayed_job. The output from Activity Insight's API is written to
`log/ai_publication_export.log` under the application's root directory. Depending on how many publications are being exported,
the process can take a while.
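
To keep an eye on a long-running export, you can follow that log on the server, for example:

```
tail -f log/ai_publication_export.log
```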

Publications exported to Activity Insight store the records' `RMD_ID` in Activity Insight. To avoid the cyclical nature of exporting and then
reimporting the same records, Activity Insight records with an `RMD_ID` are skipped during the RMD's Activity Insight import.
Once a record has been exported to Activity Insight, it is flagged so it cannot be exported again.

## API
### Gems
The RMD API is intended to conform to the Swagger 2.0 specification. As such, we're leveraging several gems to simplify the API development workflow while remaining true to the Swagger/OpenAPI standard:
- **Rswag**: a DSL within rspec used to generate a manifest file which is then used to generate interactive API documentation. It also tests a rails API against its OpenAPI (Swagger) description of end-points, models, and query parameters.
- **jsonapi-serializer**: the gem we use for serialization.

### NPM Packages
- **swagger-ui-dist**: serves up live, interactive API documentation as part of our site. It basically provides a sandbox where users can read about the endpoints and then "try them out". It builds all of this from the API file we generate with Rswag.

### Suggested API Development Workflow
- Document the new endpoint using the Rswag DSL. Then generate the swagger manifest with `rake rswag:specs:swaggerize`. Pro tip: Until you get the hang of the DSL, you can use the [interactive swagger editor](https://editor.swagger.io/) for guidance and to check that you have valid Swagger 2.0. Try it out by pasting in the current API file as a starting point: ``` pbcopy < public/api_docs/v1/swagger.yaml ```
- Once you've got valid Swagger documentation of the new endpoint, try running rspec on those Rswag test files. You can add headers, parameters, request bodies, etc. with the Rswag DSL. Once the proper inputs are set and the swagger definitions match up with what the API is returning, you'll start seeing green tests.
- In addition to the Rswag specs, we've also been using request specs to "integration" test our endpoints and params. See ```spec/requests/api/v1/users_spec.rb``` for an example. The Rswag specs are good for ensuring the DSL/documentation jibes with the actual endpoints, but the request specs allow us to be more thorough.
- Once specs are passing it's a good idea to manually test the endpoint using the swagger-ui: ```/api_docs```
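
Putting those steps together, a typical iteration looks something like this (run from the application root; the spec path is the example mentioned above):

```
# Regenerate the Swagger manifest from the Rswag DSL
bundle exec rake rswag:specs:swaggerize

# Run the request specs for the endpoint you're working on (Rswag spec files run the same way)
bundle exec rspec spec/requests/api/v1/users_spec.rb

# Copy the current API definition to the clipboard for the online Swagger editor (macOS)
pbcopy < public/api_docs/v1/swagger.yaml
```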

## ORCID Integration
### Background
Since much of the data that is stored in the Researcher Metadata database is the same data that
people would use to populate their ORCID records, it's useful for us to offer users the ability to
automatically write their data to ORCID. In order to do this, our application must be authorized to
update the ORCID record on behalf of the user.

Penn State has a system that allows users to connect their ORCID records to their Penn State Access
Accounts. This system uses the usual 3-legged OAuth process to obtain the user's ORCID iD and a
read/write access token from ORCID. The access token so obtained has permission to read data from the
user's ORCID record and also to write data to their record via ORCID's API. This access token and ORCID
iD are stored in a database managed by Penn State central IT. The ORCID iD is then published in the
user's Penn State LDAP record.

### Researcher Metadata Database Integration
#### Current Workflow
In our production environment, a job runs hourly that queries Penn State LDAP for each user in our database
and saves their ORCID iD if it is present in their LDAP record. We then use the presence of the ORCID iD
that we obtained in this way to determine if the user can connect their Researcher Metadata profile to
their ORCID record. If we've obtained an ORCID iD from LDAP, then they are able to make this new
connection.

If such a user chooses to connect their Researcher Metadata profile to their ORCID record, then they
go through the 3-legged OAuth process again, and we receive our own read/write access token for the user's
ORCID record. We save the data that we obtained directly from ORCID in this way (including the access
token and the user's ORCID iD) to the user's record in our database and use it for all subsequent ORCID
API calls on behalf of that user. It should be noted that if a user has multiple ORCID accounts, then
the ORCID iD that they've connected to their Penn State Access Account could be different from the ORCID
iD that they've connected to their Researcher Metadata profile.

#### Ideal Workflow
Currently, it's not possible for us to access the ORCID OAuth access tokens that are stored in the database
at Penn State central IT. If it were, then we could simply use those tokens instead of obtaining our own.
This would simplify our process and eliminate the possibility that different ORCID iDs for a given user
could end up being linked to the two different systems. This would also make the workflow much more streamlined
for our users and prevent Penn State from needing to pay for an additional ORCID integration. However, it
will probably add a little complexity in terms of caching access tokens and attempting to retrieve a new
token if we find the one in our cache to no longer be valid.

The best compromise for now is to only allow users who have linked their ORCID records to their Penn State
Access Accounts to also link them to their Researcher Metadata profile. This ensures that if we are later
able to use the centralized access tokens, then one of those tokens will exist for every user who has linked
their ORCID record to their Researcher Metadata profile. It also gives users one more incentive to link their
ORCID record to their Penn State Access Account (which we want them to do regardless).

## RMD and Open Access
Penn State has adopted a [policy](https://openaccess.psu.edu/) requiring most research published by faculty
at the University to be made freely available to the public. Part of the purpose of RMD is to support this
policy by encouraging faculty members to comply and by helping to provide them with the means and guidance
to do so. RMD is capable of determining (based on its own data) which faculty members have authored publications
that are subject to the policy and that may not be in compliance, and it's capable of sending a notification by
email to each of these faculty members with a list of the publications that may require action. This list of
publications links back to the user interface in RMD that faculty use to manage their profile preferences. This
interface shows a list of all of the user's publications and the open access status of each. For each publication
that has an unknown open access status, there are several actions that a user may take in order to comply with the
policy including:
1. providing a URL to an open access version of their work that has already been published on the web
1. using RMD's built-in ScholarSphere deposit workflow to upload an open access version of their work to ScholarSphere
1. visiting [ScholarSphere](https://scholarsphere.psu.edu/) and depositing an open access version of their work
1. submitting a waiver if an open access version of their work cannot be published for some reason

Before notifying users in this way, we attempt to obtain as much open access status information as possible so
that we're not bothering people about publications that are already open access. To do this, we use the DOIs that
we have imported from various sources to query Open Access Button and Unpaywall and attempt to obtain a URL for an existing
open access version of the work. The task for sending the notifications is `rake email_notifications:send_all_open_access_reminders`.
As the name indicates, it sends an email to *every* user who meets the criteria for receiving one. The task for sending emails with
a maximum limit on how many can be sent is `rake email_notifications:send_capped_open_access_reminders[number]`. This task is run
weekly on Monday at 3:00 AM. Ideally, we want to have collected as much publication data as possible throughout the week. Then,
before the emails go out, any new publications have their open access locations checked with Open Access Button
and Unpaywall and are auto-merged if possible. All of this is currently automated with cron jobs.
There is also a task for sending a test email containing mock data to a specific email address for the
purposes of testing/demonstration: `rake email_notifications:test_open_access_reminder[[email protected]]`.
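
For example (the cap of 100 is an arbitrary placeholder, and the brackets may need quoting depending on your shell):

```
# Send reminder emails to at most 100 eligible users (the capped task is what cron runs on Mondays at 3:00 AM)
bundle exec rake "email_notifications:send_capped_open_access_reminders[100]"

# Send a reminder email to every eligible user -- use with care
bundle exec rake email_notifications:send_all_open_access_reminders
```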

For each user, we record the time when they were last sent an open access reminder email, and for each publication
we also record this time in the user's authorship record so that we know which publications we've already
reminded the user about. The logic for determining who, out of a given set of users, should receive an email
includes a check of this timestamp so that we never email the same person more often than once every six months
regardless of how often we run the emailing task.

If a user takes any of the possible actions to comply with the open access policy for a publication, then the
result will be recorded in RMD, the open access status of the publication will change, and the user will no
longer be notified about the publication.

### Activity Insight Open Access Post-Print Publication Files

While our open access reminder emails do not direct users here, users also have the option to upload an open access
file to Activity Insight. These files are imported into RMD during the regular Activity Insight import. A handful of
automated processes are used to help determine whether the file and its associated metadata are fit for deposit to ScholarSphere.
A workflow interface complements these processes to allow admins to do manual work where automation was not successful. The
files are deposited in ScholarSphere at the end of this workflow.

## ScholarSphere Open Access Deposit Workflow

Users can upload their open access publications within RMD's profile section from the ScholarSphere deposit workflow.
There are several steps to this process:

1. The user uploads their files to be sent to ScholarSphere
1. RMD analyzes those files to try to determine if the publication uploaded is a Published or Accepted version
1. The user can confirm which version the uploaded publication is
1. RMD then tries to retrieve permissions data for the publication from Open Access Button
1. The user is presented with a form whose fields are prefilled using the permissions data and data stored in RMD, to be reviewed and/or edited
1. The user submits the form and the files and metadata are sent to ScholarSphere

## Dependencies
This application requires PostgreSQL for a data store, and it has been tested with PostgreSQL 9.5 and 10.10. Some functionality requires the [pg_trgm module](https://www.postgresql.org/docs/9.6/pgtrgm.html) to be enabled by running `CREATE EXTENSION pg_trgm;` as the PostgreSQL superuser for the application's database.
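
For example, assuming a database named `rmd_development` (adjust the database name and connection options for your environment):

```
# Enable the pg_trgm extension as the PostgreSQL superuser
psql -U postgres -d rmd_development -c 'CREATE EXTENSION IF NOT EXISTS pg_trgm;'
```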

## Docker Compose Local Setup
```
cp .env.example .env
docker compose build
docker compose up
```

---

This project was developed by The Pennsylvania State University Libraries Digital Scholarship and Repository Development team in collaboration with [West Arete](https://westarete.com).