An open API service indexing awesome lists of open source software.

https://github.com/lyrasis/cdmtools

Grab data and objects from CONTENTdm API, with an eye toward preparing it for migration into Islandora
https://github.com/lyrasis/cdmtools

contentdm contentdm-api data-migration metadata

Last synced: about 1 month ago
JSON representation

Grab data and objects from CONTENTdm API, with an eye toward preparing it for migration into Islandora

Awesome Lists containing this project

README

          

= Cdmtools

Set of commands/tools to grab metadata and objects from a CONTENTdm instance via https://www.oclc.org/support/services/contentdm/help/customizing-website-help/other-customizations/contentdm-api-reference.en.html[the server and web APIs].

These files are organized in such a way that the source-system-agnostic https://github.com/lyrasis/mdmm[MDMM] be used to further prepare them for migration.

== Installation

=== Dependencies

- You will need Ruby
- You will need a recent-ish version of bundler (`gem install bundler` will install if you don't have it, update it if you do)
- You will need https://github.com/mjordan/cdminspect[cdminspect] installed somewhere this application can find it

=== Steps
Clone this repo.

`cd` into the resulting directory

`bundle install`

=== Configuration

*If you will only work on one project and/or don't plan on contributing code back to this repo...*
You can edit `config/config.yaml` in place to set up your project. When you run commands, this default config location will be used.

*If you will be working on multiple projects, need to keep your config(s) in a place where they can be backed up, or you want to avoid contributing your configs back to this repo...*

Copy `config/config.yaml` to your desired location and edit the copy. Specify the path to the desired config when you run a command, like this:

`exe/cdm --settings -c path/to/your/cdm_config.yaml`

The example `config.yaml` included with the repo is heavily commented and intends to be self-documenting.

== Usage

For the available commands:

`exe/cdm help`

For details on exactly what each command does:

`exe/cdm help [COMMAND]`

*This command currently is the best documentation for each step.* Later, I will add some more info to https://github.com/lyrasis/cdmtools/wiki[the repo wiki].

=== Conceptual outline

CDM has collections.

Collections contain objects.

Objects are simple or compound.

Simple objects will have metadata record plus one associated object file.

Compound objects have child objects. The top-level compound object will have a metadata record. Child objects each have a metadata record (usually sparser than the parent record) and one associated object file.

:NOTE:
----
Document-PDF is treated as a compound object in CDM, but if `cdmprintpdf`=1 in the parent record, we treat this as a simple object, with the print pdf file as the object file.
----

`_cdmobjectinfo` contains the original CDM JSON object info downloaded for each compound object.

`_cdmrecords` contains the original CDM JSON metadata downloaded for each object (top level and child mixed together).

`_migrecords` contains the "migration records", or migrecords, generated for each item. Migrecords are the original record, augmented with information in new fields added to support the processing and migration of the data. Do `exe/cdm help get_top_records` to see the details on how migrecords are generated.

`_objects` contains the object files downloaded for each collection.

https://github.com/lyrasis/mdmm[MDMM] expects collection directories containing `_migrecords` and `_objects` directories.

=== Recommended initial order of commands for working with a new CDM instance

==== Data/metadata profiling

- `get_coll_data`
- `get_dc_mappings`
- `get_field_data`

At this point, you can see the number of collections you are working with and the way the metadata fields have been defined for them.

- `get_pointers`
- `get_top_records`

Check your logfile for any errors or warnings at this point and resolve them before continuing, or your problems will just be compounded.

The tool creates a local JSON file for each successful API call, and tries to be pretty good at not making additional API calls to replace info you already have.

This means you can just re-run the above command to try to re-grab any records that could not be retrieved before. It will work fast and only grab what is missing. It also currently means that, if you want to refresh the records from the source, you will need to delete the local copies in `collalias/_cdmrecords`

Use `--force=true` to re-download records and compoundobject info you already retrieved.

- `get_child_records`

Use `--force=true` to re-download records you already retrieved.

Again, check your logfile for warnings or errors after this step

- `report_error_records`

This will write a CSV report of pointers for which a valid/usable record could not be retrieved from the CDM API.

- `delete_error_records`

Optionally, get rid of these records so they don't cause trouble in the rest of the migration. If restrictions are removed or other problems taken care of, you can re-download just the missing records using `get_top_records` and `get_child_records` with `--force=true`. If new objects have been added you'll need to re-run `get_pointers` before re-grabbing records.

- `finalize_migration_records`

Again, check your logfile for warnings or errors.

`report_mig_error_records` and `delete_mig_error_records` work similarly to the aforementioned CDM-record specific commands, but flag objects with issues like unknown islandora content model, or orphaned pdfpage with no associated file.

Run `exe/cdm` to see some other report options

*At this point, you can use https://github.com/lyrasis/mdmm[MDMM] to handle metadata reporting, cleanup, and remapping.*

==== Object data
- `harvest_objects`

Check logfile for errors/warnings after this step.

Simple objects: harvested file size is compared against cdmfilesize and warning is logged if the values do not match

Document-PDF objects: the single PDF is harvested. We don't have that filesize in the CDM record, so best practice will be to validate these objects outside this process

*Use `exe/cdm help` and `exe/cdm help [COMMAND]` to get more details on helper functions for harvesting and working with objects*

== Contributing

Bug reports and pull requests are welcome in https://github.com/lyrasis/cdmtools[the GitHub repo].

== License

The gem is available as open source under the terms of the https://opensource.org/licenses/MIT[MIT License].