https://github.com/nci-gdc/gdcdictionary

Data dictionary for the GDC

[![Build Status](https://travis-ci.com/NCI-GDC/gdcdictionary.svg?branch=master)](https://travis-ci.com/NCI-GDC/gdcdictionary)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)

---
# GDC Data Dictionary

The Genomic Data Commons’ (GDC) data dictionary provides the first
level of validation for all data stored in and generated by the
GDC. JSON schemas define all the individual entities (nodes) in the
GDC data model. Moreover, these schemas define all of the
relationships (links) between the nodes. Finally, the schemas define
the valid key-value pairs that can be used to describe the nodes.

- [GDC Data Dictionary](#gdc-data-dictionary)
  - [GDC Data Dictionary Structure](#gdc-data-dictionary-structure)
  - [Node Properties and Examples](#node-properties-and-examples)
  - [Dictionary Changes](#dictionary-changes)
    - [Breaking Changes](#breaking-changes)
    - [Entity Relation Additions](#entity-relation-additions)
    - [Schema Additions](#schema-additions)
    - [Cosmetic Corrections](#cosmetic-corrections)
    - [Testing](#testing)
    - [Versioning](#versioning)
  - [Gdcdatamodel2 auto generation](#gdcdatamodel2-auto-generation)
  - [Setup pre-commit hook to check for secrets](#setup-pre-commit-hook-to-check-for-secrets)
  - [Contributing](#contributing)

## GDC Data Dictionary Structure

The GDC Data Model covers all of the nodes within the GDC as well as the relationships between
the different types of nodes. All of the nodes in the data model are strongly typed and individually
defined for a specific data type. For example, submitted files can come in two different forms,
aligned or unaligned; within the model we have two separately defined nodes for
`Submitted Unaligned Reads` and `Submitted Aligned Reads`. Doing so allows for faster querying of
the data model and provides a clear and concise representation of the data in the GDC.

Beyond node type, there are also a number of GDC extensions used to further define the nodes within
the data model. Nodes are grouped into categories that represent broad roles for the node, such
as `analysis` or `biospecimen`. Additionally, nodes are defined within their `Program` or `Project`
and have descriptions of their use. All nodes also have a series of `systemProperties`; these
properties are filled in automatically by the system unless otherwise defined by
the user. These basic properties define the node itself, but the node still needs to be placed into the model.

The model itself is represented as a graph. Each schema defines `links`; these links
point from child to parent, with Program being the root of the graph. The links also contain a
`backref` that allows a parent to point to a child. Other features of the link include a
semantic `label` that describes the relationship between the two nodes, a `multiplicity` property
that describes the numeric relationship from the child to the parent, and a `required` property
that defines whether a node must have that link. Taken together, the nodes and links create the
directed graph of the GDC Data Model.
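
As a rough illustration of the link structure described above, the sketch below models a few links as plain Python dictionaries and follows them from a child node up to the root. The `case` and `sample` node names, the `target_type` key, and every field value are assumptions made for illustration, and each node is given only a single link for simplicity; the schema files remain the authoritative definitions.

```python
# Hypothetical link definitions, keyed by child node type.
# backref, label, multiplicity and required are the link features described
# above; "target_type" and all values are illustrative assumptions.
LINKS = {
    "project": {"target_type": "program", "backref": "projects",
                "label": "member_of", "multiplicity": "many_to_one",
                "required": True},
    "case": {"target_type": "project", "backref": "cases",
             "label": "member_of", "multiplicity": "many_to_one",
             "required": True},
    "sample": {"target_type": "case", "backref": "samples",
               "label": "derived_from", "multiplicity": "many_to_one",
               "required": True},
}


def path_to_root(node_type, links=LINKS):
    """Follow child-to-parent links until a node with no parent is reached."""
    path = [node_type]
    while path[-1] in links:
        path.append(links[path[-1]]["target_type"])
    return path


def children_of(node_type, links=LINKS):
    """Map each backref name on a parent to the child type it points to."""
    return {link["backref"]: child
            for child, link in links.items()
            if link["target_type"] == node_type}


print(path_to_root("sample"))
print(children_of("project"))
```

Links are stored on the child, so walking towards the root is a direct lookup, while going from parent to child relies on the `backref` names.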

## Node Properties and Examples

Each node contains a series of potential key-value pairs (`properties`) that can be used to
characterize the data they represent. Some properties are categorized as `required` or `preferred`.
If a submission lacks a required property, it cannot be accepted. Preferred properties can denote
two things: the property is being highlighted as it has become more desired by the community or
the property is being promoted to required. All properties not designated either `required` or
`preferred` are still sought by GDC, but submissions without them are allowed.

The properties have further validation through their entries. Legal values are defined in each
property. For the most part these are represented in the `enum` categories although some keys,
such as `submitter_id`, will allow any string value as a valid entry. Other numeric properties
can have maximum and minimum values to limit valid entries. For examples of what a valid entry
would look like, each node has a mock submission located in the `examples/valid/` directory.
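
The checks described above can be sketched in a few lines of Python. This is a minimal illustration, not the validation code the GDC uses: the schema below is hypothetical (the property names other than `submitter_id`, the `enum` members, and the numeric bounds are all invented), and only `required`, `enum`, `minimum` and `maximum` are handled.

```python
# Hypothetical, heavily simplified schema for illustration only.
SCHEMA = {
    "required": ["submitter_id", "sample_type"],
    "properties": {
        "submitter_id": {"type": "string"},  # any string is accepted
        "sample_type": {"enum": ["Primary Tumor", "Blood Derived Normal"]},
        "days_to_collection": {"type": "integer", "minimum": 0, "maximum": 32872},
    },
}


def validate(record, schema=SCHEMA):
    """Return a list of error messages; an empty list means the record passed."""
    errors = []
    for key in schema["required"]:
        if key not in record:
            errors.append(f"missing required property: {key}")
    for key, value in record.items():
        rules = schema["properties"].get(key, {})
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{key}: {value!r} is not an allowed value")
        # Range checks only make sense for numeric values.
        if isinstance(value, (int, float)):
            if "minimum" in rules and value < rules["minimum"]:
                errors.append(f"{key}: {value} is below the minimum")
            if "maximum" in rules and value > rules["maximum"]:
                errors.append(f"{key}: {value} is above the maximum")
    return errors


print(validate({"submitter_id": "sample-001", "sample_type": "Primary Tumor"}))
print(validate({"sample_type": "Unknown", "days_to_collection": -1}))
```

Returning a list of messages rather than raising on the first failure lets a submitter see every problem with a record at once.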

## Dictionary Changes

The following is an attempt to lay out guidelines for the level of
impact of changes to the dictionary by categorizing them into
**Breaking Changes**, **Entity Relation Additions**, **Schema Additions**,
and **Cosmetic Corrections**.

### Breaking Changes

Breaking changes are changes to the dictionary such that previously
allowable data is invalid against the new schema, e.g. a **removal** of
part of the dictionary.

N.B. Not all changes classified here as Breaking Changes are
guaranteed to require a data migration. It is possible that no data
exists in the GDC that is invalidated by the change, e.g. making a
field required that has never been left blank. This should be
confirmed against the corpus of data, and the user base should be
notified of a break in backwards compatibility.

**Breaking Changes include**:
- Renaming/removing anything that is not a description or comment
- Removing an entity schema
- Removing a property's allowed `type`
- Removing a property's allowed `enum` value
- Changing an entity's `category`
- Changing an entity's `unique_keys`
- Changing an entity's `links`, including `label`, `backref`
- Removing a property from an entity schema
- Changing existence requirements
- Adding a property to the `required` list
- Changing link `required` from `false` to `true`
- Changing link `multiplicity` from `one_to_many` or `many_to_one` to `one_to_one`
- Changing link subgroup exclusivity from `false` to `true`
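
A few of the cases listed above can be detected mechanically by comparing two versions of a schema. The sketch below is a hypothetical helper, not part of this repository: it assumes schemas have been reduced to dictionaries with `properties` and `required` keys, and it only looks for removed properties, removed `enum` values, and newly required properties.

```python
def find_breaking_changes(old, new):
    """Compare two simplified schema dicts and report some breaking changes."""
    problems = []
    old_props, new_props = old["properties"], new["properties"]
    for name, old_rules in old_props.items():
        if name not in new_props:
            problems.append(f"property removed: {name}")
            continue
        new_rules = new_props[name]
        # Only compare enums when both versions restrict the value.
        if "enum" in old_rules and "enum" in new_rules:
            for value in sorted(set(old_rules["enum"]) - set(new_rules["enum"])):
                problems.append(f"enum value removed from {name}: {value}")
    newly_required = set(new.get("required", [])) - set(old.get("required", []))
    for name in sorted(newly_required):
        problems.append(f"property newly required: {name}")
    return problems


# Hypothetical before/after schemas used only to exercise the helper.
OLD = {
    "required": ["submitter_id"],
    "properties": {"submitter_id": {}, "state": {"enum": ["a", "b"]}, "notes": {}},
}
NEW = {
    "required": ["submitter_id", "state"],
    "properties": {"submitter_id": {}, "state": {"enum": ["a"]}},
}

for problem in find_breaking_changes(OLD, NEW):
    print(problem)
```

An empty result does not prove that a change is safe, since link, `category`, and `unique_keys` changes are not inspected here.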

**Handling breaking changes**:

Sometimes it may be best to introduce necessary breaking changes
incrementally. Given State A and State B, which are
incompatible, if you can create a State AB that is compatible with
both, you can upgrade to State AB without breaking changes, update
data to be compliant with State B, then upgrade to State B.

1. State A is deployed
2. Upgrade to State AB
3. Update data while State AB is deployed to be valid under State B
4. Upgrade to State B

An example could be: _Introduce required property `color`_:

1. Property `color` does not exist
2. Deploy schema that allows but does not require `color`
3. Add color to all records
4. Deploy schema that requires `color`
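
The `color` example above can be traced with a toy model of the three states. This is only a sketch under stated assumptions: the schemas are reduced to lists of known and required property names, the `name` property and the record contents are invented, and a record with an unknown property is assumed to be rejected.

```python
def is_valid(record, schema):
    """Valid means every required key is present and no unknown key is used."""
    has_required = all(key in record for key in schema["required"])
    only_known = all(key in schema["properties"] for key in record)
    return has_required and only_known


# Hypothetical schema states for introducing a required `color` property.
STATE_A = {"properties": ["name"], "required": ["name"]}
STATE_AB = {"properties": ["name", "color"], "required": ["name"]}
STATE_B = {"properties": ["name", "color"], "required": ["name", "color"]}

old_record = {"name": "record-1"}
migrated_record = {"name": "record-1", "color": "red"}

for label, schema in [("A", STATE_A), ("AB", STATE_AB), ("B", STATE_B)]:
    print(label, is_valid(old_record, schema), is_valid(migrated_record, schema))
```

State AB accepts both the old and the migrated record, which is what allows the data to be updated while it is deployed.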

### Entity Relation Additions

Additions to the dictionary that create entities or add links between
entities should not be considered breaking changes; however, they
should be carefully considered in the context of downstream effects.

**Entity Relation Additions include**:
- Adding a new entity schema
- Adding a new link between entities

**Entity Relation downstream effects**:
- The GDC will have to update the database schema
- Users should be notified of additions

### Schema Additions

The GDC is set up so that strict additions to properties have minimal
impact on existing data.

**Schema Additions include**:
- New properties
- New allowed types for properties
- New allowed `enum` members for properties

**Schema Addition downstream effects**:
- Users should be notified of additions

### Cosmetic Corrections

Cosmetic corrections are changes that have little to no behavioral
effects.

**Cosmetic Corrections include**:
- Changes to terms
- Changes to documentation
- Schema formatting changes

**Cosmetic Correction downstream effects**:
- No large impacts

### Testing

Tests are run automatically on Travis CI when a pull request is opened.
If you would like to test locally, the tests are run via [tox](https://tox.readthedocs.io/en/latest/).

### Versioning

The GDC Dictionary should
follow [Semantic Versioning](http://semver.org/) by updating the
version line in the `setup.py` file to `MAJOR.MINOR.PATCH` accordingly:

1. MAJOR: version when you make incompatible API changes: **Breaking Changes**
   - e.g. 1.2.4 -> 2.0.0
2. MINOR: version when you add functionality in a backwards-compatible manner: **Entity Relation Additions**, **Schema Additions**
   - e.g. 1.2.4 -> 1.3.0
3. PATCH: version when you make backwards-compatible bug fixes: **Cosmetic Corrections**
   - e.g. 1.2.4 -> 1.2.5
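
The mapping from change category to version bump can be written down as a small helper. This is a sketch only; the function and the category keys are hypothetical names, not part of this repository's tooling.

```python
# Which SemVer component each change category bumps, per the list above.
BUMP_FOR_CATEGORY = {
    "breaking_change": "major",
    "entity_relation_addition": "minor",
    "schema_addition": "minor",
    "cosmetic_correction": "patch",
}


def bump_version(version, category):
    """Return the next MAJOR.MINOR.PATCH string for a given change category."""
    major, minor, patch = (int(part) for part in version.split("."))
    level = BUMP_FOR_CATEGORY[category]
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(bump_version("1.2.4", "breaking_change"))      # 2.0.0
print(bump_version("1.2.4", "schema_addition"))      # 1.3.0
print(bump_version("1.2.4", "cosmetic_correction"))  # 1.2.5
```

When a release contains changes from several categories, the highest-impact category determines the bump.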

## Gdcdatamodel2 auto generation

Gdcdatamodel2 should be generated automatically on GitLab for each commit pushed.
The generated Python artifact should be in
https://nexus.osdc.io/#browse/browse:pypi-snapshots:gdcdatamodel2

The GitLab pipeline also automatically pushes a new branch to gdcdatamodel2 on GitHub.

If you want, you can also manually run the pipeline to generate a new version.
1. Go to https://gitlab.datacommons.io/nci-gdc/development/gdcdictionary/-/pipelines/new
2. Select the branch/tag of gdcdictionary you want to use (default: `develop`).
3. (Optional) The generated version of gdcdatamodel2 is based on the branch/tag you
selected in the previous step. If you want to generate from a different branch, change
`GDCDICTIONARY_TARGET_VERSION_OVERRIDE` in the variables.
4. Click the `Run Pipeline` button.

## Setup pre-commit hook to check for secrets

We use [pre-commit](https://pre-commit.com/) to set up pre-commit hooks for this repo.
We use [detect-secrets](https://github.com/Yelp/detect-secrets) to search for secrets being committed into the repo.

To install the pre-commit hook, run
```
pre-commit install
```

To update the .secrets.baseline file run
```
detect-secrets scan --update .secrets.baseline
```

`.secrets.baseline` contains all the strings that were caught by detect-secrets but are not stored in plain text. Audit the baseline to view the secrets:

```
detect-secrets audit .secrets.baseline
```

## Contributing

Read how to contribute [here](https://github.com/NCI-GDC/gdcdictionary/blob/develop/CONTRIBUTING.md).