https://github.com/microbiomedata/sample-annotator
NMDC Sample Annotator
https://github.com/microbiomedata/sample-annotator
annotator environment-ontology environmentontology envo linkml microbiome microbiomedata nmdc ontologies sample-metadata samples
Last synced: 6 months ago
JSON representation
NMDC Sample Annotator
- Host: GitHub
- URL: https://github.com/microbiomedata/sample-annotator
- Owner: microbiomedata
- Created: 2021-07-10T02:39:17.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2025-03-14T13:52:49.000Z (7 months ago)
- Last Synced: 2025-03-27T12:07:23.775Z (6 months ago)
- Topics: annotator, environment-ontology, environmentontology, envo, linkml, microbiome, microbiomedata, nmdc, ontologies, sample-metadata, samples
- Language: Python
- Homepage: https://microbiomedata.github.io/sample-annotator/static/intro.html
- Size: 7.27 MB
- Stars: 5
- Watchers: 3
- Forks: 9
- Open Issues: 33
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
Awesome Lists containing this project
README
[](https://microbiomedata.github.io/sample-annotator/static/intro.html)
# NMDC Sample Annotator API
## Installing
## Command Line
```bash
poetry run annotate-sample -R examples/outputs/report.tsv examples/gold.json
```### Details:
```shell
poetry run annotate-sample --help
```````Usage: annotate-sample [OPTIONS] SAMPLEFILE
Annotate a file of samples, producing a "repaired"/enhanced sample file as
output, together with a reportThe input file must be a JSON fine containing an array of dicts
Options:
-v, --validateonly / -g, --generate
Just validate / generate output (default:
generate)
-s, --output TEXT JSON for tidied samples
-R, --report-file TEXT report file
-G, --googlemaps-api-key-path TEXT
path to file containing google maps API KEY
-B, --bioportal-api-key-path TEXT
path to file containing bioportal API KEY
--help Show this message and exit.
```## What is it?
This is a python and flask API for performing annotation of samples from semi-structured or untidy data
The API takes as input a JSON object or dictionary representing a simple sample, where each key is a metadata field
It will attempt to tidy and infer missing data according to a specified schema (currently MIxS)
### If you have Google Map credentials:
```bash
poetry run annotate-sample -G config/googlemaps-api-key.txt -R examples/report.tsv examples/gold.json
```This will transform input such as:
```json
[
{
"id": "gold:Gb0108335",
"community": "microbial communities",
"depth": "0.0 m",
"ecosystem": "Environmental",
"ecosystem_category": "Terrestrial",
"ecosystem_subtype": "Wetlands",
"ecosystem_type": "Soil",
"env_broad_scale": "ENVO:00000446",
"env_local_scale": "ENVO:00000489",
"env_medium": "ENVO:00000134",
"geo_loc_name": "Sweden: Kiruna",
"habitat": "Thawing permafrost",
"identifier": "studying carbon transformations",
"lat_lon": "68.3534 19.0472",
"location": "from the Arctic",
"mod_date": "15-MAY-20 10.04.19.473000000 AM",
"name": "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D",
"ncbi_taxonomy_name": "permafrost metagenome",
"sample_collection_site": "Palsa",
"specific_ecosystem": "Permafrost",
"study_description": "A fundamental challenge of microbial environmental science is to understand how earth systems will respond to climate change. A parallel challenge in biology is to unverstand how information encoded in organismal genes manifests as biogeochemical processes at ecosystem-to-global scales. These grand challenges intersect in the need to understand the glocal carbon (C) cycle, which is both mediated by biological processes and a key driver of climate through the greenhouse gases carbon dioxide (CO2) and methane (CH4). A key aspect of these challenges is the C cycle implications of the predicted dramatic shrinkage in northern permafrost in the coming century.",
"type": "nmdc:Biosample"
}
]
```into:
```json
[
{
"id": "gold:Gb0108335",
"community": "microbial communities",
"depth": {
"has_numeric_value": 0.0,
"has_raw_value": "0.0 m",
"has_unit": "metre"
},
"ecosystem": "Environmental",
"ecosystem_category": "Terrestrial",
"ecosystem_subtype": "Wetlands",
"ecosystem_type": "Soil",
"elev": {
"has_numeric_value": 359,
"has_unit": "meter"
},
"env_broad_scale": "ENVO:00000446",
"env_local_scale": "ENVO:00000489",
"env_medium": "ENVO:00000134",
"geo_loc_name": "Sweden: Kiruna",
"habitat": "Thawing permafrost",
"identifier": "studying carbon transformations",
"lat_lon": {
"latitude": 68.3534,
"longitude": 19.0472
},
"location": "from the Arctic",
"mod_date": "15-MAY-20 10.04.19.473000000 AM",
"name": "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D",
"ncbi_taxonomy_name": "permafrost metagenome",
"sample_collection_site": "Palsa",
"specific_ecosystem": "Permafrost",
"study_description": "A fundamental challenge of microbial environmental science is to understand how earth systems will respond to climate change. A parallel challenge in biology is to unverstand how information encoded in organismal genes manifests as biogeochemical processes at ecosystem-to-global scales. These grand challenges intersect in the need to understand the glocal carbon (C) cycle, which is both mediated by biological processes and a key driver of climate through the greenhouse gases carbon dioxide (CO2) and methane (CH4). A key aspect of these challenges is the C cycle implications of the predicted dramatic shrinkage in northern permafrost in the coming century."
}
]
```Differences between input and output:
* measurement fields are normalized
* information inferred from lat_lon (currently only `elev`)
* TODO: ENVO from text mining
* TODO: annotation sufficiency score
* TODO: more...### Validation reports
These are created as report objects, and exported to pandas dataframes for basic statistical aggregation. See tests for
detailsExample report:
| description | severity | field | was_repaired | category |
|-------------------------------------------------------|----------|-----------------------|-----------------------|----------|
| No package specified | 1 ||| Category.MissingCore |
| No checklist specified | 1 ||| Category.Unclassified |
| Key not underscored: total particulate carbon | 1 || True | Category.Unclassified |
| Invalid field: id | 1 ||| Category.UnknownField |
| Alias used: total_particulate_carbon => tot_part_carb | 1 || True | Category.Unclassified |
| Parsed unit-value: 2.0 metre | 1 ||| Category.Unclassified |
| Missing unit 5 | 1 ||| Category.Unclassified |
| Skipping geo-checks | 0 ||| Category.Unclassified |## API Docs
TODO: readthedocs
## Testing
Currently the best way to understand this code is to understand the tests
* [tests](tests)
* [inputs](tests/inputs)
* [test_sample_info.yaml](tests/inputs/test_sample_info.yaml)This contains 'fake' samples that are intended to test validation and repair
## Schema Validation
See the [schema](sample_annotator/model/schema) folder -- this contains a copy of the LinkML rendering of the MIxS
schema from [mixs-source](https://github.com/cmungall/mixs-source) which will later be integrated by GSC## Modules
* [geo](sample_annotator/geolocation)
- currently this requires a googlemaps API key
- TODO: rewrite to use ORNL Identify
* [measurements](sample_annotator/measurements)
- uses quantulum
- TODO: use http://units.ontodev.com/
* [text mining](sample_annotator/text_mining)
- basic repair
- NER using fields such as study fields
* [ontology](sample_annotator/ontology)
- LinkML enumerations to ontologiesEach module will take care of different aspects
For example, the measurement module will normalized all fields in the schema with range QuantityValue
E.g. Input:
```yaml
sample:
id: TEST:1
alt: 2m
...
```Repair Output:
```yaml
sample:
id: TEST:1
alt:
has_numeric_value: 2.0
has_raw_value: 2m
has_unit: metre
...
```## Starting the web API
- TODO: write flask code