{"id":22727114,"url":"https://github.com/microbiomedata/sample-annotator","last_synced_at":"2025-04-13T21:43:07.150Z","repository":{"id":37897187,"uuid":"384594767","full_name":"microbiomedata/sample-annotator","owner":"microbiomedata","description":"NMDC Sample Annotator","archived":false,"fork":false,"pushed_at":"2025-03-14T13:52:49.000Z","size":7628,"stargazers_count":5,"open_issues_count":33,"forks_count":9,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-27T12:07:23.775Z","etag":null,"topics":["annotator","environment-ontology","environmentontology","envo","linkml","microbiome","microbiomedata","nmdc","ontologies","sample-metadata","samples"],"latest_commit_sha":null,"homepage":"https://microbiomedata.github.io/sample-annotator/static/intro.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microbiomedata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-10T02:39:17.000Z","updated_at":"2025-03-14T13:52:49.000Z","dependencies_parsed_at":"2025-01-14T16:03:52.231Z","dependency_job_id":"6f3594b3-58e0-4c84-8feb-4f8cd5e151bb","html_url":"https://github.com/microbiomedata/sample-annotator","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microbiomedata%2Fsample-annotator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microbiomedata%2Fsample-annotator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/microbiomedata%2Fsample-annotator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microbiomedata%2Fsample-annotator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microbiomedata","download_url":"https://codeload.github.com/microbiomedata/sample-annotator/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248788867,"owners_count":21161726,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotator","environment-ontology","environmentontology","envo","linkml","microbiome","microbiomedata","nmdc","ontologies","sample-metadata","samples"],"created_at":"2024-12-10T17:09:50.350Z","updated_at":"2025-04-13T21:43:07.123Z","avatar_url":"https://github.com/microbiomedata.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Documentation Status](https://readthedocs.org/projects/ansicolortags/badge/?version=latest)](https://microbiomedata.github.io/sample-annotator/static/intro.html)\n\n# NMDC Sample Annotator API\n\n## Installing\n\n## Command Line\n\n```bash\npoetry run annotate-sample -R examples/outputs/report.tsv examples/gold.json\n```\n\n### Details:\n\n```shell\npoetry run annotate-sample --help\n```\n\n```\nUsage: annotate-sample [OPTIONS] SAMPLEFILE\n\n  Annotate a file of samples, producing a \"repaired\"/enhanced sample file as\n  output, together with a report\n\n  The input file must be a JSON fine containing an array of dicts\n\nOptions:\n  -v, --validateonly / -g, --generate\n                             
     Just validate / generate output (default:\n                                  generate)\n  -s, --output TEXT               JSON for tidied samples\n  -R, --report-file TEXT          report file\n  -G, --googlemaps-api-key-path TEXT\n                                  path to file containing google maps API KEY\n  -B, --bioportal-api-key-path TEXT\n                                  path to file containing bioportal API KEY\n  --help                          Show this message and exit.\n```\n\n## What is it?\n\nThis is a Python and Flask API for annotating samples from semi-structured or untidy data.\n\nThe API takes as input a JSON object or dictionary representing a simple sample, where each key is a metadata field.\n\nIt attempts to tidy the data and infer missing values according to a specified schema (currently MIxS).\n\n### If you have Google Maps credentials:\n\n```bash\npoetry run annotate-sample -G config/googlemaps-api-key.txt -R examples/report.tsv examples/gold.json\n```\n\nThis will transform input such as:\n\n```json\n[\n  {\n    \"id\": \"gold:Gb0108335\",\n    \"community\": \"microbial communities\",\n    \"depth\": \"0.0 m\",\n    \"ecosystem\": \"Environmental\",\n    \"ecosystem_category\": \"Terrestrial\",\n    \"ecosystem_subtype\": \"Wetlands\",\n    \"ecosystem_type\": \"Soil\",\n    \"env_broad_scale\": \"ENVO:00000446\",\n    \"env_local_scale\": \"ENVO:00000489\",\n    \"env_medium\": \"ENVO:00000134\",\n    \"geo_loc_name\": \"Sweden: Kiruna\",\n    \"habitat\": \"Thawing permafrost\",\n    \"identifier\": \"studying carbon transformations\",\n    \"lat_lon\": \"68.3534 19.0472\",\n    \"location\": \"from the Arctic\",\n    \"mod_date\": \"15-MAY-20 10.04.19.473000000 AM\",\n    \"name\": \"Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D\",\n    \"ncbi_taxonomy_name\": \"permafrost metagenome\",\n    \"sample_collection_site\": \"Palsa\",\n    \"specific_ecosystem\": 
\"Permafrost\",\n    \"study_description\": \"A fundamental challenge of microbial environmental science is to understand how earth systems will respond to climate change. A parallel challenge in biology is to unverstand how information encoded in organismal genes manifests as biogeochemical processes at ecosystem-to-global scales. These grand challenges intersect in the need to understand the glocal carbon (C) cycle, which is both mediated by biological processes and a key driver of climate through the greenhouse gases carbon dioxide (CO2) and methane (CH4). A key aspect of these challenges is the C cycle implications of the predicted dramatic shrinkage in northern permafrost in the coming century.\",\n    \"type\": \"nmdc:Biosample\"\n  }\n]\n```\n\ninto:\n\n```json\n[\n  {\n    \"id\": \"gold:Gb0108335\",\n    \"community\": \"microbial communities\",\n    \"depth\": {\n      \"has_numeric_value\": 0.0,\n      \"has_raw_value\": \"0.0 m\",\n      \"has_unit\": \"metre\"\n    },\n    \"ecosystem\": \"Environmental\",\n    \"ecosystem_category\": \"Terrestrial\",\n    \"ecosystem_subtype\": \"Wetlands\",\n    \"ecosystem_type\": \"Soil\",\n    \"elev\": {\n      \"has_numeric_value\": 359,\n      \"has_unit\": \"meter\"\n    },\n    \"env_broad_scale\": \"ENVO:00000446\",\n    \"env_local_scale\": \"ENVO:00000489\",\n    \"env_medium\": \"ENVO:00000134\",\n    \"geo_loc_name\": \"Sweden: Kiruna\",\n    \"habitat\": \"Thawing permafrost\",\n    \"identifier\": \"studying carbon transformations\",\n    \"lat_lon\": {\n      \"latitude\": 68.3534,\n      \"longitude\": 19.0472\n    },\n    \"location\": \"from the Arctic\",\n    \"mod_date\": \"15-MAY-20 10.04.19.473000000 AM\",\n    \"name\": \"Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D\",\n    \"ncbi_taxonomy_name\": \"permafrost metagenome\",\n    \"sample_collection_site\": \"Palsa\",\n    \"specific_ecosystem\": \"Permafrost\",\n    
\"study_description\": \"A fundamental challenge of microbial environmental science is to understand how earth systems will respond to climate change. A parallel challenge in biology is to unverstand how information encoded in organismal genes manifests as biogeochemical processes at ecosystem-to-global scales. These grand challenges intersect in the need to understand the glocal carbon (C) cycle, which is both mediated by biological processes and a key driver of climate through the greenhouse gases carbon dioxide (CO2) and methane (CH4). A key aspect of these challenges is the C cycle implications of the predicted dramatic shrinkage in northern permafrost in the coming century.\"\n  }\n]\n```\n\nDifferences between input and output:\n\n* measurement fields are normalized\n* information inferred from lat_lon (currently only `elev`)\n* TODO: ENVO from text mining\n* TODO: annotation sufficiency score\n* TODO: more...\n\n### Validation reports\n\nThese are created as report objects, and exported to pandas dataframes for basic statistical aggregation. 
See tests for\ndetails\n\nExample report:\n\n| description                                           | severity | field                 | was_repaired          | category |\n|-------------------------------------------------------|----------|-----------------------|-----------------------|----------|\n| No package specified                                  | 1        ||| Category.MissingCore  |\n| No checklist specified                                | 1        ||| Category.Unclassified |\n| Key not underscored: total particulate carbon         | 1        || True                  | Category.Unclassified |\n| Invalid field: id                                     | 1        ||| Category.UnknownField |\n| Alias used: total_particulate_carbon =\u003e tot_part_carb | 1        || True                  | Category.Unclassified |\n| Parsed unit-value: 2.0 metre                          | 1        ||| Category.Unclassified |\n| Missing unit 5                                        | 1        ||| Category.Unclassified |\n| Skipping geo-checks                                   | 0        ||| Category.Unclassified |\n\n## API Docs\n\nTODO: readthedocs\n\n## Testing\n\nCurrently the best way to understand this code is to understand the tests\n\n* [tests](tests)\n    * [inputs](tests/inputs)\n        * [test_sample_info.yaml](tests/inputs/test_sample_info.yaml)\n\nThis contains 'fake' samples that are intended to test validation and repair\n\n## Schema Validation\n\nSee the [schema](sample_annotator/model/schema) folder -- this contains a copy of the LinkML rendering of the MIxS\nschema from [mixs-source](https://github.com/cmungall/mixs-source) which will later be integrated by GSC\n\n## Modules\n\n* [geo](sample_annotator/geolocation)\n    - currently this requires a googlemaps API key\n    - TODO: rewrite to use ORNL Identify\n* [measurements](sample_annotator/measurements)\n    - uses quantulum\n    - TODO: use http://units.ontodev.com/\n* [text 
mining](sample_annotator/text_mining)\n    - basic repair\n    - NER using fields such as study fields\n* [ontology](sample_annotator/ontology)\n    - LinkML enumerations to ontologies\n\nEach module handles a different aspect of annotation.\n\nFor example, the measurements module normalizes all fields in the schema whose range is QuantityValue.\n\nE.g. Input:\n\n```yaml\nsample:\n  id: TEST:1\n  alt: 2m\n  ...\n```\n\nRepair Output:\n\n```yaml\nsample:\n  id: TEST:1\n  alt:\n    has_numeric_value: 2.0\n    has_raw_value: 2m\n    has_unit: metre\n    ...\n```\n\n## Starting the web API\n\n- TODO: write flask code\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrobiomedata%2Fsample-annotator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrobiomedata%2Fsample-annotator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrobiomedata%2Fsample-annotator/lists"}