{"id":25843088,"url":"https://github.com/incatools/biosample-analysis","last_synced_at":"2026-03-07T03:03:33.478Z","repository":{"id":46427015,"uuid":"286873100","full_name":"INCATools/biosample-analysis","owner":"INCATools","description":"analysis of biosamples in INSDC","archived":false,"fork":false,"pushed_at":"2024-02-13T02:13:22.000Z","size":102329,"stargazers_count":3,"open_issues_count":26,"forks_count":1,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-03T21:12:08.741Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/INCATools.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-11T23:57:58.000Z","updated_at":"2022-01-05T19:25:22.000Z","dependencies_parsed_at":"2025-03-01T06:48:14.216Z","dependency_job_id":null,"html_url":"https://github.com/INCATools/biosample-analysis","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/INCATools/biosample-analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/INCATools%2Fbiosample-analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/INCATools%2Fbiosample-analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/INCATools%2Fbiosample-analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/INCATools%2Fbiosample-analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/INCATools","download_url":"https://codeload.github.com/INCATools/biosample-analysis/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/INCATools%2Fbiosample-analysis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30206339,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T19:07:06.838Z","status":"online","status_checked_at":"2026-03-07T02:00:06.765Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-01T06:38:07.871Z","updated_at":"2026-03-07T03:03:33.461Z","avatar_url":"https://github.com/INCATools.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Biosample analysis\n\nRepo for analysis of biosamples in INSDC\n\nQuestions to explore\n\n - which attributes/properties are used\n - are these conformant to standards?\n    - E.g. are MIxS fields used\n    - Does the range constraint apply?\n - Can we mine ontology terms, e.g. ENVO from text descriptions\n - can we auto-populate metadata fields\n\n# Workflow\n\nSee Makefile for details\n\n# Analysis Data\nIn addition to the data in the target directory, sample data that is too large for GitHub is stored in our Google Drive [here](https://drive.google.com/drive/u/1/folders/1eL0v0stoduahjDpoDJIk3z2pJBAU4b2Y).  
\nFiles include:\n- **biosample_set.xml.gz**  \n  This is the full raw biosample dataset formatted as XML.\n- **harmonized-values-eav.tsv.gz**  \n  A tab-delimited file of data extracted from `biosample_set.xml.gz`, containing each biosample's primary id and only those biosample attributes that have a `harmonized_name` property.\n  The data is in entity-attribute-value ([EAV](https://en.wikipedia.org/wiki/Entity–attribute–value_model)) format. The columns in the file are `accession|attribute|value` (`accession` is the accession number of the biosample).  \n  If necessary, use `make target/harmonized-table.tsv` to create the (non-zipped) file locally.   \n- **harmonized-table.tsv.gz**  \n  A tab-delimited file in which the data from `harmonized-values-eav.tsv.gz` has been \"pivoted\" into a standard tabular format (i.e., the attributes are column headers).\n  If necessary, use `make harmonized-table.tsv` to create the (non-zipped) file locally.   \n- **harmonized-attribute-value.ttl.gz**    \n  A file in which the data from `harmonized-values-eav.tsv.gz` has been transformed into sets of Turtle triples.  \n  If necessary, use `make harmonized-attribute-value.ttl` to create the (non-zipped) file locally.  \n- **harmonized-table.parquet.gz**  \n  A parquet file containing the same contents as `harmonized-table.tsv.gz`. In pandas (imported here as `pds`), load it like this: `df = pds.read_parquet('harmonized-table.parquet.gz')`  \n  You will need to have `pyarrow` installed (i.e., `pip install pyarrow`).  \n  If necessary, use `make target/harmonized-table.parquet.gz` to create the parquet file locally.  \n  Details of how to save the harmonized dataframe in parquet are found in [save-harmonized-table-to-parquet.py](util/save-harmonized-table-to-parquet.py). \n  \n- **harmonized_table.db.gz**     \n  An sqlite database in which the `biosample` table contains the contents of `harmonized-table.tsv.gz`. 
Data is loaded into a pandas dataframe like this:\n  ```\n  import sqlite3\n  import pandas as pds\n\n  con = sqlite3.connect('harmonized_table.db') # connect to the unzipped database\n  df = pds.read_sql('select * from biosample limit 10', con) # test loading 10 records\n  ```\n  **NB:** Loading all records (i.e., `df = pds.read_sql('select * from biosample', con)`) is **VERY** time-consuming and memory-intensive. I gave up after letting the process run for 4 hours.\n  If necessary, use `make target/harmonized_table.db` to create the (non-zipped) sqlite database locally.  \n  Details of how to save the harmonized dataframe in sqlite are found in [save-harmonized-table-to-sqlite.py](util/save-harmonized-table-to-sqlite.py)\n  \n# Related \n\nhttps://github.com/cmungall/metadata_converter\n\nhttps://academic.oup.com/database/article/doi/10.1093/database/bav126/2630130\n\n# Example bad data\n\n## Depth\n\nMIxS specifies this should be `{number} {unit}`\n\nSome example values that do not conform:\n\n - N40.1164_W88.2543\n - 25 santimeters\n - 0 – 20 cm\n - 3.149\n - 30-60cm replicate6\n - 1800, 1800\n - 30ft\n - 5m, 32m, 70m, 110m, 200m, 320m, 1000m\n - Surface soil from deep water\n - 0 m water depth\n - Metamorph4 (19dpf) biological replicate 3\n\n## pH\n\n - pH 7.9\n - 6.0-9.5\n - 8,156\n - NA1\n - 2.75 (orig)\n - 5.11±0.10\n - Missing: Not reported\n - Not collected\n - 7.0-7.5 um\n - Moderately alkaline\n\nNote that the missing-value strings above do not correspond to the terms listed at:\n\nhttps://gensc.org/uncategorized/reporting-missing-values/\n\n## ammonium\n\nShould be `{float} {unit}`\n\n - 0.71 micro molar\n - 14.941\n - -0.024\n - 1.9 g NH4-N L-1\n - Below the deteciton limit (2 microM)\n - 3.09µg/L\n\nUnits vary from 'micro molar' through uM through mg/L\n\n## geo_loc_name\n\nMIxS:\n\n_The geographical origin of the sample as defined by the country or sea name followed by specific region name. 
Country or sea names should be chosen from the INSDC country list (http://insdc.org/country.html), or the GAZ ontology (v 1.512) (http://purl.bioontology.org/ontology/GAZ)_\n\n` {term};{term};{text}`\n\n - USA: WA\n - USA:MO\n - USA: Boston, MA\n - USA:CA:Davis\n - United Kingdom: Midlands and East of England\n - Malawi: GAZ\n \n \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fincatools%2Fbiosample-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fincatools%2Fbiosample-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fincatools%2Fbiosample-analysis/lists"}