{"id":20371863,"url":"https://github.com/biosustain/omics_valid","last_synced_at":"2026-05-09T08:31:42.306Z","repository":{"id":77443657,"uuid":"421512559","full_name":"biosustain/omics_valid","owner":"biosustain","description":"Specifications and validators for omics data formats used on C. autoethanogemum C1 program","archived":false,"fork":false,"pushed_at":"2022-01-26T09:55:54.000Z","size":435,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-12-02T19:18:15.013Z","etag":null,"topics":["bioinformatics","biology","omics","parser"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/biosustain.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-26T17:01:06.000Z","updated_at":"2021-12-10T09:47:21.000Z","dependencies_parsed_at":"2023-06-14T13:00:18.391Z","dependency_job_id":null,"html_url":"https://github.com/biosustain/omics_valid","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/biosustain/omics_valid","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biosustain%2Fomics_valid","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biosustain%2Fomics_valid/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biosustain%2Fomics_valid/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biosustain%2Fomics_valid/manifests","owner_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/biosustain","download_url":"https://codeload.github.com/biosustain/omics_valid/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/biosustain%2Fomics_valid/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32812240,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-08T08:22:46.396Z","status":"online","status_checked_at":"2026-05-09T02:00:06.633Z","response_time":123,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","biology","omics","parser"],"created_at":"2024-11-15T01:10:23.849Z","updated_at":"2026-05-09T08:31:42.285Z","avatar_url":"https://github.com/biosustain.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# omics_valid\n\nThis repository serves two purposes:\n\n1. Define specifications for our OMICS formats.\n2. Provide a validator to ease the use of the specifications.\n\n#### Table of contents\n\u003c!--ts--\u003e\n   * [Installation](#installation)\n      * [Building from source](#building-from-source)\n   * [Specifications](#supported-specifications)\n      * [Proteomics](#proteomics)\n      * [Tidy Proteomics](#tidy-proteomics)\n      * [Metabolomics](#metabolomics)\n   * [Usage](#usage)\n\u003c!--te--\u003e\n\n## Installation\n\nBinaries for Linux, Mac and Windows are released on every tag.\n\n1. 
Go to the [releases page](https://github.com/biosustain/omics_valid/releases/).\n2. Look for your platform in the file names (check for _apple_ or _windows_ under _Assets_) and download the file:\n\t- If Linux, you probably want the file which has `gnu` in the name.\n\t- If Mac, there is only one file.\n\t- If Windows, you probably want the `.zip` file.\n3. Unpack it; a binary file `omics_valid` should be extracted.\n4. (Optional) Put the extracted file `omics_valid` under your PATH.\n\n### Building from source\n\nInstall [cargo](https://doc.rust-lang.org/cargo/getting-started/installation.html) and run\n\n```\ngit clone https://github.com/biosustain/omics_valid.git\ncd omics_valid\ncargo install --path .\n```\n\n## Supported specifications\n\n### Proteomics\nProtein CSV **without header** in the form\n\n```csv\nUNIPROT_ID,NUMBER_VALUE_SAMPLE1,NUMBER_VALUE_SAMPLE2\n```\n\nwith an arbitrary number of samples. It will report:\n* Invalid Uniprot IDs.\n\nExample:\n\n```csv\nQ00496,100001,21283\nQ7B2Q4,123.3444,0\nE0X9C7,10.2,21283\nE0X97,1001,21283\nE0X9C7,1000.2,23131\n```\n\nRunning the command\n\n```shell\nomics_valid --format prot tests/uni.csv\n```\n\nwould output\n\n```\n1 lines[4]: E0X97 invalid Uniprot ID\n```\n\nsince \"E0X97\" is not a valid Uniprot ID.\n\n### Tidy Proteomics\n\nProtein CSV in the following tidy (see tidy data, [Hadley Wickham, 2014](https://www.jstatsoft.org/article/view/v059i10)) form:\n\n```csv\nuniprot,sample,value\nUNIPROT_ID,SAMPLE_NAME,NUMBER_VALUE\n```\n\nIt will report:\n* Invalid Uniprot IDs.\n* Empty sample names.\n\nExample:\n\n```csv\nuniprot,sample,value\nQ00496,cauto_h2,100001\nQ7B2Q4,cauto_h2,100.2\nE0X9C7,SIM3,203\n```\n\nRunning the command\n\n```shell\nomics_valid --format tidy_prot tests/uni_tidy.csv\n```\n\nwon't output anything since the file follows the specification.\n\n### Metabolomics\nMetabolomics CSV in the following tidy (see tidy data, [Hadley Wickham, 
2014](https://www.jstatsoft.org/article/view/v059i10)) form:\n\n```csv\nmet_id,sample,value\nMETABOLITE_IDENTIFIER,SAMPLE_NAME,NUMBER_VALUE\n```\n\nIt will report:\n* Identifier not found in the supplied SBML model.\n* Empty sample names.\n\nExample:\n\n```csv\nmet_id,sample,value\nglc__D,SIM1,2\ncpd00067,SIM3,1032\nclearly_not_a_metabolite,SIM1,2921\nacon_C,SIM1,18\nMNXM83,SIM2,317\n```\n\nRunning the command\n\n```shell\nomics_valid --format met --model tests/iCLAU786.xml tests/met_tidy.csv\n```\n\nwould output:\n\n```\n1 lines[4]: clearly_not_a_metabolite not in model!\n```\n\n### Transcriptomics\n\nRNA files for iModulon. These are experiments from SRA or local files.\n\n```csv\nExperiment,LibraryLayout,Platform,Run,R1,R2\nString,Single|Paired,ILLUMINA|PACBIO_SMRT|ETC,None|Number,None|path/to/file,None|path/to/file\n```\n\nIt may contain other fields. The validator will check the following (taken from [modulome-workflow](https://github.com/avsastry/modulome-workflow/tree/65c5bd3c9facef6a41899429403c531923aa5204/2_process_data#setup)):\n\n1. `Experiment`: For public data, this is your SRX ID. For local data, data should be named with a standardized ID (e.g. ecoli_0001)\n1. `LibraryLayout`: Either PAIRED or SINGLE\n1. `Platform`: Usually ILLUMINA, ABI_SOLID, BGISEQ, or PACBIO_SMRT\n1. `Run`: One or more SRR numbers referring to individual lanes from a sequencer. This field is empty for local data.\n1. `R1`: For local data, the complete path to the R1 file. If files are stored on AWS S3, filenames should look like `s3://\u003cbucket/path/to\u003e.fastq.gz`. `R1` and `R2` columns are empty for public SRA data.\n1. `R2`: Same as R1. 
This will be empty for SINGLE end sequences.\n\nAdditionally, the FASTQ files in R1 and R2 will be checked if present for possible format errors.\n\n```shell\nomics_valid -f rna tests/rna.csv\n```\n\nwould output\n\n```\n1 lines[35]: ./tests/data/some.fastq: Declared FASTQ path does not exist!\n1 lines[36]: ./tests/data/some.fastq: Declared FASTQ path does not exist!;\tInconsistent experiment: R1 and R2 did not match the LibraryLayout! (assuming local data since field 'Run' is empty)\n1 lines[38]: ./tests/invalid.fastq: failure reading FASTQ! One record is incorrect\n```\n\nAs can be seen, when more than one error is found in a single record,\nthe errors are concatenated with a \";\\t\".\n\n## Usage\n\n```shell\n$ omics_valid --help\nUsage: omics_valid [\u003cfile\u003e] [-f \u003cformat\u003e] [-m \u003cmodel\u003e] [-v]\n\nOmics format validator.\n\nPositional Arguments:\n  file              input omics file.\n\nOptions:\n  -f, --format      format of the file. Currently supported: {prot, tidy_prot,\n                    met, rna}\n  -m, --model       path to SBML model file, used for metabolite verification\n  -v, --version     display the version\n  --help            display usage information\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbiosustain%2Fomics_valid","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbiosustain%2Fomics_valid","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbiosustain%2Fomics_valid/lists"}