{"id":37608681,"url":"https://github.com/krishnanlab/txt2onto","last_synced_at":"2026-01-16T10:15:12.670Z","repository":{"id":45647706,"uuid":"362902012","full_name":"krishnanlab/txt2onto","owner":"krishnanlab","description":"Code for classifying unstructured text to tissue ontology terms using natural language processing and machine learning.","archived":false,"fork":false,"pushed_at":"2022-10-20T21:57:12.000Z","size":66934,"stargazers_count":23,"open_issues_count":2,"forks_count":5,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-05-14T00:07:20.491Z","etag":null,"topics":["annotations","nlp","omics-samples"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/krishnanlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-29T17:52:30.000Z","updated_at":"2024-05-01T15:48:35.000Z","dependencies_parsed_at":"2023-01-20T16:46:52.230Z","dependency_job_id":null,"html_url":"https://github.com/krishnanlab/txt2onto","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/krishnanlab/txt2onto","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2Ftxt2onto","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2Ftxt2onto/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2Ftxt2onto/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2Ftxt2onto/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/krishnanlab","do
wnload_url":"https://codeload.github.com/krishnanlab/txt2onto/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2Ftxt2onto/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28478049,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T06:30:42.265Z","status":"ssl_error","status_checked_at":"2026-01-16T06:30:16.248Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotations","nlp","omics-samples"],"created_at":"2026-01-16T10:15:12.581Z","updated_at":"2026-01-16T10:15:12.646Z","avatar_url":"https://github.com/krishnanlab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Systematic tissue annotations of genomic samples by modeling unstructured metadata\n\nThis repo is the home of `txt2onto`, a Python utility for classifying unstructured text to terms in a tissue ontology using NLP-ML – a combination of natural language processing (NLP) and machine learning (ML). Also in this repo are our fully trained NLP-ML models to perform the tissue classification on unstructured text. 
We have included sample inputs and outline the use of NLP-ML with a demo script.\n\nThe NLP-ML method is described in this [preprint](https://doi.org/10.1101/2021.05.10.443525) `bioRxiv DOI: 10.1101/2021.05.10.443525`.\n\n\n## More Info\n\nThere are currently \u003e1.3 million human –omics samples that are publicly available. However, this valuable resource remains acutely underused because discovering samples, say from a particular tissue of interest, from this ever-growing data collection is still a significant challenge. The major impediment is that sample attributes such as tissue/cell-type of origin are routinely described using non-standard, varied terminologies written in unstructured natural language. Here, we provide a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomic samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample text descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms in a structured ontology. Our approach significantly outperforms representative methods from existing state-of-the-art approaches to the sample annotation problem. We have also demonstrated the biological interpretability of tissue NLP-ML models using an analysis of their similarity to each other and an evaluation of their ability to classify tissue- and disease-associated biological processes based on their text descriptions alone. \n\nPrevious studies have shown that the molecular profiles associated with genomic samples are highly predictive of a variety of sample attributes. Using transcriptome data, we have shown that NLP-ML models can be nearly as accurate as expression-based models in predicting sample tissue annotations. However, the latter (models based on genomic profiles) need to be trained anew for each genomic experiment type. 
On the other hand, once trained using any text-based gold-standard, approaches such as NLP-ML can be used to classify sample descriptions irrespective of sample type. We demonstrated this versatility by using NLP-ML models trained on microarray sample descriptions to classify RNA-seq, ChIP-seq, and methylation samples without retraining.\n\nHere, we provide the fully trained models and a simple utility script for users to leverage the predictive power of NLP-ML to annotate their text corpora of interest for 346 tissues and cell-types from [UBERON](https://www.ebi.ac.uk/ols/ontologies/uberon). These NLP-ML models are trained using our full gold standard. As a note: in our manuscript, we discussed the results of models that had sufficient training data in the gold standard for at least 3-fold cross-validation (CV). The remaining models were not discussed or examined in detail in our work due to lack of sufficient labeled samples. We have included trained models for all available tissues and cell-types so as to provide users with the maximum amount of predictive capability. However, it should be noted that some models included in this repository have very little training data (i.e., a small number of positively labeled examples) and thus may provide inaccurate annotations. The full list of cross-validated models can be found [here](https://github.com/krishnanlab/txt2onto/blob/main/gold_standard/CrossValidatedModels.txt), and the full list of models presented in our paper can be found [here](https://github.com/krishnanlab/txt2onto/blob/main/gold_standard/ManuscriptModels.txt).\n\n\n## Installation\n\nRequirements can be installed by running `pip install -r requirements.txt`. These requirements have been verified for Python version 3.7.7.\n\nAn update in library dependencies requires scikit-learn to be version 0.23.2 as listed in `requirements.txt`. The full models were trained using scikit-learn version 0.23.1. 
When loading the models, the following warning message will be shown:\n\n```\nUserWarning: Trying to unpickle estimator LogisticRegression from version 0.23.1 \nwhen using version \u003cYOUR CURRENT VERSION\u003e. This might lead to breaking code or invalid results. \nUse at your own risk. warnings.warn()\n```\n\nIn our testing with newer versions of scikit-learn, we have encountered no problems. If a problem does arise, please post a git issue and we will work to resolve it. \n\n## Docker installation\n\nSince getting the correct versions and dependencies can sometimes be challenging, we make a `Dockerfile` available for building locally or for extending as needed.\n\n```\ndocker build .\n```\n\nThe resulting docker image will contain the checked out version of the repo. After `docker run ...`, continue through the Usage sections as below.\n\n## Usage\n\n### Use Case 1: Making predictions on unstructured text using NLP-ML\n\n#### Input\n\nThe input should be a plain text file with one description per line. An example is provided [here](https://github.com/krishnanlab/txt2onto/blob/main/data/example_input.txt) with a small excerpt below.\n\n```\nna colon homo sapiens colonoscopy male adenocarcinoma extract total rna le biotin norm specified colonoscopy male adenocarcinoma specified ...\nhomo sapiens adult allele multiple sclerosis clinically isolated syndrome none peripheral blood mononuclear human pbmc non treated sampling ...\nwholeorganism homo sapiens prostate prostate patient id sample type ttumor biopsy ctrlautopsy sample percentage tumor prostate patient ...\nnormal adult patient normal adult patient age gender male skeletal muscle homo sapiens unknown extract total rna le biotin norm unknown ...\nmedium lp stimulation blood lp homo sapiens myeloid monocytic cell medium lp stimulation extract total rna le biotin norm medium lp stimulation ...\n```\n\nThe input text will be preprocessed during the execution of `src/txt2onto.py`. 
For more information on the preprocessing pipeline, see the `preprocess` function in `src/utils.py`.\n\n#### Output\n\nThe prediction task can then be performed by running the following:\n\n```\npython txt2onto.py --file /path/to/text/file.txt --out /path/to/write/embeddings/to.txt --predict\n```\n\nThis will read in the input text from `/path/to/text/file.txt`, create a word embedding for each line of text and write it to `/path/to/write/embeddings/to.txt`, and then make a prediction for each line of text for each of our models and write it to `/path/to/write/embeddings/predictions_to.txt`. The output path for predicted probabilities is automatically generated when the `--predict` flag is passed. The i,j entry of the output dataframe is the predicted probability assigned by model i for text snippet j from the input file. If you only want embeddings for your input text, omit the `--predict` flag.\n\nAlternatively, a single text snippet can be read from the command line:\n\n```\npython txt2onto.py --text SOME SAMPLE DESCRIPTION OR PIECE OF TEXT --out /path/to/write/embeddings/to.txt --predict\n```\n\nThis will write a single word embedding to `/path/to/write/embeddings/to.txt` and write the predictions to `/path/to/write/embeddings/predictions_to.txt`. \n\nWord embeddings are always generated and written to file whether predictions are made or not.\n\n#### Demo\n\nFor an example, run `sh demo.sh` in the `src/` directory. The script will execute the full predictive pipeline of NLP-ML. Each line of input text will be turned into a numerical representation by taking a weighted average of the individual word embeddings from the text, and each resulting embedding will then be run through each of our trained NLP-ML models. 
\n\n```bash\ncd src/\nsh demo.sh\n```\n\nThis will read in the example input file from `data/example_input.txt`, write embeddings to `out/example_output.txt`, and write predictions to `out/predictions_example_output.txt`.\n\n### Use Case 2: Training new NLP-ML models\n\nGiven labeled text snippets, new NLP-ML models can be trained for additional tissues, cell types, or potentially other binary classification problems. \n\nIn order to train a new model, an input file of the following is required:\n\n```\nID1  1  TEXT\nID2 -1  TEXT\nID3 -1  TEXT\nID4  1  TEXT    \n```\n\nThe input file is a three column, tab-separated file. Column 1 is the ID corresponding to the sample, input text, etc. Column 2 is the true label, either +1 or -1. The final column is the text associated with the sample. An example training input can be found at `../training_inputs/UBERON-0002113_NLP-ML-input_SUBSET.txt`, which is a subset of our gold standard for UBERON:0002113, kidney. A full training input for any tissue or cell type in our gold standard can be created by running the following from the `src/` directory:\n\n```\npython input.py --ont UBERON:0002113 --out ../training_inputs/\n```\n\n\nWhen trying to make a training input from our gold standard, if an ontology term is specified that we do not have true labels for, a `ValueError` will be raised. Once an input of the specified format is created, a model can be trained from the given input using the following:\n\n```\npython train.py --input ../training_input/labeled_input.txt --out ../user_trained_models/\n```\n\nThis will train a model from the embeddings created from the text in the input file, and save the model as a pickled object to the specified output directory. The output filename is automatically generated from the training input. 
For example, an input of `../training_inputs/UBERON-0002113_NLP-ML-input_SUBSET.txt` used to train a model with a specified output of `../user_trained_models/` will be saved as `../user_trained_models/MODEL_UBERON-0002113_NLP-ML-input_SUBSET.p`.\n\nTo use newly trained models to make predictions as opposed to our trained NLP-ML models, a new model directory can be specified when running `txt2onto.py`. For example, user-trained models in the `../user_trained_models/` directory can be used instead of the `../bin/` directory by passing the `--models` flag to `txt2onto.py`:\n\n```\npython txt2onto.py --file /path/to/text/file.txt --out /path/to/write/embeddings/to.txt --predict --models ../user_trained_models/\n```\n\nThis will allow you to make predictions on new unstructured text using your own trained models.\n\n## Overview of Repository\n\nHere, we list the files we have included as part of this repository.\n\n* `bin/` - The fully trained Logistic Regression models stored as pickle (`.p`) files\n* `data/` - Example input file and files needed for making embeddings and output predictions\n    * `data/UBERONCL.txt` - A text file that maps the model ontology identifiers to plain text\n    * `data/pubmed_weights.txt.gz` - IDF weights for every unique word across PubMed used to make a weighted average embedding for each piece of text\n* `gold_standard/` - Raw datafiles from our manuscript\n    * `gold_standard/AnatomicalSystemsPerModel.json` - Mapping of every term in UBERON to a high-level anatomical system\n    * `gold_standard/CrossValidatedModels.txt` - A list of models we had sufficient positively labeled training data to perform cross validation on\n    * `gold_standard/GoldStandardLabelMatrix_PlainText.csv` - Our manually annotated gold standard in plain text\n    * `gold_standard/GoldStandardLabelMatrix.csv` - Our manually annotated gold standard with ontology identifiers\n    * `gold_standard/GoldStandard_Propagated.txt` - Our manually annotated gold 
standard with a list of annotations for each sample not in matrix form\n    * `gold_standard/GoldStandard_Sample-Descriptions.txt` - Sample descriptions for the samples in our gold standard\n    * `gold_standard/GoldStandard_Sample-IDS.txt` - Sample and experiment labels corresponding to `gold_standard/GoldStandard_Sample-Descriptions.txt`\n    * `gold_standard/GoldStandard_Unpropagated.txt` - The original gold standard manual annotations [1]\n    * `gold_standard/ManuscriptModels.txt` - A list of the models we evaluated and showed results for in our manuscript, a subset of `gold_standard/CrossValidatedModels.txt`\n    * `gold_standard/ModelsPerAnatomicalSystem.json` - Mapping that lists the tissues and cell-types that belong to each high-level anatomical system\n* `src/` - Main source directory\n    * `src/demo.sh` - Runs an example of the pipeline\n    * `src/txt2onto.py` - Primary file for making predictions on input text\n    * `src/utils.py` - Utility file containing tools for making predictions on input text\n* `out/` - Example directory to send outputs to\n* `paper_results/` - Files with raw values used to create figures from our publication\n* `training_inputs/` - Directory containing example input file for training new NLP-ML models\n* `user_trained_models/` - Directory where user trained models can be stored using included demo\n\n\n## Additional Information\n\n### Support\nFor support, please contact [Nat Hawkins](http://www.nathawkins.info/) at hawki235@msu.edu.\n\n### Inquiry\nAll general inquiries should be directed to [Arjun Krishnan](www.thekrishnanlab.org) at arjun@msu.edu.\n\n### License\nThis repository and all its contents are released under the [Creative Commons License: Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode); See [LICENSE.md](https://github.com/krishnanlab/txt2onto/blob/main/LICENSE).\n\n### Citation\nIf you use this work, please cite:  \n**Systematic tissue 
annotations of genomic samples by modeling unstructured metadata**  \nNathaniel T. Hawkins, Marc Maldaver, Anna Yannakopoulos, Lindsay A. Guare, Arjun Krishnan  \n_bioRxiv_ 2021.05.10.443525; doi: https://doi.org/10.1101/2021.05.10.443525\n\n### Funding\nThis work was primarily supported by US National Institutes of Health (NIH) grants R35 GM128765 to AK and in part by MSU start-up funds to AK and MSU Rasmussen Doctoral Recruitment Award and Engineering Distinguished Fellowship to NTH.\n\n### Acknowledgements\nThe authors would like to thank [Kayla Johnson](https://sites.google.com/view/kaylajohnson/home) for their support and feedback on the manuscript, and all members of the [Krishnan Lab](www.thekrishnanlab.org) for valuable discussions and feedback on the project.\n\n### References\n[1] **Ontology-aware classification of tissue and cell-type signals in gene expression profiles across platforms and technologies**. Lee Y, Krishnan A, Zhu Q, Troyanskaya OG. Bioinformatics (2013) 29:3036-3044.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrishnanlab%2Ftxt2onto","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkrishnanlab%2Ftxt2onto","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrishnanlab%2Ftxt2onto/lists"}