{"id":25608386,"url":"https://github.com/antonkulaga/yspecies","last_synced_at":"2026-05-14T22:32:09.655Z","repository":{"id":73828132,"uuid":"321498059","full_name":"antonkulaga/yspecies","owner":"antonkulaga","description":"Code from \"Machine learning analysis of longevity-associated gene expression landscapes in mammals\" paper.","archived":false,"fork":false,"pushed_at":"2022-08-03T23:26:45.000Z","size":24448,"stargazers_count":0,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-01-27T02:03:17.641Z","etag":null,"topics":["aging","bioinformatics","comparative-transcriptomes","dvc","lifespan","longevity","machine-learning","ml","paper-implementations","shap","species","transcriptomics"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/antonkulaga.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-12-14T23:20:52.000Z","updated_at":"2022-01-09T21:18:05.000Z","dependencies_parsed_at":"2023-09-26T09:59:53.039Z","dependency_job_id":null,"html_url":"https://github.com/antonkulaga/yspecies","commit_stats":{"total_commits":121,"total_committers":3,"mean_commits":"40.333333333333336","dds":"0.016528925619834656","last_synced_commit":"37a1d905e603070df6e8e7760a045b0552fecd95"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antonkulaga%2Fyspecies","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antonkulaga%2Fyspecies/tags","releases_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/antonkulaga%2Fyspecies/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antonkulaga%2Fyspecies/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/antonkulaga","download_url":"https://codeload.github.com/antonkulaga/yspecies/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240083141,"owners_count":19745356,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aging","bioinformatics","comparative-transcriptomes","dvc","lifespan","longevity","machine-learning","ml","paper-implementations","shap","species","transcriptomics"],"created_at":"2025-02-21T20:29:07.067Z","updated_at":"2026-05-03T19:30:16.123Z","avatar_url":"https://github.com/antonkulaga.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"YSpecies\n========\n\nThis repository serves two purposes:\n* Running cross-species analyses on the data collected by the Cross-Species project of the [Systems Biology of Aging Group](http://aging-research.group)\n* Reproducing the analysis of the \"Machine learning analysis of longevity-associated gene expression landscapes in mammals\" paper\n\n\u003e If you use the code or data from this project, please do not forget to cite our paper. 
\n\u003e If you have any questions regarding the data, the code, or the paper, feel free to contact the [Systems Biology of Aging Group](http://aging-research.group) or open an issue on GitHub.\n\n## Role of this repository in cross-species machine learning pipeline ##\n\n![Cross-species Machine learning pipeline](/data/images/pipeline.png?raw=true \"Machine learning pipeline in the paper\")\n\nThis figure illustrates the core elements of the Cross-Species ML pipeline:\n\n### RNA-quantification ###\nFor downloading and preparing the indexes of reference genomes and transcriptomes, the [species-notebooks](https://github.com/antonkulaga/species-notebooks) repository can be used.\n\nFor RNA-Seq processing of samples, the [quantification](https://github.com/antonkulaga/rna-seq/tree/master/pipelines/quantification) pipeline can be used.\n\nFor uploading [Compara orthology data](ftp://ftp.ensembl.org/pub/current_compara) as well as the quantified data of our samples to a GraphDB database, the [species-notebooks](https://github.com/antonkulaga/species-notebooks) repository can be used.\n\n### LightGBM+SHAP stages I, II models ###\n\nTo reproduce the stage I and II models, the current [yspecies](https://github.com/antonkulaga/yspecies) repository can be used (see documentation below).\nThere are dedicated notebooks for those stages:\n* **stage_one_shap_selection notebook** contains stage one shap_selection code\n* **stage_two_shap_selection notebook** contains stage two shap_selection code\n\n### Other models ###\n\nLinear models are implemented in the [cross-species-linear-models](https://github.com/ursueugen/cross-species-linear-models) repository.\nBayesian network analysis and multilevel Bayesian linear modelling are available in the [bayesian_networks_and_bayesian_linear_modeling](https://github.com/rodguinea/bayesian_networks_and_bayesian_linear_modeling) repository.\n\nAt the same time, the results of both of these models can be pulled with [DVC](https://dvc.org) in the current 
[yspecies](https://github.com/antonkulaga/yspecies) repository.\n\n### Ranked results ###\n\nTo generate a ranked table, the current [yspecies](https://github.com/antonkulaga/yspecies) repository can be used (see documentation below).\nThe dedicated **results_intersections notebook** generates the ranked tables.\n\n### LightGBM+SHAP stage III ###\n\nTo reproduce this stage you can use the **stage_three_shap_selection notebook** in the notebooks folder.\n\nProject structure\n-----------------\n\nThe _data_ folder holds _input_, _interim_ and _output_ data.\n\nBefore running anything, do not forget to `dvc pull` the data, and after committing do not forget to `dvc push` it!\n\nThe pipeline is run by executing DVC stages (see the dvc.yaml file).\n\nMost of the analysis is written in Jupyter notebooks in the notebooks folder.\n\nEach stage runs the corresponding notebooks (and puts their inputs and outputs under version control) using papermill, which also stores the output of the notebooks in data/notebooks.\n\n\nGetting started\n-------------------\n\nYou can use either micromamba/conda/anaconda or a Docker container to set up the project.\n\n### Micromamba/Conda setup\n\nFirst, create a [Conda environment](https://docs.conda.io/en/latest/miniconda.html) or a [Micromamba environment](https://github.com/mamba-org/mamba) for the project.\nMicromamba is a faster, lightweight alternative to Conda with a very similar API.\n\nTo create the environment, run:\n```bash\nmicromamba create --file environment.yaml\nmicromamba activate yspecies\n```\nIf any errors occur during setup, please read the known issues at the bottom of README.md. If the problem is not mentioned there, feel free to open a GitHub issue.\n\nThen pull the data with DVC. To do this, activate the yspecies environment and run:\n```\ndvc pull\n```\nNOTE: we keep the data on Google Drive, so on the first run of `dvc pull` it may give you a link to allow access to your Google Drive to download the project 
data, like this:\n![DVC confirm_permissions](/data/images/dvc_gdrive.png?raw=true \"Give Google Drive Permissions\")\n\nWe are grateful to @shcheklein and @dmpetrov for their help with DVC configuration.\n\nAfter authentication, you can run any of the pipelines with:\n```\ndvc repro\n```\nor you can run the Jupyter notebooks to explore them on your own (see the section on running notebooks manually).\n\n### Docker setup\n\nAlternatively, you can use a Docker container that already contains the micromamba environment with everything pre-installed.\nGet inside the container with:\n```\ndocker run -i -t --network host quay.io/comp-bio-aging/yspecies:latest\n```\nThe micromamba environment is activated automatically inside the container.\nTo reproduce the pipelines you can run:\n```\ndvc repro\n```\nYou can also pull the data and start JupyterLab to work with the notebooks:\n```bash\ndvc pull\njupyter lab notebooks --allow-root\n```\n\nRunning stages\n--------------\nDVC stages are defined in the dvc.yaml file. To run a DVC stage, use `dvc repro \u003cstage_name\u003e`:\n```bash\ndvc repro \u003cstage_name\u003e\n```\nMost of the stages also produce notebooks alongside their output files.\n\n# Key notebooks #\n\nThere are several key notebooks in the project. All notebooks can be run either from Jupyter (with `jupyter lab notebooks`) or from the command line with `dvc repro`.\n* **select_samples notebook** performs preprocessing to select the right combination of samples, genes and species. 
Most of the other notebooks depend on it.\n* **stage_one_shap_selection notebook** contains stage one shap_selection code\n* **stage_two_shap_selection notebook** contains stage two shap_selection code\n* **stage_three_shap_selection notebook** contains stage three shap_selection code\n* **results_intersections notebook** is used to compute intersection tables taken from several analysis methods (linear, causal and SHAP)\n* For each of the stages there are also **stage_\u003cnumber\u003e_optimize** notebooks which contain hyperparameter optimization code\n\n## Running notebooks manually ##\n\nYou can run notebooks manually by activating the yspecies environment and running:\n```bash\njupyter lab notebooks\n```\nand then running the notebook of your choice.\nHowever, keep in mind that the notebooks depend on each other.\nIn particular, the select_samples notebook generates the data for all the others.\n\n\n# Core SHAP selection logic #\n\nMost of the code is organized into classes. The workflow is built on top of scikit-learn Pipelines. 
For an in-depth description of the pipeline, read the Cross-Species paper.\n\n# Yspecies package #\n\nThe yspecies package has the following modules:\n* dataset - ExpressionDataset class to handle cross-species samples, genes, species metadata and expressions\n* partition - classes required for the scikit-learn pipeline, starting from ExpressionDataset and going to SortedStratification\n* helpers - auxiliary methods\n* preprocess - classes for preprocessing steps of the cross-species pipeline\n* config - project-specific config values (for example, folder locations)\n* tuning - classes for hyperparameter optimization\n* workflow - general classes with advanced scikit-learn workflow building blocks\n* models - cross-validation models and metrics\n* selection - LightGBM and SHAP-based feature selection\n* explanations - FeatureSelection results, plots and auxiliary methods to explore them\n* utils - various utility functions and classes\n* workflow - helper classes required to reproduce pipelines in the paper\n\nThe code in the yspecies folder is a conda package that is used inside the notebooks. 
There is also an option to use a [conda version of the package](https://anaconda.org/antonkulaga/yspecies).\n\n## ExpressionDataset ##\n\nOne of the key classes is the ExpressionDataset class:\n```python\ne = ExpressionDataset(\"5_tissues\", expressions, genes, samples)\ne\n```\nIt allows indexing by genes:\n```python\ne[[\"ENSG00000073921\", \"ENSG00000139687\"]]\n# or\ne.by_genes[[\"ENSG00000073921\", \"ENSG00000139687\"]]\n```\nBy samples:\n```python\ne.by_samples[[\"SRR2308103\",\"SRR1981979\"]]\n```\nBoth:\n```python\ne[[\"ENSG00000073921\", \"ENSG00000139687\"],[\"SRR2308103\",\"SRR1981979\"]]\n```\n### Filtering ###\nThe ExpressionDataset class has by_genes and by_samples properties which allow indexing and filtering.\nFor instance, to keep only blood tissue samples:\n```python\ne.by_samples.filter(lambda s: s[\"tissue\"]==\"Blood\")\n```\n\nThe class is also Jupyter-friendly: it implements the `_repr_html_()` method.\n\n\n## partition module ##\n\nKey logic from the start of the pipeline up to partitioning the data according to sorted stratification.\n\n\nClasses with data:\n* FeatureSelection - specifies which fields we want to select from ExpressionDataset's species, samples, genes\n* EncodedFeatures - class responsible for encoding categorical features\n* ExpressionPartitions - data class with the results of partitioning\n\nTransformers:\n* DataExtractor - transformer that takes an ExpressionDataset and extracts data from it according to the FeatureSelection instructions\n* DataPartitioner - transformer that does sorted stratification\n\n## selection module ##\n\nThis module is responsible for SHAP-based selection.\n\nClasses with data:\n* Fold - results of one fold\n\nAuxiliary classes:\n* ModelFactory - used by ShapSelector to initialize the model\n* Metrics - helper methods to deal with metrics\n\nTransformers:\n\n* ShapSelector - key transformer that does the learning\n\n## results module ##\n\nModule that contains final results\n\n* FeatureResults is a key class that contains the selected features, 
folds as well as auxiliary methods to plot and investigate results.\n\n# KNOWN ISSUES #\n\nHere we list workarounds for some typical problems encountered when running the repository:\n\n1) error trying to exec 'cc1plus': exe: No such file or directory\n\nThis error occurs when g++ is not installed.\nThe workaround is simple:\n```\nsudo apt install g++\n```\n\n2) Failures to download the files: if one or more files were not downloaded, re-run `dvc pull`.\n\n3) Windows- and Mac-specific errors.\n\nEven though yspecies seems to work on Mac and Windows, we used Linux as our main operating system and did not test it thoroughly on Windows and Mac, so feel free to report any issues with them.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fantonkulaga%2Fyspecies","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fantonkulaga%2Fyspecies","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fantonkulaga%2Fyspecies/lists"}