{"id":23109217,"url":"https://github.com/tcp-lab/transportome_profiler_test","last_synced_at":"2025-10-28T01:45:25.625Z","repository":{"id":267769912,"uuid":"756931756","full_name":"TCP-Lab/transportome_profiler_test","owner":"TCP-Lab","description":"R tests for the transportome_profiler project","archived":false,"fork":false,"pushed_at":"2024-12-12T09:54:57.000Z","size":77,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-09T11:12:19.928Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TCP-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-13T15:33:00.000Z","updated_at":"2024-12-12T09:55:02.000Z","dependencies_parsed_at":"2024-12-12T10:41:57.188Z","dependency_job_id":null,"html_url":"https://github.com/TCP-Lab/transportome_profiler_test","commit_stats":null,"previous_names":["tcp-lab/transportome_profiler_test"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TCP-Lab%2Ftransportome_profiler_test","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TCP-Lab%2Ftransportome_profiler_test/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TCP-Lab%2Ftransportome_profiler_test/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TCP-Lab%2Ftransportome_profiler_test/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/TCP-Lab","download_url":"https://codeload.github.com/TCP-Lab/transportome_profiler_test/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247092392,"owners_count":20882218,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-17T01:34:34.596Z","updated_at":"2025-10-28T01:45:25.545Z","avatar_url":"https://github.com/TCP-Lab.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# How to run the ___Transportome Profiler___ locally\n\n## Clone the repo\nFor any Linux system, including WSL:\n```bash\ngit clone git@github.com:TCP-Lab/transportome_profiler.git\ncd ./transportome_profiler\n```\n\n## Install R requirements\n```bash\nRscript ./src/helper_scripts/install_r_pkgs.R\n```\n\n## (Create and) Activate a Python virtual environment and install requirements\n```bash\npython -m venv env\nsource ./env/bin/activate\npip install -r ./src/requirements.txt\n\n# To update internal packages (i.e., bonsai, gene_ranker, metasplit, and panid)\n# from the respective git repos  \npip install --force-reinstall -r ./src/requirements.txt\n\n# When finished:\ndeactivate\n```\nMore on Virtual Environments [here](https://docs.python.org/3/tutorial/venv.html).\n\n## Install `xsv`\nThis is a program required by `metasplit` for the fast reshaping of very large\nCSV files.\n```bash\nsudo pacman -Syu xsv\n```\n\n## Install `fast-cohen`\nThis is a program written in __Rust__ that performs a fast computation of the\nCohen's _d_ statistics.\n```bash\ncargo install 
--git https://github.com/MrHedmad/fast-cohen.git\n```\n\n## Install `Kerblam!`\nThis is our workflow manager.\n```bash\n# Install a prebuilt binary\ncurl --proto '=https' --tlsv1.2 -LsSf https://github.com/MrHedmad/kerblam/releases/latest/download/kerblam-installer.sh | sh\n\n# Or, alternatively, install it from source with cargo\ncargo install kerblam\n\n# Check it with\nkerblam --version\n```\n\n## Fetch MTP-DB, TCGA and GTEx datasets from remote (through XENA)\n```bash\nkerblam data fetch\n```\nwill download in `/data/in` the following objects:\n1. `expression_matrix.tsv.gz`, an archive containing all the log2 raw counts for\n\ttranscript abundance quantification, as provided by TCGA and GTEx projects;\n1. `expression_matrix_metadata.tsv.gz`, an archive containing the related\n\tmetadata for each sample (i.e., patient or, better, specimen);\n1. `MTPDB.sqlite.gz`, an archive containing our\n\t[Membrane Transport Protein Database](https://github.com/TCP-Lab/MTP-DB)\n\tused for gene set definition;\n1. `ensg_data.csv`, giving for each ENSG ID the corresponding HGNC gene symbol; \n1. 
`geo`, a folder containing the tables of raw counts and metadata for\n\tindividual small studies retrieved from GEO for validation purposes.\n\n### Raw counts\nThe `expression_matrix.tsv.gz` archive is our Zenodo copy of the\n_gene expression RNAseq - RSEM expected_count_ data set by\n[XENA](https://xenabrowser.net/datapages/?dataset=TcgaTargetGtex_gene_expected_count\u0026host=https%3A%2F%2Ftoil.xenahubs.net\u0026removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443),\ncontaining both TCGA and GTEx _harmonized_ transcriptomics data.\n- author: _UCSC TOIL RNA-seq recompute_\n- unit: __log2(expected_count+1)__\n- size: 60,499 identifiers (genes) x 19,109 samples\n- download: https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGtex_gene_expected_count.gz\n\n#### Sample check\n```bash\nzcat expression_matrix.tsv.gz | head -n1 | grep -oE 'TCGA-|GTEX-|TARGET-|\\sK-' | wc -l\nzcat expression_matrix.tsv.gz | head -n1 | grep -oE 'TCGA-' | wc -l\nzcat expression_matrix.tsv.gz | head -n1 | grep -oE 'GTEX-' | wc -l\nzcat expression_matrix.tsv.gz | head -n1 | grep -oE 'TARGET-' | wc -l\nzcat expression_matrix.tsv.gz | head -n1 | grep -oE '\\sK-' | wc -l\n```\nreturned:\n| Source        | Sample size   |\n| ------------- |:-------------:|\n|**TOT**        |19,109         |\n|TCGA           |10,530         |\n|GTEX           |7,775          |\n|TARGET         |734            |\n|K              |70             |\n\n#### Gene check\n```bash\nzcat expression_matrix.tsv.gz | wc -l\n```\nreturned: `60499`.\n\n\n\n# Run `heatmaps` workflow on test data set\n## Generate the reduced data set for testing\nThis pipeline makes use of `metasample.py`\n```bash\nkerblam run gen_test_data\n```\n\u003e [!NOTE]  \n\u003e The accuracy of this step is verified by the `metasample` section of the\n\u003e `profiler_tests.R` script.\n\n## Run the analysis pipeline\nEdit the `./data/in/config/heatmaps_runtime_options.json` JSON file based on the desired options\n```json\n{\n    
\"rank_method\": \"norm_fold_change\",\n    \"threads\": 3,\n    \"save_extra_plots\": false,\n    \"prune_similarity\": 0.9,\n    \"prune_direction\": \"bottomup\",\n    \"run_unweighted\": false,\n    \"alpha_threshold\": 0.20,\n    \"cluster_heatmap_cols\": false\n}\n```\nIn particular, use `generanker --list-methods` within the Python virtual environment, to see the available ranking metrics implemented by _Gene Ranker_. Then\n```bash\nkerblam run heatmaps -l --profile test\n```\nthis will run the following modules (from `./src/modules/`) along with the\nrelated dependencies:\n1. `ranking/select_and_run.py`\n\t- metasplit\n\t- gene_ranker\n\t\t- fast-cohen\n1. `make_genesets.py`\n\t- bonsai\n1. `run_gsea.R`\n1. `plotting/plot_large_heatmap.R`\n\n### Rank genes\n`metasplit` is the program used by `select_and_run.py` to parse the JSON query\nfile (`./data/in/config/DEA_queries/dea_queries.json`) and extract within-group (i.e., for each cancer type) case and control submatrices from the global expression matrix.\nThen, for each cancer type, `gene_ranker` is run to calculate the ranking metric selected through the `rank_method` JSON property.\nFinal ranks are saved in the `./data/deas/` directory as two-column (sample, ranking) CSV tables.\n\n### Make the gene sets\n#### 1. Generate large tables\n`make_genesets.py` uses Ariadne's `make_large_tables()` function to make 9\nfundamental \"large tables\" based on the queries hardcoded in\n`./data/in/config/gene_lists/basic.json`\n```\nwhole_transportome\n|___pores\n|\t|___channels\n|\t|___aquaporins\n|___transporters\n\t|___solute_carriers\n\t|___atp_driven\n\t\t|___ABC\n\t\t|___pumps\n```\n#### 2. 
Generate lists from large tables\nFor each large table, `bonsai` is used to generate tree structures representing\nall the possible gene sets, based on the 3 parameters of the function\n`generate_gene_list_trees()`, with the following default values:\n```python\nmin_pop_score: float = 0.5,\nmin_set_size: int = 10,\nmin_recurse_set_size: int = 40,\n```\nwith the following meaning:\n- `min_pop_score`: minimum portion of non-NA values in a column to be considered for gene lists.\n- `min_set_size`: minimum number of genes to produce a valid gene set.\n- `min_recurse_set_size`: minimum parent-geneset size to have before running\nrecursion on it (effective if `recurse` boolean is `True`).\n\nThese parameters cannot be assigned by editing the `heatmaps_runtime_options.json` JSON file because they are considered lower-level parameters; however, they can be passed as arguments to the `make_genesets.py` script within the `heatmaps.makefile` workflow. \n```make\n# E.g.,\npython $(mods)/make_genesets.py ./data/MTPDB.sqlite ./data/in/config/gene_lists/basic.json \\\n\t./data/genesets.json ./data/genesets_repr.txt \\\n\t--min_pop_score 0.7 \\\n\t--min_set_size 15 \\\n\t--min_recurse_set_size 20 \\\n\t--prune_direction $(PRUNE_DIRECTION) \\\n\t--prune_similarity $(PRUNE_SIMILARITY) \\\n\t--verbose\n```\n\n#### 3. Make the union of the genesets following the structure\nAll gene sets are merged together into a large tree structure, then the\n`prune()` function is used to remove redundancy. The two parameters of the `prune()`\nfunction (`similarity` and `direction`) are set in the\n`./data/in/config/heatmaps_runtime_options.json` file, with the following\ndefaults:\n```json\n\"prune_similarity\": 0.5,\n\"prune_direction\": \"bottomup\",\n```\n\n\n\n\nNotes on DESeq2\n\n1. 
You have to set the option `minReplicatesForReplace` to `Inf` in `DESeq` in\n\torder to never replace outliers (and so have the baseMeans exactly equal to\n\tthe mean of the MOR-normalized counts across **all** samples)\n\t`dds2 \u003c- DESeq2::DESeq(dds, minReplicatesForReplace = Inf)`\n1. **log2 Fold Change in DESeq2 is not identical to FC calculated from\n\tnormalized count** (https://support.bioconductor.org/p/p134193/)\n\t_...turning off fold change shrinkage should make log2foldchange from\n\tDESeq2 be simply equal to (mean of normalized counts group B) / (mean of\n\tnormalized counts group A). However, it seems that some degree of fold\n\tchange moderation is done even when betaPrior is False._\n\t**It's not always equal to the ratio of the mean of normalized counts\n\tdepending on the fit of the GLM, but close (when no other factors are\n\tpresent in the design).** Michael Love\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftcp-lab%2Ftransportome_profiler_test","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftcp-lab%2Ftransportome_profiler_test","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftcp-lab%2Ftransportome_profiler_test/lists"}