https://github.com/abhi18av/ljmu_masters_dissertation
- Host: GitHub
- URL: https://github.com/abhi18av/ljmu_masters_dissertation
- Owner: abhi18av
- License: bsd-3-clause
- Created: 2020-05-07T15:01:45.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2020-11-28T16:48:01.000Z (over 5 years ago)
- Last Synced: 2025-01-05T10:08:10.530Z (over 1 year ago)
- Topics: drug-resistance-prediction, h2oai, machine-learning, python, tuberculosis
- Language: Jupyter Notebook
- Homepage: https://zenodo.org/badge/latestdoi/262081640
- Size: 19.6 MB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Application of ML for DRP using WGS data on MTB genomes.
==============================
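This repository contains the code for my master's dissertation on applying machine learning (ML) to drug resistance prediction (DRP) from whole-genome sequencing (WGS) data of Mycobacterium tuberculosis (MTB) genomes.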
[DOI](https://zenodo.org/badge/latestdoi/262081640)
The following execution environments are recommended for running the code:
1. AWS Batch or Azure Batch for genomic pre-processing.
2. Azure ML Studio for the notebooks, on an adequately sized compute instance.
The remaining instructions are embedded in the `notebooks/FINAL/*.ipynb` notebooks.
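
Because the notebooks are named by execution order, a small helper can list them in the order they are meant to run. This is only an illustrative sketch: the function name `ordered_notebooks` is hypothetical and not part of the repository, and it assumes every notebook file name starts with a numeric prefix such as `001_`.

```python
from pathlib import Path
import re


def ordered_notebooks(folder):
    """Return the .ipynb file names in `folder`, sorted by numeric prefix.

    Hypothetical helper: assumes names like `001_feature_engineering.ipynb`.
    Files without a leading number followed by `_` are skipped.
    """
    numbered = []
    for path in Path(folder).glob("*.ipynb"):
        match = re.match(r"(\d+)_", path.name)
        if match:
            numbered.append((int(match.group(1)), path.name))
    # Sort on the integer prefix so that 010 comes after 002.
    return [name for _, name in sorted(numbered)]


if __name__ == "__main__":
    for name in ordered_notebooks("notebooks/FINAL"):
        print(name)
```

Run from the repository root, this prints the notebook names one per line in the intended reading order.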
Project Organization
------------
    ├── LICENSE
    ├── README.md
    │
    ├── conda_enviroment.yml    <- The minimal conda file needed to recreate the environment.
    ├── azure_enviroment.yml    <- The conda file for Azure ML Studio.
    │
    ├── data
    │   ├── interim             <- Intermediate data that has been transformed.
    │   ├── processed           <- The final, canonical data sets for modeling.
    │   └── raw                 <- The original, immutable data dump.
    │
    ├── models
    │   ├── ALL_FEATURES        <- Models trained on all features.
    │   │   └── FINAL
    │   │
    │   └── PCA300              <- Models trained on PCA300 features.
    │
    ├── notebooks
    │   └── FINAL               <- The final Jupyter notebooks, named in execution order.
    │       ├── 001_feature_engineering.ipynb
    │       ├── 002_choose_limited_tbportals_genomes.ipynb    <- Contains the SRA IDs of the genomes, which can be downloaded through download.nf.
    │       ├── 003_eda_mono_resistance.ipynb
    │       ├── 004_model_grids.ipynb
    │       ├── 005_stacked_ensemble.ipynb
    │       ├── 006_pca_based_ml.ipynb
    │       └── 007_model_inspection_with_without_pca.ipynb
    │
    └── src
        ├── genomic_preprocessing    <- Scripts for genomic pre-processing.
        │   ├── nyu_gatk.sh
        │   ├── download.nf
        │   ├── bwa.nf
        │   ├── fastqc.nf
        │   ├── gatk.nf
        │   ├── picard.nf
        │   ├── samtools.nf
        │   ├── tb_profiler.nf
        │   └── trimmomatic.nf
        │
        └── features                 <- Scripts to turn raw VCF data into tabular data for modeling.
            ├── 01_tbprofiler.py
            ├── 02_vcf_drop_cols.py
            ├── 03_filter_unique_snps.py
            ├── 04_binarize_vcf.py
            └── 05_final_snp_df.py
--------
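
The `features` scripts turn per-sample VCF data into a table for modeling. The sketch below shows one way such a binarization step could look. It is not the code from `04_binarize_vcf.py`: the function names, the variant key format `CHROM:POS:REF>ALT`, and the toy records are all assumptions made for illustration.

```python
def variant_keys(vcf_text):
    """Extract variant identifiers from the text of a single-sample VCF.

    Assumption: every non-header line is a variant present in the sample,
    and a variant is identified as CHROM:POS:REF>ALT.
    """
    keys = set()
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        keys.add(f"{chrom}:{pos}:{ref}>{alt}")
    return keys


def binarize(samples):
    """Build a presence/absence table from {sample_id: vcf_text}.

    Returns (columns, rows): the sorted variant keys, and for each sample
    a list of 0/1 flags in the same column order.
    """
    per_sample = {sid: variant_keys(text) for sid, text in samples.items()}
    columns = sorted(set().union(*per_sample.values()))
    rows = {
        sid: [1 if key in keys else 0 for key in columns]
        for sid, keys in per_sample.items()
    }
    return columns, rows


if __name__ == "__main__":
    # Toy, made-up records in VCF column order: CHROM POS ID REF ALT
    header = "#CHROM\tPOS\tID\tREF\tALT\n"
    toy = {
        "sample_A": header + "chr1\t100\t.\tA\tG\nchr1\t250\t.\tC\tT\n",
        "sample_B": header + "chr1\t250\t.\tC\tT\n",
    }
    columns, rows = binarize(toy)
    print(columns)
    for sid, flags in rows.items():
        print(sid, flags)
```

Each row of the resulting table is one genome and each column one variant, which is the shape most tabular learners expect. A real pipeline would also need to handle multi-allelic sites and genotype fields, which this sketch ignores.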