{"id":29508040,"url":"https://github.com/thearpankumar/adaptation-gene-prediction","last_synced_at":"2025-07-16T04:45:48.925Z","repository":{"id":299564593,"uuid":"1003430275","full_name":"thearpankumar/Adaptation-Gene-Prediction","owner":"thearpankumar","description":"Machine Learning model that can predict whether a gene from the bacterium Deinococcus radiodurans helps it survive extreme stress (like radiation)","archived":false,"fork":false,"pushed_at":"2025-06-17T06:31:17.000Z","size":20,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-06-17T07:29:16.457Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thearpankumar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-17T06:23:04.000Z","updated_at":"2025-06-17T06:31:20.000Z","dependencies_parsed_at":"2025-06-17T07:29:26.299Z","dependency_job_id":"2c844baa-54ef-42de-924b-f0c391cd0685","html_url":"https://github.com/thearpankumar/Adaptation-Gene-Prediction","commit_stats":null,"previous_names":["thearpankumar/adaptation-gene-prediction"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/thearpankumar/Adaptation-Gene-Prediction","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thearpankumar%2FAdaptation-Gene-Prediction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thearpankumar%2FAdaptation-Gene-Predicti
on/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thearpankumar%2FAdaptation-Gene-Prediction/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thearpankumar%2FAdaptation-Gene-Prediction/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thearpankumar","download_url":"https://codeload.github.com/thearpankumar/Adaptation-Gene-Prediction/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thearpankumar%2FAdaptation-Gene-Prediction/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265482450,"owners_count":23774045,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-16T04:45:47.876Z","updated_at":"2025-07-16T04:45:48.901Z","avatar_url":"https://github.com/thearpankumar.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jQbwFb-HLqWTv7pHwN9mDyGylIdVTp4w?usp=sharing)\n# Machine Learning for Adaptation Gene Prediction\n\nThis project uses machine learning to predict whether a gene from the bacterium *Deinococcus radiodurans* contributes to stress tolerance based on features derived purely from its DNA sequence. 
The goal is to create a computational tool that can rapidly screen genomes for candidate stress-response genes, potentially aiding in bioengineering and synthetic biology.\n\n## Table of Contents\n- [Project Goal](#project-goal)\n- [How It Works](#how-it-works)\n- [Features Engineered](#features-engineered)\n- [Machine Learning Pipeline](#machine-learning-pipeline)\n- [How to Use This Project](#how-to-use-this-project)\n  - [Prerequisites](#prerequisites)\n  - [Installation](#installation)\n  - [Running the Notebook](#running-the-notebook)\n- [File Descriptions](#file-descriptions)\n- [Example Prediction Workflow](#example-prediction-workflow)\n- [Future Work](#future-work)\n\n## Project Goal\n\nThe primary objective is to build and train a machine learning model that can classify a given gene as either a \"stress-response gene\" or a \"normal/housekeeping gene.\" Instead of relying on expensive and time-consuming laboratory experiments, this model leverages patterns in the DNA sequence itself to make predictions.\n\nThis serves as a proof-of-concept for a high-throughput screening tool to prioritize genes for further study in newly sequenced organisms, especially extremophiles.\n\n## How It Works\n\nThe project follows a classic supervised learning approach:\n\n1.  **Data Collection:** The complete annotated genome of *Deinococcus radiodurans* (`.gbff` format) is downloaded from the NCBI database.\n2.  **Labeling:** Genes are assigned one of two labels:\n    *   **`Stress` (Positive Class):** A list is created using a hybrid strategy of (a) manually curated, literature-verified stress genes and (b) programmatic searching for functional keywords like \"DNA repair,\" \"radiation resistance,\" \"chaperone,\" etc.\n    *   **`Control` (Negative Class):** A list of housekeeping genes is created using a similar strategy, identifying genes for essential functions like \"ribosomal protein\" or \"gyrase.\"\n3.  
**Feature Engineering:** Each gene's DNA sequence is converted into a set of meaningful numerical features that the model can understand.\n4.  **Model Training:** An XGBoost classifier is trained on the labeled, feature-engineered dataset to learn the patterns that differentiate the two classes.\n\n## Features Engineered\n\nThe model uses a rich set of features to build its predictions:\n\n-   **Basic Sequence Properties:**\n    -   `Gene Length`: Total length in base pairs.\n    -   `GC Content`: Percentage of Guanine and Cytosine bases.\n    -   `GC3 Content`: GC content at the third position of codons.\n    -   `CG Dinucleotide Frequency`: The relative abundance of CG pairs.\n-   **Protein-Level Features:**\n    -   `Hydrophobicity (GRAVY)`: The Grand Average of Hydropathicity of the translated protein.\n    -   `Isoelectric Point`: The pH at which the translated protein has no net charge.\n-   **Pattern-Based Features:**\n    -   `Motif Frequencies`: Occurrence of known regulatory motifs.\n    -   `K-mer Frequencies`: A high-resolution sequence \"fingerprint\" based on the frequency of all possible 4-letter DNA substrings (e.g., 'AAGC', 'GATA').\n\n## Machine Learning Pipeline\n\nA robust pipeline ensures the model is trained correctly and its performance is reliable:\n\n1.  **Preprocessing (`VarianceThreshold`):** Automatically removes useless features that are constant across all samples.\n2.  **Feature Selection (`SelectKBest`):** Selects the top 100 most informative features using a statistical ANOVA F-test, reducing noise and complexity.\n3.  **Handling Class Imbalance (`SMOTE`):** Artificially balances the training data by creating synthetic examples of the rare \"Stress\" class, preventing the model from becoming biased.\n4.  **Hyperparameter Tuning (`GridSearchCV`):** Systematically tests different model configurations (e.g., `learning_rate`, `max_depth`) to find the optimal settings.\n5.  
**Training (`XGBoost`):** The final model is an XGBoost classifier, a powerful gradient boosting algorithm well-suited for tabular data.\n\n## How to Use This Project\n\n### Prerequisites\n- Python 3.8+\n- Jupyter Notebook or JupyterLab\n\n### Installation\n1.  Clone this repository to your local machine:\n    ```bash\n    git clone \u003cyour-repository-url\u003e\n    cd \u003crepository-directory\u003e\n    ```\n\n2.  Install the required Python libraries using pip:\n    ```bash\n    pip install biopython scikit-learn pandas xgboost matplotlib imbalanced-learn joblib\n    ```\n\n### Running the Notebook\n1.  Launch Jupyter Notebook or JupyterLab:\n    ```bash\n    jupyter notebook\n    ```\n2.  Open the main notebook file (e.g., `gene_prediction_pipeline.ipynb`).\n3.  Execute the cells in order from top to bottom. The notebook is self-contained and will automatically:\n    - Download the necessary genome data.\n    - Perform all data processing and training steps.\n    - Save the final model and all required pipeline components.\n    - Demonstrate how to load the saved artifacts and make predictions on sample data.\n\n## File Descriptions\n\n```\n.\n├── gene_prediction_pipeline.ipynb    # Main Jupyter Notebook with all the code.\n├── README.md                         # This README file.\n├── GCF_000012145.1_ASM1214v1_genomic.gbff # Genome data (downloaded by the notebook).\n└── saved_artifacts/\n    ├── stress_gene_model.joblib      # The final, trained XGBoost model.\n    ├── feature_selector.joblib       # The fitted SelectKBest object.\n    ├── variance_thresholder.joblib   # The fitted VarianceThreshold object.\n    ├── kmer_vectorizer.joblib        # The fitted CountVectorizer for k-mers.\n    └── feature_columns.joblib        # The list of feature names the model expects.\n```\n\n## Example Prediction Workflow\n\nAfter running the main notebook once, you can use the saved artifacts to predict on a new DNA sequence:\n\n1.  
**Load Artifacts:** Load the model, selector, thresholder, vectorizer, and column list using `joblib`.\n2.  **Process New Sequence:** Apply the exact same feature engineering steps to your new sequence.\n3.  **Align Features:** Use `.reindex()` to ensure the new feature vector has the same columns in the same order as the training data.\n4.  **Apply Preprocessors:** Transform the data using the loaded `variance_thresholder` and `feature_selector`.\n5.  **Predict:** Use `loaded_model.predict()` to get the final classification.\n\nAn example of this entire workflow is provided in the final section of the main Jupyter Notebook.\n\n## Future Work\n- **Expand the Dataset:** Incorporate data from other extremophiles (e.g., tardigrades, thermophiles) to create a more general and robust model.\n- **Explore More Features:** Engineer additional features, such as codon adaptation index (CAI) or predicted protein secondary structure.\n- **Try Different Models:** Experiment with other algorithms like LightGBM or deep learning models (e.g., Convolutional Neural Networks) to compare performance.\n- **Deploy as a Web App:** Package the model and prediction pipeline into a simple web application where users can paste a DNA sequence and get a prediction.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthearpankumar%2Fadaptation-gene-prediction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthearpankumar%2Fadaptation-gene-prediction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthearpankumar%2Fadaptation-gene-prediction/lists"}