https://github.com/sivkri/geneexpression-machinelearning

Supervised Machine Learning for Gene Expression Analysis
https://github.com/sivkri/geneexpression-machinelearning

geneexpression logistic-regression random-forest-classifier supervised-learning supervised-machine-learning

Last synced: 3 months ago
JSON representation

Supervised Machine Learning for Gene Expression Analysis

Host: GitHub
URL: https://github.com/sivkri/geneexpression-machinelearning
Owner: sivkri
License: apache-2.0
Created: 2025-02-26T11:53:08.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-26T12:33:24.000Z (over 1 year ago)
Last Synced: 2025-03-22T19:37:32.928Z (over 1 year ago)
Topics: geneexpression, logistic-regression, random-forest-classifier, supervised-learning, supervised-machine-learning
Language: Jupyter Notebook
Homepage:
Size: 3.72 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Supervised Learning for Gene Expression Analysis

This repository showcases the application of **Logistic Regression** and **Random Forest** for gene expression analysis. It automates data processing, model training, and evaluation.

## 🚀 Features

- **Preprocessing**: Formats gene expression data

- **Machine Learning**: Uses Logistic Regression & Random Forest

- **Evaluation**: Generates accuracy reports and confusion matrices

- **Automation**: Includes a shell script & GitHub Actions

## 📂 Repository Structure

```

📂 **ml_gene_expression_project**  

 ┣ 📂 **data/**               → Stores gene expression data  

 ┣ 📂 **results/**            → Model reports and visualizations  

 ┣ 📜 **preprocess.py**       → Data processing script  

 ┣ 📜 **train.py**            → Model training script  

 ┣ 📜 **evaluate.py**         → Model evaluation with visualization  

 ┣ 📜 **run_pipeline.sh**     → Shell script to automate pipeline execution  

 ┣ 📜 **run_pipeline.yml**    → GitHub Actions workflow  

 ┣ 📜 **app.py**              → Streamlit app for interactive visualization  

 ┣ 📜 **README.md**           → Project documentation  

```

## 🏃 Run the Pipeline

```bash

bash run_pipeline.sh

```

## 🖼️ Sample Outputs

- Model Accuracy Reports in `results/`

- Confusion Matrices saved as images

## 🤖 GitHub Actions

This repository supports **automatic execution** when new data is pushed.

## 📌 How to Use

1. Clone the repository:  

   ```bash

   git clone https://github.com/sivkri/GeneExpression-MachineLearning.git

   ```

2. Navigate to the directory:  

   ```bash

   cd GeneExpression-MachineLearning

   ```

3. Run the pipeline:  

   ```bash

   bash run_pipeline.sh

   ```

## To launch the Streamlit app for interactive visualization:

   ```bash

   streamlit run app.py

   ```

### 🔥 Key Enhancements:  

✔ **Streamlit app is properly emphasized**  

✔ **Instructions for running the app are added**  

✔ **Clear explanation of app features**  

This version **sells** your **ML + Streamlit** project effectively. Let me know if you need further refinements! 🚀

## Results & Findings

- Identified top genes differentiating wild-type (WT) vs knockout (KO) conditions

- Evaluated the impact of Eltrombopag (E20) treatment

- Achieved 75% accuracy with the Random Forest classifier

# Project Overview  

This project applies **machine learning** techniques to analyze **gene expression data** under different experimental conditions. Using **logistic regression** and **random forest classifiers**, we identify genes differentially expressed due to **HuR knockout (ELAVL1 deletion)** and **Eltrombopag (E20) drug treatment**.

Additionally, a **Streamlit web application** is integrated to provide an **interactive visualization** of the results.  

## Citation  

If you use this dataset or findings, please cite the following study:  

📖 **DOI:** [10.1186/s12915-025-02131-z](https://doi.org/10.1186/s12915-025-02131-z)

---

## **Experimental Design**  

This study investigates how **HuR knockout (KO)** and **Eltrombopag (E20) treatment** influence gene expression compared to wild-type (WT) and mock treatment (DMSO).

### **Sample Groups**  

The dataset consists of the following experimental conditions:

| Sample Group | Description |

|-------------|-------------|

| **WT-DMSO** | Wild-type (WT) cells treated with mock (DMSO) |

| **WT-E20**  | Wild-type (WT) cells treated with Eltrombopag (E20) |

| **KO-DMSO** | HuR knockout (KO) cells treated with mock (DMSO) |

| **KO-E20**  | HuR knockout (KO) cells treated with Eltrombopag (E20) |

Each sample contains gene expression data across thousands of genes. **HuR (ELAVL1)** is a key **RNA-binding protein**, and its knockout may significantly alter gene expression. **Eltrombopag** is a thrombopoietin receptor agonist that may influence transcriptional programs.

---

## **Comparisons & Research Questions**  

I have performed **three key comparisons** using **supervised learning** to classify gene expression profiles.

### **1️⃣ Effect of HuR Knockout (KO vs. WT)**

- **Comparison:** **WT-DMSO vs. KO-DMSO**  

- **Objective:** Identify genes affected by HuR deletion.  

- **Machine Learning Approach:**  

  - Features: Gene expression levels  

  - Labels: WT-DMSO (class 0) vs. KO-DMSO (class 1)  

### **2️⃣ Effect of Eltrombopag in Wild-Type Cells**

- **Comparison:** **WT-DMSO vs. WT-E20**  

- **Objective:** Determine gene expression changes due to Eltrombopag in normal cells.  

- **Machine Learning Approach:**  

  - Features: Gene expression levels  

  - Labels: WT-DMSO (class 0) vs. WT-E20 (class 1)  

### **3️⃣ Effect of Eltrombopag in HuR Knockout Cells**

- **Comparison:** **KO-DMSO vs. KO-E20**  

- **Objective:** Understand the **HuR-dependent** response to Eltrombopag.  

- **Machine Learning Approach:**  

  - Features: Gene expression levels  

  - Labels: KO-DMSO (class 0) vs. KO-E20 (class 1)  

---

## **Data Processing & Machine Learning Workflow**  

1. **Preprocessing:**  

   - Normalize expression data  

   - Convert into a machine-learning-ready format  

2. **Model Training & Feature Selection:**  

   - Train **logistic regression** and **random forest** classifiers  

   - Perform **Principal Component Analysis (PCA)**  

3. **Evaluation:**  

   - Compute **accuracy, confusion matrices, classification reports**  

   - Identify **top differentially expressed genes**  

4. **Visualization & Reporting:**  

   - Generate **PCA scatter plots**  

   - Save **model performance metrics**  

---

## 📧 Contact

For queries, feel free to reach out! 🚀

---

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/sivkri/geneexpression-machinelearning

Awesome Lists containing this project

README