Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/andreeo/model-smoking-dna-methylation

Predicting the influence of smoking on DNA methylation at different CpG islands

csv data-mining gridsearchcv machine-learning metrics model pandas python random-forest-classifier sklearn svc

Host: GitHub
URL: https://github.com/andreeo/model-smoking-dna-methylation
Owner: andreeo
License: mit
Created: 2024-09-14T22:41:27.000Z (5 months ago)
Default Branch: main
Last Pushed: 2024-09-14T23:53:32.000Z (5 months ago)
Last Synced: 2024-11-21T17:06:39.662Z (2 months ago)
Topics: csv, data-mining, gridsearchcv, machine-learning, metrics, model, pandas, python, random-forest-classifier, sklearn, svc
Language: Jupyter Notebook
Homepage:
Size: 73.2 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# model-smoking-dna-methylation

This repository contains the code and data for predicting the influence of smoking on DNA methylation at different CpG islands. The
code is written in Python and uses the scikit-learn (sklearn) library for machine learning. The data is stored in CSV format and is
available in the same repository.

## Definition of the problem and initial preparation

- Question: How does smoking influence DNA methylation at different CpG islands?
- Data selection and preparation:
  - GSM, smoking status, gender, age, and DNA methylation data
- Target variable:
  - Smoking status
- Evaluation metrics:
  - Accuracy
  - Precision
  - Recall
  - F1 score
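
The four metrics listed above can be computed with scikit-learn's `metrics` module. The following is a minimal sketch on made-up labels, not data from this repository. Weighted averaging is an assumption; the averaging mode used in the notebook is not stated in this README.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels for illustration only (1 = smoker, 0 = non-smoker).
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)

# average="weighted" is an assumption: per-class scores weighted by class support.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With weighted averaging, recall is always equal to accuracy, so the two numbers carry the same information in that setting.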

## Preparation of the data

The dataset includes smoking status, gender, age, and methylation values at specific CpG islands. The
following steps were performed for data preparation:

1. Removal of the 'GSM' column as it was not relevant for the analysis.
2. Normalization of the values in the 'Gender' column to make them uniform.
3. Imputation of missing values in the methylation columns.
4. Encoding of categorical variables using LabelEncoder.
5. Normalization of methylation data using StandardScaler.
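
The five steps above could look roughly like the sketch below. It runs on a toy DataFrame: the column names, the values, and the choice of mean imputation are all assumptions made for illustration, not taken from the repository's CSV or notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy data; column names and values are hypothetical stand-ins for the CSV.
df = pd.DataFrame({
    "GSM": ["GSM1", "GSM2", "GSM3", "GSM4"],
    "Smoking Status": ["current", "never", "never", "current"],
    "Gender": [" F", "f", "M", "m "],
    "Age": [34, 51, 47, 29],
    "cg0001": [0.61, np.nan, 0.58, 0.70],
    "cg0002": [0.12, 0.15, np.nan, 0.09],
})
cpg_cols = ["cg0001", "cg0002"]

# 1. Drop the sample identifier column.
df = df.drop(columns=["GSM"])

# 2. Make the gender labels uniform (strip whitespace, lower-case).
df["Gender"] = df["Gender"].str.strip().str.lower()

# 3. Impute missing methylation values (mean imputation is an assumption).
df[cpg_cols] = SimpleImputer(strategy="mean").fit_transform(df[cpg_cols])

# 4. Encode categorical columns as integers.
for col in ["Gender", "Smoking Status"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# 5. Standardize methylation columns to zero mean and unit variance.
df[cpg_cols] = StandardScaler().fit_transform(df[cpg_cols])

print(df)
```

In a train/test workflow, the imputer and scaler would normally be fit on the training split only, so that no information from the test data reaches the model.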

## Feature selection and reduction of attributes

To improve performance and reduce the dimensionality of the dataset, feature selection and dimensionality reduction were applied:

1. VarianceThreshold: Features with zero variance were removed.
2. SelectKBest: The top 10 features were selected.
3. PCA: Principal component analysis was used to reduce the dimensionality to 10 principal components.
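
A minimal sketch of these three techniques on synthetic data follows. The README does not say whether SelectKBest and PCA were chained or tried as alternatives, so the sketch treats them as two separate variants. The `f_classif` scoring function and the synthetic matrix are likewise assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the methylation matrix (200 samples, 50 features).
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=15, random_state=0
)
X[:, 0] = 1.0  # a constant column, which VarianceThreshold should drop

# Variant A: remove zero-variance features, then keep the 10 best-scoring ones.
selector = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),
    ("kbest", SelectKBest(score_func=f_classif, k=10)),
])
X_selected = selector.fit_transform(X, y)

# Variant B: remove zero-variance features, then project onto 10 principal components.
reducer = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),
    ("pca", PCA(n_components=10, random_state=0)),
])
X_reduced = reducer.fit_transform(X)

print(X_selected.shape, X_reduced.shape)
```

SelectKBest keeps 10 of the original columns, which stay interpretable as individual CpG sites, while PCA produces 10 new components that are linear combinations of all remaining features.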

## Training and Evaluation of the Base Model

The data was split into 80% for training and 20% for testing, so that the model is evaluated on data it was not trained on.
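
The split and the Logistic Regression base model might be set up as in the sketch below. The synthetic data, `stratify=y`, `random_state`, and `max_iter` are assumptions for illustration, not settings confirmed by the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and smoking-status labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 80% training, 20% testing; stratifying keeps class proportions similar (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the base model on the training split only.
base_model = LogisticRegression(max_iter=1000)
base_model.fit(X_train, y_train)

# Evaluate on the held-out 20%.
test_accuracy = accuracy_score(y_test, base_model.predict(X_test))
print(X_train.shape, X_test.shape, round(test_accuracy, 2))
```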

## Optimization of the model

The model was optimized using GridSearchCV to find the best hyperparameters.
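
As a rough illustration, a grid search over a Random Forest could look like this. The parameter grid, the `f1_weighted` scoring, the 5-fold cross-validation, and the synthetic data are all hypothetical; the values searched in the notebook are not listed in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid: every combination is evaluated with cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                   # 5-fold cross-validation on the training split
    scoring="f1_weighted",  # assumption: select by weighted F1
)
search.fit(X_train, y_train)

# The best estimator is refit on the full training split by default.
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 2))
```

The search only sees the training split, so the test split remains available for a final, unbiased evaluation of `best_model`.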

## Models used and evaluation

The following models were used for the prediction:

- Logistic Regression
- Random Forest
- SVM

The results obtained were the following:

- Logistic Regression (Base Model):
  - Accuracy: 0.74
  - Precision: 0.75
  - Recall: 0.74
  - F1 score: 0.66

- Random Forest (Advanced Model):
  - Accuracy: 0.74
  - Precision: 0.71
  - Recall: 0.74
  - F1 score: 0.70

- SVM (Advanced Model):
  - Accuracy: 0.73
  - Precision: 0.70
  - Recall: 0.73
  - F1 score: 0.66

### Accuracy

All three models show similar results, with an accuracy of around 74%. This indicates that the models perform comparably at
correctly classifying smokers and non-smokers based on DNA methylation.

### Precision and F1-score

Although the models have nearly the same accuracy, they differ in these metrics. Logistic Regression has the highest precision
(0.75), while Random Forest, with the same accuracy as Logistic Regression, has the best F1 score (0.70 versus 0.66), which suggests
a better balance between precision and recall.

### SVM

This model has slightly lower accuracy (0.73) than the other two. Its precision and recall are also the lowest, and its F1 score (0.66) only matches that of Logistic Regression, which suggests that, in this case, SVM is not the best model for this dataset.

## Conclusion

The results show that DNA methylation at specific CpG islands can be used to predict smoking status with reasonable accuracy using
different models.
Random Forest showed the best balance between the evaluation metrics, making it (slightly) the best model.

Optimization techniques and several models were used to obtain the best possible results. These models still have considerable room
for improvement.

## Future work

- Use more advanced models for prediction.
- Use more data for training.
- Use more features for prediction.

## Developed by @andreeo