https://github.com/andreeo/model-smoking-dna-methylation
Predicting the influence of smoking on DNA methylation at different CpG islands
csv data-mining gridsearchcv machine-learning metrics model pandas python random-forest-classifier sklearn svc
- Host: GitHub
- URL: https://github.com/andreeo/model-smoking-dna-methylation
- Owner: andreeo
- License: mit
- Created: 2024-09-14T22:41:27.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-09-14T23:53:32.000Z (5 months ago)
- Last Synced: 2024-11-21T17:06:39.662Z (2 months ago)
- Topics: csv, data-mining, gridsearchcv, machine-learning, metrics, model, pandas, python, random-forest-classifier, sklearn, svc
- Language: Jupyter Notebook
- Homepage:
- Size: 73.2 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# model-smoking-dna-methylation
This repository contains the code and data for predicting the influence of smoking on DNA methylation at different CpG islands. The
code is written in Python and uses the sklearn library for machine learning. The data is stored in CSV format and is available in
the same repository.

## Definition of the problem and initial preparation
- Question: How does smoking influence DNA methylation at different CpG islands?
- Data selection and preparation:
  - GSM, smoking status, gender, age, and DNA methylation data
- Target variable:
  - Smoking status
- Evaluation metrics:
  - Accuracy
  - Precision
  - Recall
  - F1 score

## Preparation of the data
The dataset includes smoking status, gender, age, and methylation values at specific CpG islands. The
following steps were performed for data preparation:

1. Removal of the 'GSM' column, as it was not relevant for the analysis.
2. Normalization of the values in the 'Gender' column to make them uniform.
3. Imputation of missing values in the methylation columns.
4. Encoding of categorical variables using LabelEncoder.
5. Normalization of methylation data using StandardScaler.

## Feature selection and reduction of attributes
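The selectors in this section work on the prepared table. As a rough illustration of the five preparation steps listed above, here is a minimal sketch on synthetic data. The column names (`Smoking Status`, `Gender`, `Age`, `cg0001`, `cg0002`), the mean imputation strategy, and the first-letter rule for making gender labels uniform are assumptions made for the example, not details taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny synthetic table; every column name and value here is hypothetical.
raw = pd.DataFrame({
    "GSM": ["GSM1", "GSM2", "GSM3", "GSM4"],
    "Smoking Status": ["current", "never", "never", "current"],
    "Gender": [" Male", "female", "F", "m"],
    "Age": [54, 61, 47, 58],
    "cg0001": [0.81, np.nan, 0.42, 0.77],
    "cg0002": [0.10, 0.15, np.nan, 0.12],
})


def prepare(df):
    """Apply the five preparation steps and return a new DataFrame."""
    # 1. Drop the sample identifier (drop returns a copy, so `raw` is untouched).
    df = df.drop(columns=["GSM"])
    # 2. Make gender labels uniform: trim, lowercase, keep the first letter.
    df["Gender"] = df["Gender"].str.strip().str.lower().str[0]
    cpg_cols = [c for c in df.columns if c.startswith("cg")]
    # 3. Impute missing methylation values (mean imputation is an assumption).
    df[cpg_cols] = SimpleImputer(strategy="mean").fit_transform(df[cpg_cols])
    # 4. Encode the categorical columns as integers.
    for col in ["Gender", "Smoking Status"]:
        df[col] = LabelEncoder().fit_transform(df[col])
    # 5. Standardize methylation values to zero mean and unit variance.
    df[cpg_cols] = StandardScaler().fit_transform(df[cpg_cols])
    return df


prepared = prepare(raw)
print(prepared)
```

In a stricter workflow the imputer and scaler would usually be fitted on the training split only and then applied to the test split, so that no test-set statistics leak into training.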
To improve performance and reduce the dimensionality of the dataset, attribute selection and reduction were applied:
1. VarianceThreshold: Features with zero variance were removed.
2. SelectKBest: The top 10 features were selected.
3. PCA: PCA was used to reduce the dimensionality to 10 principal components.

## Training and Evaluation of the Base Model
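Training consumes the reduced feature matrix. A minimal sketch of the three reduction steps listed above, on random data, might look like this. The `f_classif` scoring function (scikit-learn's default for `SelectKBest`) and the choice to run PCA on the variance-filtered matrix rather than on the `SelectKBest` output are assumptions; the README does not say how the steps were chained.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic stand-in for the methylation matrix: 200 samples, 40 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
X[:, 5] = 0.3                             # a constant (zero-variance) column
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary label

# 1. Remove features whose variance is exactly zero.
X_nonconstant = VarianceThreshold(threshold=0.0).fit_transform(X)

# 2. Keep the 10 features with the highest univariate score against the label.
X_top10 = SelectKBest(score_func=f_classif, k=10).fit_transform(X_nonconstant, y)

# 3. Project the variance-filtered features onto 10 principal components.
X_pca = PCA(n_components=10).fit_transform(X_nonconstant)

print(X.shape, X_nonconstant.shape, X_top10.shape, X_pca.shape)
```

Note that `SelectKBest` is supervised (it uses `y`), while `VarianceThreshold` and `PCA` look only at the features.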
The data was split into 80% for training and 20% for testing, so that the model is evaluated on data it was not trained on.

## Optimization of the model
The model was optimized using GridSearchCV to find the best hyperparameters for the model.
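
Putting the 80/20 split and the hyperparameter search together, a hedged sketch could look like the following. The synthetic data, the stratified split, the parameter grid, the 5-fold cross-validation, and the `f1_weighted` scoring are all illustrative assumptions; the actual grids used in the notebook are not given in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the reduced feature matrix (10 features).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# 80% of the rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Base model: logistic regression with default hyperparameters.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Exhaustive search over a small, hypothetical grid using 5-fold CV
# on the training split only.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="f1_weighted",
)
search.fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("best params:", search.best_params_)
# best_estimator_.score reports accuracy; search.score would use f1_weighted.
print("tuned accuracy:", search.best_estimator_.score(X_test, y_test))
```

Because the search only ever sees `X_train`, the held-out 20% remains an untouched estimate of how the tuned model behaves on new data.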
## Models used and evaluation
The following models were used for the prediction:
- Logistic Regression
- Random Forest
- SVM

The results obtained were the following:
- Logistic Regression (Base Model):
  - Accuracy: 0.74
  - Precision: 0.75
  - Recall: 0.74
  - F1 score: 0.66
- Random Forest (Advanced Model):
  - Accuracy: 0.74
  - Precision: 0.71
  - Recall: 0.74
  - F1 score: 0.70
- SVM (Advanced Model):
  - Accuracy: 0.73
  - Precision: 0.70
  - Recall: 0.73
  - F1 score: 0.66

### Accuracy
All three models show similar results, with an accuracy of around 74%. This indicates that the models perform comparably at
classifying smokers and non-smokers based on DNA methylation.

### Precision and F1-score
Although the models have nearly the same accuracy, they differ on these metrics. The Random Forest model has the same accuracy as
Logistic Regression (0.74) but a higher F1 score (0.70 versus 0.66), which suggests a better balance between precision and recall.

### SVM
This model has a slightly lower accuracy (0.73) than the other two. Its precision and recall are also the lowest of the three, and its F1 score (0.66) only matches that of Logistic Regression, which indicates that, in this case, SVM is not the best model for this dataset.
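
For reference, the four metrics reported in this section can be computed with scikit-learn as sketched below. The labels are made up for illustration, and `average="weighted"` is an assumption: the README does not state which averaging was used. It is worth noting that with weighted averaging, recall is mathematically equal to accuracy, which would be consistent with (but does not prove) the identical accuracy and recall values reported for each model.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels: 0 = non-smoker, 1 = smoker.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

accuracy = accuracy_score(y_true, y_pred)

# Per-class precision, recall and F1, averaged with class-frequency weights.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

print(f"accuracy : {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall   : {recall:.2f}")
print(f"f1 score : {f1:.2f}")
```

On an imbalanced dataset, the weighted F1 score can sit well below accuracy when the minority class is predicted poorly, so comparing models on F1 as well as accuracy, as done above, is a reasonable choice.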
## Conclusion
DNA methylation at specific CpG islands can be used to predict smoking status with reasonable accuracy using different
models.
The Random Forest model showed the best balance between the evaluation metrics, making it (slightly) the best model. Optimization techniques and several models were used to obtain the best possible results. It should also be noted that these
models have a lot of room for improvement.

## Future work
- Use more advanced models for prediction.
- Use more data for training.
- Use more features for prediction.

## Developed by @andreeo