https://github.com/marta-barea/tic-hs-gc-ims-waters-ml
Machine learning-based analytical approach for processing TIC data from contaminated water samples with PDPs, obtained through HS-GC-IMS
https://github.com/marta-barea/tic-hs-gc-ims-waters-ml
analytical-chemistry chemistry chemoinformatics chemometrics chromatography headspace ims machine-learning machine-learning-algorithms petroleum waters
Last synced: 4 months ago
JSON representation
Machine learning-based analytical approach for processing TIC data from contaminated water samples with PDPs, obtained through HS-GC-IMS
- Host: GitHub
- URL: https://github.com/marta-barea/tic-hs-gc-ims-waters-ml
- Owner: Marta-Barea
- License: gpl-3.0
- Created: 2025-02-12T15:23:47.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-02-12T16:15:48.000Z (4 months ago)
- Last Synced: 2025-02-12T16:32:38.207Z (4 months ago)
- Topics: analytical-chemistry, chemistry, chemoinformatics, chemometrics, chromatography, headspace, ims, machine-learning, machine-learning-algorithms, petroleum, waters
- Language: R
- Homepage:
- Size: 2.32 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Discrimination of PDPs in Water Samples: A Machine Learning Approach Using TIC Data (HS-GC-IMS)
## ๐ Overview
This repository contains the code and workflows for the analysis of hydrocarbon contamination in water samples using Headspace Gas Chromatography-Ion Mobility Spectrometry (HS-GC-IMS) in combination with machine learning techniques. The project aims to develop classification models for detecting and differentiating hydrocarbons in complex aqueous matrices.
---
## ๐ Project Structure
The repository is structured as follows:
```
โโโ figures/ # Generated figures from data analysis
โโโ scripts/ # Contains scripts for data analysis
โ โโโ eda # Exploratory Data Analysis (EDA)
โ โโโ missing_values_hsgcims_waters.R
โ โโโ outliers_hsgcims_waters.R
โ โโโ tic # Total Ion Chromatogram (TIC) ploting
โ โโโ tic_hsgcims_waters.R
โ โโโ unsupervised_algorithms # Unsupervised Machine Learning (HCA, PCA)
โ โโโ hca_hsgcims_waters.R
โ โโโ pca_hsgcims_waters.R
โ โโโ supervised_algorithms # Supervised Machine Learning (SVM, RF)
โ โโโ svm_hsgcims_waters.R
โ โโโ rf_hsgcims_waters.R
โโโ requirements.txt # Required R packages
โโโ README.md # Project documentation
โโโ .gitignore # Ignore file
โโโ LICENSE # License file
```---
## ๐ Workflow
The data analysis workflow follows these main steps:
1. Exploratory Data Analysis (EDA)
- Detection of missing values and outliers
- Visualization of Total Ion Chromatogram (TIC) profiles based on contamination classes
- Exploratory assessment of the dataset using Hierarchical Clustering Analysis (HCA) and Principal Component Analysis (PCA)
2. Supervised Machine Learning
Random Forest (RF) and Support Vector Machine (SVM) models are developed for classification. Three feature selection strategies are applied:
- Using all predictors
- Selection via the Boruta algorithm
- Selection via Genetic Algorithm (GA)Standard machine learning workflow applied:
- Splitting the dataset into training (70%) and test (30%) sets.
- Feature extraction is performed exclusively on the training set.
- Data is scaled and centered after splitting and feature extraction.
- Class balancing strategies such as undersampling and SMOTE are applied to the training set.
- Model training includes hyperparameter tuning and cross-validation.
- Evaluation is conducted on the test set to assess final performance.---
## ๐ฅ๏ธ Software and Dependencies
The analysis is conducted in R (v4.4.0) within Visual Studio Code. The required R packages are specified in *requirements.txt*, and include:
- **Data handling**: readxl (v1.4.3), writexl (v1.5.1), dplyr (v1.1.4), tidyr (v1.3.1), data.table (v1.16.2)
- **Visualization**: ggplot2 (v3.5.1), ggcorrplot (v0.1.4.1), gridExtra (v2.3), viridis (v0.6.5), ggrepel (v0.9.6), reshape2 (v1.4.4)
- **Clustering & Dimensionality Reduction**: stats, factoextra (v1.0.7), cluster (v2.1.6)
- **Feature Selection**: Boruta (v8.0.0), GA (v3.2.4), recipes (v1.1.0), themis (v1.0.2)
- **Machine Learning**: caret (v6.0.94), doParallel (v1.0.17)---
## ๐ How to Use the Repository
1. Clone the repository
```bash
git clone https://github.com/Marta-Barea/tic-hs-gc-ims-waters-ml.git
cd tic-hs-gc-ims-waters-ml
```2. Set up the R environment
Ensure you have R (v4.4.0 or later) installed. Install the required packages using the requirements.txt file.
___
## ๐ค Collaboration
This project is conducted in collaboration with the Department of Analytical Chemistry, University of Cรกdiz (UCA).
---
## ๐ License
This project is licensed under the GNU GENERAL PUBLIC License. See `LICENSE` for details.