https://github.com/marta-barea/tic-hs-gc-ims-waters-ml

Machine learning-based analytical approach for processing TIC data from contaminated water samples with PDPs, obtained through HS-GC-IMS

analytical-chemistry chemistry chemoinformatics chemometrics chromatography headspace ims machine-learning machine-learning-algorithms petroleum waters


# Discrimination of PDPs in Water Samples: A Machine Learning Approach Using TIC Data (HS-GC-IMS)

## 📌 Overview

This repository contains the code and workflows for the analysis of hydrocarbon contamination in water samples using Headspace Gas Chromatography-Ion Mobility Spectrometry (HS-GC-IMS) in combination with machine learning techniques. The project aims to develop classification models for detecting and differentiating hydrocarbons in complex aqueous matrices.

---

## 📂 Project Structure

The repository is structured as follows:

```
├── figures/                        # Generated figures from data analysis
├── scripts/                        # Scripts for data analysis
│   ├── eda                         # Exploratory Data Analysis (EDA)
│   │   ├── missing_values_hsgcims_waters.R
│   │   └── outliers_hsgcims_waters.R
│   ├── tic                         # Total Ion Chromatogram (TIC) plotting
│   │   └── tic_hsgcims_waters.R
│   ├── unsupervised_algorithms     # Unsupervised Machine Learning (HCA, PCA)
│   │   ├── hca_hsgcims_waters.R
│   │   └── pca_hsgcims_waters.R
│   └── supervised_algorithms       # Supervised Machine Learning (SVM, RF)
│       ├── svm_hsgcims_waters.R
│       └── rf_hsgcims_waters.R
├── requirements.txt                # Required R packages
├── README.md                       # Project documentation
├── .gitignore                      # Ignore file
└── LICENSE                         # License file
```

---

## 🔄 Workflow

The data analysis workflow follows these main steps:

1. Exploratory Data Analysis (EDA)

- Detection of missing values and outliers

- Visualization of Total Ion Chromatogram (TIC) profiles based on contamination classes

- Exploratory assessment of the dataset using Hierarchical Clustering Analysis (HCA) and Principal Component Analysis (PCA)
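The exploratory steps above can be sketched in base R as follows. This is a minimal illustration, not the repository's scripts: the simulated `tic` matrix, its dimensions, the class labels, and the IQR rule used to flag outliers are all assumptions made for the example.

```r
# Simulated stand-in for a TIC matrix: 30 samples x 50 retention-time points.
# The real data layout is not documented here, so this shape is an assumption.
set.seed(42)
tic <- matrix(rnorm(30 * 50, mean = 100, sd = 10), nrow = 30,
              dimnames = list(paste0("S", 1:30), paste0("rt_", 1:50)))
classes <- factor(rep(c("A", "B", "C"), each = 10))  # hypothetical class labels
tic[classes == "B", 1:10] <- tic[classes == "B", 1:10] + 40
tic[3, 7] <- NA  # inject one missing value for demonstration

# 1. Missing values: count NAs per sample and keep only complete samples
na_per_sample <- rowSums(is.na(tic))
keep <- na_per_sample == 0
tic_complete <- tic[keep, , drop = FALSE]
classes_complete <- classes[keep]

# 2. Outliers: flag samples whose total signal falls outside 1.5 * IQR
#    (one simple rule among many; chosen here only for illustration)
total_signal <- rowSums(tic_complete)
q <- unname(quantile(total_signal, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
outlier <- total_signal < q[1] - fence | total_signal > q[2] + fence

# 3. PCA on centered and scaled data; proportion of variance per component
pca <- prcomp(tic_complete, center = TRUE, scale. = TRUE)
explained <- pca$sdev^2 / sum(pca$sdev^2)

# 4. HCA with Euclidean distance and Ward linkage, cut into three clusters
hca <- hclust(dist(scale(tic_complete)), method = "ward.D2")
clusters <- cutree(hca, k = 3)

# Cross-tabulate cluster membership against the (hypothetical) classes
print(table(clusters, classes_complete))
```

Plotting `pca$x[, 1:2]` colored by class, or calling `plot(hca)`, gives the usual score plot and dendrogram views of the same objects.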

2. Supervised Machine Learning

Random Forest (RF) and Support Vector Machine (SVM) models are developed for classification. Three feature selection strategies are applied:

- Using all predictors
- Selection via the Boruta algorithm
- Selection via Genetic Algorithm (GA)

The following standard machine learning workflow is applied:

- The dataset is split into training (70%) and test (30%) sets.
- Feature selection is performed exclusively on the training set.
- Data are centered and scaled after splitting and feature selection.
- Class balancing strategies, such as undersampling and SMOTE, are applied to the training set.
- Models are trained with hyperparameter tuning and cross-validation.
- Final performance is evaluated on the test set.
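To illustrate why the split comes before any preprocessing, here is a minimal base R sketch of a stratified 70/30 split followed by centering and scaling with parameters estimated from the training set only. The data, class names, and variable names are hypothetical, and feature selection, class balancing, and model tuning are left out. The caret package, which the repository lists as a dependency, provides `createDataPartition()` and `preProcess()` for the same two steps.

```r
set.seed(123)

# Hypothetical data: 60 samples, 5 predictors, 3 balanced classes
n <- 60
X <- matrix(rnorm(n * 5, mean = 50, sd = 5), nrow = n,
            dimnames = list(NULL, paste0("rt_", 1:5)))
y <- factor(rep(c("clean", "diesel", "gasoline"), each = 20))

# Stratified 70/30 split: sample 70% of the row indices within each class
train_idx <- unlist(lapply(split(seq_len(n), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

X_train <- X[train_idx, , drop = FALSE]
X_test  <- X[-train_idx, , drop = FALSE]
y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Estimate centering and scaling parameters on the training set only ...
X_train_sc <- scale(X_train)
train_center <- attr(X_train_sc, "scaled:center")
train_scale  <- attr(X_train_sc, "scaled:scale")

# ... and reuse them for the test set, so no test information leaks into training
X_test_sc <- scale(X_test, center = train_center, scale = train_scale)

print(table(y_train))
print(table(y_test))
```

Because the test set is transformed with the training means and standard deviations, its columns will generally not have exactly zero mean or unit variance, which is expected.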

---

## 🖥️ Software and Dependencies

The analysis is conducted in R (v4.4.0) within Visual Studio Code. The required R packages are specified in *requirements.txt*, and include:

- **Data handling**: readxl (v1.4.3), writexl (v1.5.1), dplyr (v1.1.4), tidyr (v1.3.1), data.table (v1.16.2)
- **Visualization**: ggplot2 (v3.5.1), ggcorrplot (v0.1.4.1), gridExtra (v2.3), viridis (v0.6.5), ggrepel (v0.9.6), reshape2 (v1.4.4)
- **Clustering & Dimensionality Reduction**: stats, factoextra (v1.0.7), cluster (v2.1.6)
- **Feature Selection**: Boruta (v8.0.0), GA (v3.2.4), recipes (v1.1.0), themis (v1.0.2)
- **Machine Learning**: caret (v6.0.94), doParallel (v1.0.17)

---

## 🚀 How to Use the Repository

1. Clone the repository

```bash
git clone https://github.com/Marta-Barea/tic-hs-gc-ims-waters-ml.git
cd tic-hs-gc-ims-waters-ml
```

2. Set up the R environment

Ensure you have R (v4.4.0 or later) installed. Install the required packages using the requirements.txt file.
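R has no built-in command that reads a `requirements.txt` file, so one possible approach is a small helper like the sketch below. It assumes the file lists one package per line, optionally followed by version information, with `#` starting a comment; the actual file format is not documented here, and `parse_requirements()` and `install_missing()` are hypothetical helper names, not part of the repository.

```r
# Extract package names from requirements-style lines.
# Assumed format: one package per line, e.g. "dplyr==1.1.4" or "caret".
parse_requirements <- function(lines) {
  lines <- trimws(sub("#.*$", "", lines))  # strip comments and whitespace
  lines <- lines[nzchar(lines)]            # drop empty lines
  # Keep only the leading package name (letters, digits, and dots)
  pkgs <- sub("^([A-Za-z][A-Za-z0-9.]*).*$", "\\1", lines)
  unique(pkgs)
}

# Install only the packages that are not already available.
# Note: this installs the current CRAN release, not the pinned versions.
install_missing <- function(pkgs) {
  missing <- setdiff(pkgs, rownames(installed.packages()))
  if (length(missing) > 0) install.packages(missing)
  invisible(missing)
}

# Example of the parsing step on in-memory lines
example_lines <- c("# data handling", "dplyr==1.1.4", "",
                   "data.table (v1.16.2)", "caret")
print(parse_requirements(example_lines))

# Usage from the repository root (not run here):
# install_missing(parse_requirements(readLines("requirements.txt")))
```

To reproduce the exact package versions listed above, a dedicated tool such as the renv package is an alternative worth considering.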

---

## 🤝 Collaboration

This project is conducted in collaboration with the Department of Analytical Chemistry, University of Cádiz (UCA).

---

## 📜 License

This project is licensed under the GNU General Public License. See `LICENSE` for details.