https://github.com/rajnandinithopte/machine-learning_time-series-analysis

This project applies feature engineering and logistic regression for time series classification, optimizing performance through feature selection and cross-validation. It explores both binary and multi-class classification using sensor data.

Host: GitHub
URL: https://github.com/rajnandinithopte/machine-learning_time-series-analysis
Owner: rajnandinithopte
Created: 2025-02-03T08:39:36.000Z (4 months ago)
Default Branch: main
Last Pushed: 2025-02-03T21:01:16.000Z (3 months ago)
Last Synced: 2025-02-17T21:23:54.623Z (3 months ago)
Language: Jupyter Notebook
Homepage:
Size: 3.8 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Machine Learning: Time Series Analysis

# 🔷 Time Series Classification and Logistic Regression

## 🔶 Overview

This project involves **time series classification** using data from the **AReM dataset**, which consists of **sensor readings from human activities**. The main tasks include **feature extraction, binary classification using logistic regression, and multi-class classification using penalized regression techniques**.

---

## **🔷 Libraries Used**

- **NumPy, Pandas** - Data manipulation and feature engineering.

- **Matplotlib, Seaborn** - Data visualization for scatter plots and distribution analysis.

- **SciPy** - Statistical analysis and bootstrap confidence interval estimation.

- **Scikit-learn** - Logistic regression, cross-validation, feature selection, and model evaluation.

---

## **🔷 Dataset Description**

- The **AReM dataset** consists of **sensor readings from seven human activities**.

- Each activity contains **multiple instances**, where each instance is a time series of **six sensor readings**:

  - **avg_rss12, var_rss12, avg_rss13, var_rss13, avg_rss23, var_rss23**

- Each time series has **480 time points** per instance.

- **Training and Test Split**:

  - **Test Set**: Instances 1-2 of each "bending" activity and instances 1-3 of every other activity.

  - **Training Set**: All remaining instances of each activity.
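
The split rule can be written as a small predicate on the activity name and the instance number. This is a minimal sketch that assumes instances are numbered from 1 within each activity and that the bending activities have names starting with `bending`; the function name `is_test_instance` is hypothetical and not taken from the notebook.

```python
def is_test_instance(activity: str, instance_number: int) -> bool:
    """Return True if the instance belongs to the test set.

    Assumption: bending activities contribute their first 2 instances,
    every other activity contributes its first 3 instances.
    """
    n_test = 2 if activity.startswith("bending") else 3
    return 1 <= instance_number <= n_test


# Everything that is not a test instance goes to the training set.
examples = [("bending1", 2), ("bending1", 3), ("walking", 3), ("walking", 4)]
split = {ex: ("test" if is_test_instance(*ex) else "train") for ex in examples}
```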

---

## **🔷 Steps Taken to Accomplish the Project**

### **🔶 1. Data Preprocessing and Feature Engineering**

- Downloaded the **AReM dataset** containing **sensor readings from seven human activities**.

- Cleaned the dataset to remove inconsistencies and missing values.

- Extracted **time-domain features** for each sensor signal, including:

  - Minimum, Maximum, Mean, Median

  - Standard Deviation, First Quartile, Third Quartile

- Constructed a **new dataset** where each row corresponds to an instance with extracted features.
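
The feature extraction step can be sketched as follows. The example assumes each instance is available as a `pandas` DataFrame with the six sensor columns listed above; the helper `extract_features`, the feature naming scheme, and the synthetic data are illustrative only and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

SENSOR_COLUMNS = ["avg_rss12", "var_rss12", "avg_rss13",
                  "var_rss13", "avg_rss23", "var_rss23"]


def extract_features(instance: pd.DataFrame) -> dict:
    """Compute the seven time-domain statistics for every sensor column."""
    features = {}
    for col in SENSOR_COLUMNS:
        values = instance[col].to_numpy(dtype=float)
        features[f"min_{col}"] = values.min()
        features[f"max_{col}"] = values.max()
        features[f"mean_{col}"] = values.mean()
        features[f"median_{col}"] = np.median(values)
        features[f"std_{col}"] = values.std(ddof=1)  # sample standard deviation
        features[f"q1_{col}"] = np.percentile(values, 25)
        features[f"q3_{col}"] = np.percentile(values, 75)
    return features


# Synthetic stand-in for one instance (480 time points, 6 sensors).
rng = np.random.default_rng(0)
instance = pd.DataFrame(rng.normal(size=(480, 6)), columns=SENSOR_COLUMNS)

row = extract_features(instance)
feature_table = pd.DataFrame([row])  # one row per instance, 6 x 7 = 42 columns
```

Calling `extract_features` once per instance and stacking the resulting rows gives the new dataset described above.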

### **🔶 2. Statistical Analysis**

- Estimated the **standard deviation** of each feature.

- Used the **bootstrap** to compute **90% confidence intervals** for the standard deviation of each feature.

- Selected the **three most important features** using domain knowledge and statistical analysis.
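
A percentile bootstrap is one common way to obtain such an interval. The sketch below uses plain NumPy; the function name `bootstrap_std_ci`, the number of resamples, and the synthetic feature column are assumptions, and the notebook may use a different bootstrap variant.

```python
import numpy as np


def bootstrap_std_ci(values, n_resamples=1000, confidence=0.90, seed=0):
    """Percentile-bootstrap confidence interval for the standard deviation."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row of `idx` is one resample (with replacement) of the data.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    resampled_stds = values[idx].std(axis=1, ddof=1)
    alpha = (1.0 - confidence) / 2.0
    lower, upper = np.quantile(resampled_stds, [alpha, 1.0 - alpha])
    return lower, upper


# Synthetic stand-in for one feature column across all instances.
feature_values = np.random.default_rng(1).normal(loc=40.0, scale=2.0, size=88)
lower, upper = bootstrap_std_ci(feature_values)
```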

### **🔶 3. Binary Classification with Logistic Regression**

- Created a **binary classification task** to distinguish **"bending" activity from others**.

- **Visualized feature distributions** using scatter plots to assess separability.

- Experimented with **different feature transformations** to improve class separation.

### **🔶 4. Experimenting with Time Series Splitting**

- Split each time series into **two equal parts** and repeated the classification process.

- Extended the experiment by splitting time series into **l ∈ {1,2,…,20}** sub-series.

- Used **logistic regression** to classify bending vs. non-bending activities for each split.

- Evaluated different feature selection methods:

  - **P-values from logistic regression coefficients**

  - **Recursive Feature Elimination (RFE)**

  - **Backward feature selection**
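
Splitting a series into l consecutive parts and summarizing each part multiplies the number of features by l. The sketch below assumes each instance is a NumPy array of shape (time points, sensors); the helper `split_and_summarize` is hypothetical.

```python
import numpy as np


def split_and_summarize(series: np.ndarray, n_splits: int) -> np.ndarray:
    """Split a (time, sensors) array into consecutive segments and
    concatenate the seven summary statistics of every segment."""
    segments = np.array_split(series, n_splits, axis=0)
    parts = []
    for seg in segments:
        parts.extend([
            seg.min(axis=0),
            seg.max(axis=0),
            seg.mean(axis=0),
            np.median(seg, axis=0),
            seg.std(axis=0, ddof=1),
            np.percentile(seg, 25, axis=0),
            np.percentile(seg, 75, axis=0),
        ])
    return np.concatenate(parts)


# Synthetic stand-in for one instance: 480 time points, 6 sensors.
series = np.random.default_rng(0).normal(size=(480, 6))
features_l1 = split_and_summarize(series, 1)  # 42 features
features_l2 = split_and_summarize(series, 2)  # 84 features
```

With l = 20 this yields 840 features per instance, which is why feature selection matters for the larger splits.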

  

### **🔶 5. Model Selection and Cross-Validation**

- Applied **5-fold cross-validation** to optimize the parameters **(l, p)**:

  - **l** = number of time series splits

  - **p** = number of selected features

- Used **stratified cross-validation** to handle potential **class imbalances**.
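
A joint search over (l, p) might look like the sketch below. It runs on small synthetic data with only mean and standard deviation features to keep it short; the data, the grid, and the helper `segment_features` are assumptions. Feature selection (here RFE) sits inside the pipeline so that it is refit on each training fold and does not see the held-out fold.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def segment_features(series, n_splits):
    """Mean and standard deviation of every sensor in every segment."""
    segments = np.array_split(series, n_splits, axis=0)
    return np.concatenate([np.r_[s.mean(axis=0), s.std(axis=0)] for s in segments])


# Synthetic, imbalanced stand-in for bending (1) vs. other activities (0).
rng = np.random.default_rng(0)
labels = np.array([1] * 15 + [0] * 45)
raw = [rng.normal(loc=2.0 * y, size=(120, 3)) for y in labels]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for l in (1, 2, 3):
    X = np.vstack([segment_features(s, l) for s in raw])
    for p in (1, 2, 3):
        model = make_pipeline(
            StandardScaler(),
            RFE(LogisticRegression(max_iter=1000), n_features_to_select=p),
            LogisticRegression(max_iter=1000),
        )
        results[(l, p)] = cross_val_score(model, X, labels, cv=cv).mean()

best_l, best_p = max(results, key=results.get)
```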

### **🔶 6. Evaluation Metrics**

- Reported:

  - **Confusion Matrix**

  - **ROC Curve and AUC Score**

  - **Optimal logistic regression parameters (βi’s)**

  - **Feature importance and statistical significance**

- Compared **test accuracy** against **cross-validation performance**.
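
The reported quantities can be computed with scikit-learn as sketched below on synthetic data. Note that `LogisticRegression` applies an L2 penalty by default and does not report p-values, so significance testing would need a separate tool; the data and variable names here are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem with three features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (70, 3)), rng.normal(1.5, 1.0, (30, 3))])
y = np.array([0] * 70 + [1] * 30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # estimated P(class 1)

cm = confusion_matrix(y_test, clf.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, scores)  # points of the ROC curve
auc = roc_auc_score(y_test, scores)
betas = np.r_[clf.intercept_, clf.coef_.ravel()]  # beta_0, beta_1, ..., beta_3
test_accuracy = clf.score(X_test, y_test)
```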

### **🔶 7. Handling Class Imbalance**

- Checked whether the classes are well separated, since (near-)perfect separation makes the logistic regression coefficient estimates **unstable**.

- If imbalanced classes were found:

  - **Implemented case-control sampling** to balance class representation.

  - Adjusted parameters accordingly and **re-evaluated model performance**.
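
Case-control sampling keeps every case (the rare class) and draws a subset of controls. A model fitted on the resampled data has a biased intercept, which can be corrected with the standard log-odds adjustment. The sketch below assumes class 1 is the rare class; both function names are hypothetical.

```python
import numpy as np


def case_control_sample(X, y, controls_per_case=1, seed=0):
    """Keep all cases (y == 1) and a random subset of controls (y == 0)."""
    rng = np.random.default_rng(seed)
    case_idx = np.flatnonzero(y == 1)
    control_idx = np.flatnonzero(y == 0)
    n_controls = min(len(control_idx), controls_per_case * len(case_idx))
    chosen = rng.choice(control_idx, size=n_controls, replace=False)
    keep = np.sort(np.concatenate([case_idx, chosen]))
    return X[keep], y[keep]


def corrected_intercept(beta0_sampled, true_case_rate, sampled_case_rate):
    """beta0* = beta0 + logit(true case rate) - logit(sampled case rate)."""
    def logit(q):
        return np.log(q / (1.0 - q))
    return beta0_sampled + logit(true_case_rate) - logit(sampled_case_rate)


# Synthetic imbalanced data: 10 cases, 90 controls.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([1] * 10 + [0] * 90)
X_bal, y_bal = case_control_sample(X, y)
beta0 = corrected_intercept(0.0, true_case_rate=y.mean(), sampled_case_rate=y_bal.mean())
```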

### **🔶 8. L1-Penalized Logistic Regression**

- Compared **feature selection using p-values** vs. **L1-regularization (LASSO)**.

- Performed **cross-validation for both l (time series splits) and λ (L1 penalty)**.

- Compared **L1-penalized logistic regression** with traditional feature selection methods.
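
In scikit-learn the L1 penalty strength is expressed through `C`, the inverse of λ, and `LogisticRegressionCV` can choose it by cross-validation. The sketch below uses synthetic data in which only the first two features carry signal; the grid size and solver choice are assumptions. The search over l would wrap this in an outer loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only the first two are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = (X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=120) > 0).astype(int)

model = make_pipeline(
    StandardScaler(),  # L1 penalties are sensitive to feature scale
    LogisticRegressionCV(
        Cs=10,  # 10 candidate values of C = 1 / lambda
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
        penalty="l1",
        solver="liblinear",
        max_iter=1000,
    ),
)
model.fit(X, y)

logreg = model[-1]
best_C = logreg.C_[0]
coef = logreg.coef_.ravel()
selected = np.flatnonzero(coef)  # features kept by the L1 penalty
train_accuracy = model.score(X, y)
```

Features whose coefficients are driven exactly to zero are dropped, so selection and fitting happen in one step instead of the separate select-then-fit procedure used with p-values or RFE.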

### **🔶 9. Multi-Class Classification**

- Trained an **L1-penalized multinomial regression model** to classify all activities.

- Evaluated performance using **confusion matrices** and **multi-class ROC curves**.

- Compared the logistic regression model against **Naïve Bayes classifiers** with two class-conditional feature models:

  - **Gaussian**

  - **Multinomial**

- Determined the **best classification method** for this problem.
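
The comparison can be sketched with cross-validated accuracy on synthetic three-class data, as below. The data, the value of `C`, and the scaling choices are assumptions. `MultinomialNB` requires non-negative inputs, so features are rescaled to [0, 1] for that model; with the `saga` solver, recent scikit-learn versions fit a multinomial (softmax) model for multi-class targets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic three-class data with well-separated class means.
rng = np.random.default_rng(0)
n_per_class, n_features = 40, 4
X = np.vstack([rng.normal(loc=3.0 * k, size=(n_per_class, n_features))
               for k in range(3)])
y = np.repeat(np.arange(3), n_per_class)

models = {
    "l1_multinomial_logreg": make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=5000),
    ),
    "gaussian_nb": GaussianNB(),
    # clip=True keeps held-out values inside [0, 1], so inputs stay non-negative.
    "multinomial_nb": make_pipeline(MinMaxScaler(clip=True), MultinomialNB()),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {name: cross_val_score(m, X, y, cv=cv).mean() for name, m in models.items()}
best_model = max(scores, key=scores.get)
```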

---

## 📌 **Note**

This repository contains a **Jupyter Notebook** detailing each step, along with **results and visualizations**.
