https://github.com/ilke-kas/multivariate-data-analysis
A curated collection of R-based data analysis projects applying regression modeling, clustering, dimensionality reduction, multivariate statistics, and classification. Each project showcases practical data science techniques, interpretability, and domain insights using real-world and academic datasets.
- Host: GitHub
- URL: https://github.com/ilke-kas/multivariate-data-analysis
- Owner: ilke-kas
- Created: 2025-06-26T15:19:33.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-06-26T16:41:48.000Z (12 months ago)
- Last Synced: 2025-06-26T16:41:54.618Z (12 months ago)
- Topics: classification, data-analysis, data-visualization, dimensionality-reduction, machine-learning, multivariate-analysis, r, regression, statistics
- Language: R
- Homepage:
- Size: 8.32 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Analysis Projects in R
This repository showcases a variety of statistical and machine learning techniques implemented in R. Each project involves real-world or academic datasets, and highlights my ability to perform exploratory data analysis, dimensionality reduction, regression modeling, multivariate statistics, and classification. The code includes reproducible scripts and interpretable outputs that demonstrate a solid understanding of both statistical theory and applied data science.
---
## Clustering and Dimensionality Reduction
**File:** `Clustering-and-Dimensionality-Reduction-R.pdf`
**Techniques Used:**
- Nonlinear Dimensionality Reduction: Multidimensional Scaling (MDS), Kernel PCA (polydot, laplace, RBF kernels)
- Clustering: K-Means, Elbow Method, Visual Analysis
- Data Visualization: Correlation plots, pairwise scatterplots
**Functionality:**
- Identified that PCA was inappropriate due to non-elliptical data distribution
- Applied and compared nonlinear dimensionality reduction methods
- Determined optimal number of clusters using WSS and visual interpretation
- Compared cluster separability using reduced feature spaces
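
The elbow workflow described above can be sketched in base R. This is a minimal illustration on simulated data, not the analysis from the report: the toy dataset, the choice of classical MDS via `cmdscale()`, and the range of `k` are all assumptions, and the kernel PCA step is not shown.

```r
# Toy data standing in for the project's dataset (assumption: three groups in 4-D).
set.seed(42)
centers <- rbind(c(0, 0, 0, 0), c(4, 4, 0, 0), c(0, 4, 4, 0))
X <- do.call(rbind, lapply(1:3, function(g) {
  matrix(rnorm(50 * 4), ncol = 4) + matrix(centers[g, ], 50, 4, byrow = TRUE)
}))

# Classical (metric) MDS on Euclidean distances of the standardized features.
embedding <- cmdscale(dist(scale(X)), k = 2)

# Elbow method: total within-cluster sum of squares (WSS) for k = 1..8.
wss <- sapply(1:8, function(k) {
  kmeans(embedding, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is the k after which WSS stops dropping sharply.
print(round(wss, 1))
```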
---
## Regression Modelling on Retail Sales
**File:** `regression-modelling.pdf`
**Dataset:** Carseats (ISLR2)
**Techniques Used:**
- Linear Regression (first-order and second-order models)
- Model Selection via Residual Diagnostics and Adjusted R²
- Train/Test Split and Model Evaluation (RMSE, NMSE)
- Interaction Terms and Polynomial Features
**Functionality:**
- Built a linear model to predict sales from multiple features
- Evaluated residuals to assess model fit and assumptions
- Compared first- and second-order models to capture nonlinear trends
- Achieved high model performance with meaningful business insights
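
A first-order versus second-order comparison of this kind might look like the sketch below. It uses simulated data rather than Carseats so that it runs without extra packages. The column names, the 80/20 split, and the NMSE definition (test MSE divided by the variance of the test response) are assumptions, not details taken from the report.

```r
set.seed(1)
n <- 400
# Simulated stand-in for a retail dataset; column names are hypothetical.
price <- runif(n, 50, 150)
advertising <- runif(n, 0, 30)
sales <- 15 - 0.08 * price + 0.5 * advertising - 0.012 * advertising^2 + rnorm(n)
dat <- data.frame(sales, price, advertising)

# 80/20 train/test split.
train_idx <- sample(n, size = floor(0.8 * n))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# First-order model: main effects only.
fit1 <- lm(sales ~ price + advertising, data = train)
# Second-order model: main effects, their interaction, and squared terms.
fit2 <- lm(sales ~ price * advertising + I(price^2) + I(advertising^2), data = train)

mse <- function(fit, newdata) mean((newdata$sales - predict(fit, newdata))^2)
rmse <- function(fit, newdata) sqrt(mse(fit, newdata))
# NMSE here means MSE normalized by the variance of the observed response (assumption).
nmse <- function(fit, newdata) mse(fit, newdata) / var(newdata$sales)

results <- data.frame(
  model = c("first-order", "second-order"),
  adj_r2 = c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared),
  rmse = c(rmse(fit1, test), rmse(fit2, test)),
  nmse = c(nmse(fit1, test), nmse(fit2, test))
)
print(results)
```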
---
## Multivariate Normal and Conditional Expectation
**File:** `conditional_expectation_mvn.pdf`
**Focus:** Mathematical Derivations in Multivariate Statistics
**Topics Covered:**
- Properties of Multivariate Normal (MVN) distributions
- Conditional Expectation and Variance
- Matrix Algebra in Statistical Inference
- Vector-Valued Linear Regression under MVN
**Functionality:**
- Derived expressions for E[Y|X], Var(Y|X), and regression coefficients
- Compared regression of Y on X vs. X on Y to illustrate asymmetry
- Provided interpretations in terms of covariance structures and independence
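
For reference, the standard results behind these derivations can be stated as follows. The block notation is an assumption chosen for illustration and may differ from the notation in the PDF. The two coefficient matrices in the last line are generally not inverses or transposes of each other, which is the asymmetry between regressing Y on X and X on Y; when the cross-covariance is zero, the conditional mean and variance reduce to the marginal ones.

```math
\begin{aligned}
\begin{pmatrix} X \\ Y \end{pmatrix}
  &\sim \mathcal{N}\!\left(
    \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix},
    \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}
  \right) \\
E[Y \mid X = x] &= \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (x - \mu_X) \\
\operatorname{Var}(Y \mid X = x) &= \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY} \\
B_{Y \mid X} &= \Sigma_{YX} \Sigma_{XX}^{-1}, \qquad
B_{X \mid Y} = \Sigma_{XY} \Sigma_{YY}^{-1}
\end{aligned}
```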
---
## Insurance Cost Modeling and EDA
**File:** `insurance_cost_modelling.pdf`
**Dataset:** `insurance.csv`
**Techniques Used:**
- Exploratory Data Analysis (EDA): Histograms, Boxplots, Skim Summary
- Correlation Analysis and Visualization
- Linear Regression (with interaction terms)
- Feature Impact Interpretation
**Functionality:**
- Identified key cost drivers: smoking status, age, BMI
- Created group-level summaries by sex, region, and smoker status
- Built and interpreted multiple linear regression models
- Highlighted how interactions like `bmi × smoker` amplify insurance costs
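
The interaction effect can be illustrated with a small sketch. The data below are simulated, and the column names (`age`, `bmi`, `smoker`, `charges`) and effect sizes are assumptions for illustration, not values from `insurance.csv`. A positive `bmi:smokeryes` coefficient means each extra BMI point costs more for smokers than for non-smokers.

```r
set.seed(7)
n <- 500
# Simulated stand-in for an insurance dataset; columns and effects are hypothetical.
ins <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = rnorm(n, mean = 30, sd = 6),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
is_smoker <- as.numeric(ins$smoker == "yes")
ins$charges <- 2000 + 250 * ins$age + 50 * ins$bmi +
  1400 * ins$bmi * is_smoker - 20000 * is_smoker + rnorm(n, sd = 3000)

# bmi * smoker expands to bmi + smoker + bmi:smoker.
fit <- lm(charges ~ age + bmi * smoker, data = ins)
print(round(coef(fit), 1))

# Group-level summary: mean charges by smoker status.
group_means <- aggregate(charges ~ smoker, data = ins, FUN = mean)
print(group_means)
```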
---
## Cancer Type Classification with the Khan Dataset
**File:** `khan-dataset-classifier-comparison.pdf`
**Dataset:** Khan Gene Expression Dataset (ISLR2)
**Techniques Used:**
- High-Dimensional Classification:
  - Linear Discriminant Analysis (LDA)
  - Regularized Discriminant Analysis (RDA)
  - Naive Bayes
- Model Evaluation (Accuracy on Test Data)
- Covariance Analysis using Frobenius Norm
**Functionality:**
- Handled 2,308-dimensional gene expression data with only 63 training samples
- Justified LDA, which pools one covariance matrix across classes, because per-class covariance estimates are unstable in high dimensions
- Achieved 75% accuracy with LDA vs. 25% with RDA and Naive Bayes
- Explained limitations of complex models under data scarcity
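
The covariance instability mentioned above can be seen in a small simulation. This is a sketch under assumed settings (two classes, 15 samples each, 200 features rather than 2,308) and shows only one way the Frobenius norm might be used to compare covariance estimates. It is not the analysis from the report. Both classes are drawn with the same true covariance, yet the two sample estimates differ substantially.

```r
set.seed(3)
n_per_class <- 15   # few samples per class (assumed value)
p <- 200            # kept far below 2,308 so the example runs quickly

# Both classes share the same true covariance (the identity matrix).
X1 <- matrix(rnorm(n_per_class * p), nrow = n_per_class)
X2 <- matrix(rnorm(n_per_class * p), nrow = n_per_class)

S1 <- cov(X1)
S2 <- cov(X2)
S_pooled <- ((n_per_class - 1) * S1 + (n_per_class - 1) * S2) / (2 * n_per_class - 2)

# A p x p sample covariance from n rows has rank at most n - 1, so it is singular here.
rank_S1 <- qr(S1)$rank

# Frobenius-norm gap between the class estimates, relative to the pooled estimate.
rel_gap <- norm(S1 - S2, type = "F") / norm(S_pooled, type = "F")

cat("rank of S1:", rank_S1, "of", p, "\n")
cat("relative Frobenius gap:", round(rel_gap, 2), "\n")
```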
---
## Conditional Expectation and Classification with Multivariate Analysis
**File:** `multivariate_detailed_classification.pdf`
**Topics Covered:**
- Conditional Expectation under MVN
- Weighted Estimation (MSE and MAE loss minimization)
- Multivariate Classification: Box's M Test, LDA, QDA, Logistic Regression
- ROC Analysis and Confusion Matrices
**Functionality:**
- Derived and interpreted conditional expectations using matrix notation
- Used ROC curves and accuracy metrics to evaluate classifiers
- Compared multivariate classification models under different assumptions
- Visualized results using `ggplot2`, `patchwork`, and `pROC` libraries
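
A classifier comparison along these lines could be sketched as follows. The data are simulated with unequal class covariances, which is an assumption. The AUC is computed by hand from ranks instead of with `pROC` so that the example depends only on `MASS`, which ships with standard R installations. Box's M test is not shown.

```r
library(MASS)  # provides lda(), qda(), and mvrnorm()

set.seed(11)
n <- 300
# Simulated two-class data with unequal covariance matrices (an assumption).
class0 <- mvrnorm(n, mu = c(0, 0), Sigma = diag(2))
class1 <- mvrnorm(n, mu = c(1.5, 1.5), Sigma = matrix(c(2, 0.8, 0.8, 1), 2))
dat <- data.frame(
  x1 = c(class0[, 1], class1[, 1]),
  x2 = c(class0[, 2], class1[, 2]),
  y  = factor(rep(c("0", "1"), each = n))
)

train_idx <- sample(nrow(dat), size = 400)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

lda_fit <- lda(y ~ x1 + x2, data = train)
qda_fit <- qda(y ~ x1 + x2, data = train)
glm_fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# Predicted probability of class "1" from each model.
scores <- list(
  lda = predict(lda_fit, test)$posterior[, "1"],
  qda = predict(qda_fit, test)$posterior[, "1"],
  logistic = predict(glm_fit, test, type = "response")
)

# AUC via the rank (Mann-Whitney) formulation.
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == "1")
  n0 <- sum(label == "0")
  (sum(r[label == "1"]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
aucs <- sapply(scores, auc, label = test$y)
print(round(aucs, 3))

# Confusion matrix for LDA, using its most probable class as the prediction.
confusion_lda <- table(predicted = predict(lda_fit, test)$class, actual = test$y)
print(confusion_lda)
```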
---
## Skills Demonstrated
- R Programming
- Data Cleaning & Wrangling
- Linear and Nonlinear Modeling
- Dimensionality Reduction
- Classification and Discriminant Analysis
- Exploratory and Inferential Statistics
- Matrix Algebra in Statistics
- Visualization with `ggplot2`, `corrplot`, `patchwork`, `pROC`
---
## How to Use This Repo
Each project is a standalone file (PDF or R Markdown) with code, outputs, and commentary.
They reflect a variety of scenarios encountered in applied statistics and data science roles.
---
## Author
**İlke Kaş**
PhD Researcher | Data Analyst | Machine Learning Enthusiast
[LinkedIn](https://www.linkedin.com/in/ilkekas/) · [GitHub](https://github.com/ilke-kas)