https://github.com/thevinh-ha-1710/rstudio-statistics
This project deeply studies 2 datasets using applied statistics techniques.
https://github.com/thevinh-ha-1710/rstudio-statistics
applied-statistics data-analysis data-science data-visualization rmarkdown rstudio
Last synced: 5 months ago
JSON representation
This project deeply studies 2 datasets using applied statistics techniques.
- Host: GitHub
- URL: https://github.com/thevinh-ha-1710/rstudio-statistics
- Owner: TheVinh-Ha-1710
- Created: 2023-11-11T15:54:58.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-02-23T23:31:24.000Z (9 months ago)
- Last Synced: 2025-03-06T06:36:41.862Z (9 months ago)
- Topics: applied-statistics, data-analysis, data-science, data-visualization, rmarkdown, rstudio
- Homepage:
- Size: 897 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# RStudio Statistical Analysis Project
## Project Overview
This project involves conducting deep statistical analysis using RStudio on traffic and cake datasets. The study explores relationships between variables using regression models, evaluates model significance with statistical tests, and visualizes results using various plots. The project also includes a comprehensive study report created using RMarkdown.
## Datasets Used
- **Traffic Dataset**: Used to analyze relationships between different predictors and the response variable `spi`.
- **Cake Dataset**: Used for a two-way ANOVA study to examine the effects of `Recipe` and `Temp` on `Angle`.
## Key Analysis Steps
### **Traffic Dataset Analysis**
1. **Exploratory Data Analysis (EDA)**
- Correlation analysis with correlation matrices and scatterplot matrices.
- Identification of relationships between predictors and the response variable `spi`.
2. **Regression Model Fitting**
- Full multiple linear regression model `lm(spi ~ ., data = traffic)`.
- 95% confidence interval estimation for `weather` impact.
- F-test for overall regression significance.
3. **Model Diagnostics**
- Residual plots and normality tests.
- Finding the best regression model through backward elimination.
- Comparing adjusted R² values to determine optimal model selection.
### **Cake Dataset Analysis**
1. **Balanced Study Checking**
- Ensuring equal number of replicates across factor levels.
2. **Preliminary Data Visualization**
- Boxplots and interaction plots to analyze factor relationships.
3. **Two-Way ANOVA Analysis**
- Interaction model fitting to assess significance.
- Model validation through diagnostic plots.
4. **Main Effects Analysis**
- Hypothesis testing for `Recipe` and `Temp` significance.
- Final model selection without interaction term based on p-values.
## Tools & Libraries Used
- **RStudio** for data analysis and statistical modeling.
- **Base R functions** for data manipulation and visualization.
- **ggplot2 & Base R plotting functions** for visual representation.
- **RMarkdown** for creating a reproducible statistical report.
## Results & Findings
- A significant relationship was found between `spi` and predictors `weather`, `transport`, and `road` in the traffic dataset.
- The final regression model explained **73.8% of variation** in `spi`.
- The cake dataset analysis found that `Recipe` and `Temp` had significant main effects on `Angle`, but their interaction was not significant.
## How to Run the Project
1. Ensure R and RStudio are installed.
2. Place datasets (`traffic.csv`, `cake.csv`) inside the `data/` folder.
3. Open `report.Rmd` in RStudio.
4. Click **Knit** to generate the PDF report.