https://github.com/gexijin/inspect
A general approach to EDA
- Host: GitHub
- URL: https://github.com/gexijin/inspect
- Owner: gexijin
- Created: 2023-10-30T18:47:16.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-22T02:34:54.000Z (3 months ago)
- Last Synced: 2024-12-01T01:53:13.727Z (3 months ago)
- Language: HTML
- Size: 2.94 MB
- Stars: 28
- Watchers: 2
- Forks: 20
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- jimsghstars - gexijin/inspect - A general approach to EDA (HTML)
README
# Inspect: An R Package for automated EDA (exploratory analysis)
Written mostly by GPT-4, this R package renders an EDA report based on this [R Markdown file](https://raw.githubusercontent.com/gexijin/inspect/main/inst/eda.Rmd). It can generate an EDA [report like this](https://rpubs.com/ge600/eda) from any data set. You can also generate this report using the Shiny app [RTutor](https://RTutor.ai). For questions or feedback, contact [Steven Ge](https://www.linkedin.com/in/steven-ge-ab016947/).
# Install & use
```
# install.packages("remotes") # if remotes is not yet installed
library("remotes")
install_github("gexijin/inspect")
library(inspect)
eda(mtcars) # Generate an EDA report for a data frame, e.g., mtcars
eda(iris, "Species") # Specify a dependent/target variable
```
# Main goal
Exploratory data analysis (EDA) is an essential first step in any data science project. Consider it the equivalent of an annual doctor’s check-up but for data science projects. I have long believed that EDA can be automated as the tasks are very general. While there are existing R packages for EDA such as DataExplorer, summarytools, tableone, and GGally, I have not found what I was looking for. Leveraging GPT-4, I was able to create an EDA script in just a few hours.
Given a data set, the main idea is to streamline these steps:
1. Start with a data summary.
2. Check for missing values and outliers.
3. Plot the distribution of numerical variables using histograms and QQ plots. When excessive skewness is present, a log transformation is recommended.
4. Examine the distribution of categorical variables.
5. Provide a general data overview with a heatmap and a correlation plot.
6. Show a correlation matrix (corrplot).
7. Use scatter plots to examine correlations between numerical variables.
8. Use violin plots and ANOVA to study the differences between groups defined by categorical variables.
9. Test whether categorical variables are independent of each other, using Chi-squared tests and bar plots.
To use this RMarkdown file, you just need to obtain a copy from this GitHub repository. Replace the demo data file with your own, specify a target variable, and you’re ready to render the report.
If that sounds like too much work, simply upload your data file to [RTutor.ai](https://RTutor.ai), and click on the EDA tab. A comprehensive report will be generated in 2 minutes. The template was originally written for RTutor.
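As a rough illustration of what several of the numbered steps above involve, here is a minimal base-R sketch. It is not the code from `eda.Rmd`. The 1.5 * IQR outlier rule, the skewness cutoff of 1, and the choice to treat `cyl` and `am` in `mtcars` as categorical variables are all assumptions made for this example.

```r
# Minimal base-R sketch of steps 1, 2, 3, 8 and 9; not the package's own code.
df <- iris            # demo data frame shipped with R
target <- "Species"   # target variable, as in eda(iris, "Species")

# Step 1: data summary
print(summary(df))

# Step 2: missing values per column
missing_counts <- colSums(is.na(df))

# Step 2 (cont.): count outliers with the 1.5 * IQR rule (an assumed rule)
is_num <- vapply(df, is.numeric, logical(1))
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}
outlier_counts <- vapply(df[is_num], count_outliers, integer(1))

# Step 3: sample skewness; a cutoff of 1 is an assumption, and a log
# transformation only makes sense for strictly positive values
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
skew <- vapply(df[is_num], skewness, numeric(1))
suggest_log <- names(skew)[abs(skew) > 1]

# Step 8: one-way ANOVA of each numeric variable against the target
anova_p <- vapply(names(df)[is_num], function(v) {
  fit <- aov(df[[v]] ~ df[[target]])
  summary(fit)[[1]][["Pr(>F)"]][1]
}, numeric(1))

# Step 9: chi-squared test of independence between two categorical variables
# (cyl and am from mtcars are treated as categorical for this example)
tab <- table(mtcars$cyl, mtcars$am)
chi <- suppressWarnings(chisq.test(tab))

print(missing_counts)
print(outlier_counts)
print(round(skew, 2))
print(anova_p)
print(chi)
```

The plotting steps (histograms, QQ plots, heatmaps, violin plots) are left out to keep the sketch short. The package and RTutor generate those as part of the rendered report.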
# Example plots
[Figures: Missing, Correlation, Heatmap, Histogram, Barplot, Scatter plot, Boxplot, Combination]