{"id":13707887,"url":"https://github.com/daya6489/SmartEDA","last_synced_at":"2025-05-06T04:31:07.506Z","repository":{"id":56934878,"uuid":"184899407","full_name":"daya6489/SmartEDA","owner":"daya6489","description":"a R package for data exploratory analysis","archived":false,"fork":false,"pushed_at":"2024-01-30T17:53:46.000Z","size":18421,"stargazers_count":43,"open_issues_count":1,"forks_count":15,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-05-03T06:44:52.015Z","etag":null,"topics":["analysis","exploratory-data-analysis"],"latest_commit_sha":null,"homepage":"https://daya6489.github.io/SmartEDA/","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/daya6489.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-05-04T13:37:09.000Z","updated_at":"2025-04-14T09:06:52.000Z","dependencies_parsed_at":"2024-02-04T20:02:03.194Z","dependency_job_id":"6424032c-ed8d-43cc-8f4a-58977676eb76","html_url":"https://github.com/daya6489/SmartEDA","commit_stats":{"total_commits":32,"total_committers":3,"mean_commits":"10.666666666666666","dds":0.15625,"last_synced_commit":"c72fa59fc9fdff7190c8bbe06dce3914ac840ba9"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daya6489%2FSmartEDA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daya6489%2FSmartEDA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daya6489%2FSmartEDA
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daya6489%2FSmartEDA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/daya6489","download_url":"https://codeload.github.com/daya6489/SmartEDA/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252621875,"owners_count":21777884,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","exploratory-data-analysis"],"created_at":"2024-08-02T22:01:46.987Z","updated_at":"2025-05-06T04:31:07.499Z","avatar_url":"https://github.com/daya6489.png","language":"HTML","funding_links":[],"categories":["HTML"],"sub_categories":[],"readme":"# SmartEDA [![CRAN status](https://www.r-pkg.org/badges/version/SmartEDA)](https://cran.r-project.org/package=SmartEDA) \u003cimg src=\"man/figures/smarteda_logo.png\" align=\"right\" width=\"130\" height=\"150\"/\u003e\n\n[![Downloads](http://cranlogs.r-pkg.org/badges/SmartEDA)](https://cran.r-project.org/package=SmartEDA)\n[![status](https://joss.theoj.org/papers/10.21105/joss.01509/status.svg)](https://joss.theoj.org/papers/10.21105/joss.01509)\n[![Total Downloads](http://cranlogs.r-pkg.org/badges/grand-total/SmartEDA)](https://cran.r-project.org/package=SmartEDA)\n[![GitHub Stars](https://img.shields.io/github/stars/daya6489/SmartEDA.svg?style=social)](https://img.shields.io/github/stars/daya6489/SmartEDA)\n\n\n---\n\n# Background\nIn a quality statistical data analysis the initial step has to be exploratory. 
Exploratory data analysis begins with univariate analysis, examining the variables one at a time. Next comes bivariate analysis, followed by multivariate analysis. The SmartEDA package produces a complete exploratory data analysis just by running its functions instead of writing lengthy R code.\n\n# Functionalities of SmartEDA\n\nThe SmartEDA R package has four main functionalities:\n\n* Descriptive statistics\n* Data visualization\n* Custom table\n* HTML EDA report\n\n![SmartEDA](https://github.com/daya6489/SmartEDA/blob/master/man/figures/smarteda_funtions.PNG)\n\n# Comparison with other packages\n\nThe figure below compares SmartEDA with similar packages available on CRAN for exploratory data analysis, viz. dlookr, DataExplorer, Hmisc, exploreR, RtutoR and summarytools. The metric for evaluation is the availability of various desired features for performing an exploratory data analysis.\n\n![SmartEDA](https://github.com/daya6489/SmartEDA/blob/master/man/figures/SmartEDA_comp.PNG)\n\n# Journal of Open Source Software Article\nAn article describing the SmartEDA package for exploratory data analysis has been published on [arXiv](https://arxiv.org/pdf/1903.04754.pdf) and in the Journal of Open Source Software [JOSS](https://joss.theoj.org/papers/10.21105/joss.01509). Please cite the paper if you use SmartEDA in your work!\n\n# Installation\n\nThe package can be installed directly from CRAN.\n\n```R\ninstall.packages(\"SmartEDA\")\n```\n\nYou can install the latest development version of [SmartEDA](https://github.com/daya6489/SmartEDA) from GitHub with:\n\t\n```R\ninstall.packages(\"devtools\")\ndevtools::install_github(\"daya6489/SmartEDA\",ref = \"develop\")\n```\n\n# Example\n\n## Data\nIn this vignette, we will be using a simulated data set containing sales of child car seats at 400 different stores. 
\n\nData Source [ISLR package](https://www.rdocumentation.org/packages/ISLR/versions/1.2/topics/Carseats).\n\nInstall the package \"ISLR\" to get the example data set.\n\n```R\n\tinstall.packages(\"ISLR\")\n\tlibrary(\"ISLR\")\n\tinstall.packages(\"SmartEDA\")\n\tlibrary(\"SmartEDA\")\n\t## Load sample dataset from ISLR package\n\tCarseats= ISLR::Carseats\n```\n\n## Overview of the data\nUnderstanding the dimensions of the data set, variable names, overall missing summary and data type of each variable.\n\n```R\n## overview of the data; \n\tExpData(data=Carseats,type=1)\n## structure of the data\t\n\tExpData(data=Carseats,type=2)\n```\n\n## Summary of numerical variables\nTo summarise the numeric variables, you can use the following R code from this package:\n\n```R\n## Summary statistics by – overall\n\tExpNumStat(Carseats,by=\"A\",gp=NULL,Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)\n## Summary statistics by – overall with correlation\t\n\tExpNumStat(Carseats,by=\"A\",gp=\"Price\",Qnt=seq(0,1,0.1),MesofShape=1,Outlier=TRUE,round=2)\n## Summary statistics by – category\n\tExpNumStat(Carseats,by=\"GA\",gp=\"Urban\",Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)\n```\n\n## Weighted summary for numerical variables\n```R\nExpNumStat(mtcars,by=\"A\",round=2, weight = \"wt\")\n```\n\n## Graphical representation of all numeric features\n\n```R\n## Generate Boxplot by category\nExpNumViz(mtcars,target=\"gear\",type=2,nlim=25,fname = file.path(tempdir(),\"Mtcars2\"),Page = c(2,2))\n## Generate Density plot\nExpNumViz(mtcars,target=NULL,type=3,nlim=25,fname = file.path(tempdir(),\"Mtcars3\"),Page = c(2,2))\n## Generate Scatter plot\nExpNumViz(mtcars,target=\"carb\",type=3,nlim=25,fname = file.path(tempdir(),\"Mtcars4\"),Page = c(2,2))\n\n```\n\n## Summary of Categorical variables\t\n\n```R\n## Frequency or custom tables for categorical 
variables\n\tExpCTable(Carseats,Target=NULL,margin=1,clim=10,nlim=5,round=2,bin=NULL,per=TRUE)\n\tExpCTable(Carseats,Target=\"Price\",margin=1,clim=10,nlim=NULL,round=2,bin=4,per=FALSE)\n\tExpCTable(Carseats,Target=\"Urban\",margin=1,clim=10,nlim=NULL,round=2,bin=NULL,per=FALSE)\t\n\n## Summary statistics of categorical variables\n\tExpCatStat(Carseats,Target=\"Urban\",result = \"Stat\",clim=10,nlim=5,Pclass=\"Yes\")\n## Information value and Odds value\n\tExpCatStat(Carseats,Target=\"Urban\",result = \"IV\",clim=10,nlim=5,Pclass=\"Yes\")\n```\n## Weighted count for categorical variables\n```R\nExpCTable(mtcars,  margin = 1, clim = 10, nlim = 3, bin = NULL, per = FALSE, weight = \"wt\")\n```\n\n## Graphical representation of all categorical variables\n\n```R\n## column chart\n\tExpCatViz(Carseats,target=\"Urban\",fname=NULL,clim=10,col=NULL,margin=2,Page = c(2,1),sample=2)\n## Stacked bar graph\n\tExpCatViz(Carseats,target=\"Urban\",fname=NULL,clim=10,col=NULL,margin=2,Page = c(2,1),sample=2)\n## Variable importance graph using information values\n  ExpCatStat(Carseats,Target=\"Urban\",result=\"Stat\",Pclass=\"Yes\",plot=TRUE,top=20,Round=2)\n```\n## Variable importance based on Information value\n\n```R\n  ExpCatStat(Carseats,Target=\"Urban\",result = \"Stat\",clim=10,nlim=5,bins=10,Pclass=\"Yes\",plot=TRUE,top=10,Round=2)\n```\n\n## Create HTML EDA report\nCreate an exploratory data analysis report in HTML format.\n\n```R\n\tExpReport(Carseats,Target=\"Urban\",label=NULL,op_file=\"test.html\",op_dir=getwd(),sc=2,sn=2,Rc=\"Yes\")\n```\n\n## Quantile-quantile plot for numeric variables\n\n```R\n\tExpOutQQ(Carseats,nlim=10,fname=NULL,Page=c(2,2),sample=4)\n```\n![](man/figures/qqplot-1-1.png)\n\n## Parallel Co-ordinate plots\n\n```R\n## Default ExpParcoord function\n\tExpParcoord(Carseats,Group=NULL,Stsize=NULL,Nvar=c(\"Price\",\"Income\",\"Advertising\",\"Population\",\"Age\",\"Education\"))\n## With Stratified rows and selected columns only\n  
ExpParcoord(Carseats,Group=\"ShelveLoc\",Stsize=c(10,15,20),Nvar=c(\"Price\",\"Income\"),Cvar=c(\"Urban\",\"US\"))\n## Without stratification\n  ExpParcoord(Carseats,Group=\"ShelveLoc\",Nvar=c(\"Price\",\"Income\"),Cvar=c(\"Urban\",\"US\"),scale=NULL)\n## Scale change  \n  ExpParcoord(Carseats,Group=\"US\",Nvar=c(\"Price\",\"Income\"),Cvar=c(\"ShelveLoc\"),scale=\"std\")\n## Selected numeric variables\n  ExpParcoord(Carseats,Group=\"ShelveLoc\",Stsize=c(10,15,20),Nvar=c(\"Price\",\"Income\",\"Advertising\",\"Population\",\"Age\",\"Education\"))\n## Selected categorical variables\n  ExpParcoord(Carseats,Group=\"US\",Stsize=c(15,50),Cvar=c(\"ShelveLoc\",\"Urban\"))\n```\n![](man/figures/ccp6-1.png)\n\n## Two independent plots side by side for the same variable\n\nPlots the same variable twice, side by side: once when Target=NULL and once when Target is a categorical variable (binary or multi-class).\n\n```R\ntarget = \"gear\"\ncategorical_features \u003c- c(\"vs\", \"am\", \"carb\")\nnumerical_features \u003c- c(\"mpg\", \"cyl\", \"disp\", \"hp\", \"drat\", \"wt\", \"qsec\")\n\nnum_1 \u003c- ExpTwoPlots(mtcars, \n                     plot_type = \"numeric\",\n                     iv_variables = numerical_features,\n                     target = \"gear\",\n                     lp_arg_list = list(alpha=0.5, color = \"red\", fill= \"white\", binwidth=1),\n                     lp_geom_type = 'histogram',\n                     rp_arg_list = list(alpha=0.5, fill = c(\"red\", \"orange\", \"pink\"),  binwidth=1),\n                     rp_geom_type = 'histogram',\n                     fname = \"dub2.pdf\",\n                     page = c(2,1),\n                     theme = \"Default\")\n\n```\n![](man/figures/c24_1.png)\n\n## Univariate Outlier analysis\n\nIn statistics, an outlier is a data point that differs significantly from other observations. 
An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.\n\nIdentifying outliers: There are several methods we can use to identify outliers. ExpOutliers uses two methods: (1) boxplot and (2) standard deviation.\n\n![SmartEDA](https://github.com/daya6489/SmartEDA/blob/master/man/figures/outlierPlot_image.png)\n\n\n```R\n##Identifying outliers method - Boxplot\nExpOutliers(Carseats, varlist = c(\"Sales\",\"CompPrice\",\"Income\"), method = \"boxplot\",  capping = c(0.1, 0.9))\n\n##Identifying outliers method - 3 Standard Deviation\nExpOutliers(Carseats, varlist = c(\"Sales\",\"CompPrice\",\"Income\"), method = \"3xStDev\",  capping = c(0.1, 0.9))\n\n##Identifying outliers method - 2 Standard Deviation\nExpOutliers(Carseats, varlist = c(\"Sales\",\"CompPrice\",\"Income\"), method = \"2xStDev\",  capping = c(0.1, 0.9))\n\n\n##Create outlier flag (1,0) if there are any outliers \nExpOutliers(Carseats, varlist = c(\"Sales\",\"CompPrice\",\"Income\"), method = \"3xStDev\",  capping = c(0.1, 0.9), outflag = TRUE)\n\n##Impute outlier value by mean or median value\nExpOutliers(Carseats, varlist = c(\"Sales\",\"CompPrice\",\"Income\"), method = \"3xStDev\", treatment = \"mean\", capping = c(0.1, 0.9), outflag = TRUE)\n\n```\n\n## Exploratory analysis - Custom tables, summary statistics\nDescriptive summary of all input variables for each level/combination of the group variables. Rows/cases of the data can also be filtered while running the analysis. 
\n\n```R\n\tExpCustomStat(Carseats,Cvar=c(\"US\",\"Urban\",\"ShelveLoc\"),gpby=FALSE)\n\tExpCustomStat(Carseats,Cvar=c(\"US\",\"Urban\"),gpby=TRUE,filt=NULL)\n\tExpCustomStat(Carseats,Cvar=c(\"US\",\"Urban\",\"ShelveLoc\"),gpby=TRUE,filt=NULL)\n\tExpCustomStat(Carseats,Cvar=c(\"US\",\"Urban\"),gpby=TRUE,filt=\"Population\u003e150\")\n\tExpCustomStat(Carseats,Cvar=c(\"US\",\"ShelveLoc\"),gpby=TRUE,filt=\"Urban=='Yes' \u0026 Population\u003e150\")\n\tExpCustomStat(Carseats,Nvar=c(\"Population\",\"Sales\",\"CompPrice\",\"Income\"),stat = c('Count','mean','sum','var','min','max'))\n\tExpCustomStat(Carseats,Nvar=c(\"Population\",\"Sales\",\"CompPrice\",\"Income\"),stat = c('min','p0.25','median','p0.75','max'))\n\tExpCustomStat(Carseats,Nvar=c(\"Population\",\"Sales\",\"CompPrice\",\"Income\"),stat = c('Count','mean','sum','var'),filt=\"Urban=='Yes'\")\n\tExpCustomStat(Carseats,Nvar=c(\"Population\",\"Sales\",\"CompPrice\",\"Income\"),stat = c('Count','mean','sum'),filt=\"Urban=='Yes' \u0026 Population\u003e150\")\n\tExpCustomStat(data_sam,Nvar=c(\"Population\",\"Sales\",\"CompPrice\",\"Income\"),stat = c('Count','mean','sum','min'),filt=\"All %ni% c(999,-9)\")\n\tExpCustomStat(Carseats,Nvar=c(\"Population\",\"Sales\",\"CompPrice\",\"Education\",\"Income\"),stat = c('Count','mean','sum','var','sd','IQR','median'),filt=c(\"ShelveLoc=='Good'^Urban=='Yes'^Price\u003e=150^ ^US=='Yes'\"))\n\tExpCustomStat(Carseats,Cvar = c(\"Urban\",\"ShelveLoc\"), Nvar=c(\"Population\",\"Sales\"), stat = c('Count','Prop','mean','min','P0.25','median','p0.75','max'),gpby=FALSE)\n\tExpCustomStat(Carseats,Cvar = c(\"Urban\",\"US\",\"ShelveLoc\"), Nvar=c(\"CompPrice\",\"Income\"), stat = c('Count','Prop','mean','sum','PS','min','max','IQR','sd'), gpby = TRUE)\n\tExpCustomStat(Carseats,Cvar = c(\"Urban\",\"US\",\"ShelveLoc\"), Nvar=c(\"CompPrice\",\"Income\"), stat = c('Count','Prop','mean','sum','PS','P0.25','median','p0.75'), gpby = TRUE,filt=\"Urban=='Yes'\")\n\tExpCustomStat(data_sam,Cvar = 
c(\"Urban\",\"US\",\"ShelveLoc\"), Nvar=c(\"Sales\",\"CompPrice\",\"Income\"), stat = c('Count','Prop','mean','sum','PS'), gpby = TRUE,filt=\"All %ni% c(888,999)\")\n\tExpCustomStat(Carseats,Cvar = c(\"Urban\",\"US\"), Nvar=c(\"Population\",\"Sales\",\"CompPrice\"), stat = c('Count','Prop','mean','sum','var','min','max'), filt=c(\"ShelveLoc=='Good'^Urban=='Yes'^Price\u003e=150\"))\n```\n\n## Articles\n\nSee [article wiki page](https://github.com/daya6489/SmartEDA/wiki/Articles).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaya6489%2FSmartEDA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdaya6489%2FSmartEDA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaya6489%2FSmartEDA/lists"}