{"id":18561634,"url":"https://github.com/ajayarunachalam/eda","last_synced_at":"2025-04-10T03:31:00.367Z","repository":{"id":130447544,"uuid":"127417294","full_name":"ajayarunachalam/EDA","owner":"ajayarunachalam","description":"Exploratory Data Analysis","archived":false,"fork":false,"pushed_at":"2020-09-10T20:34:32.000Z","size":743,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-24T15:47:22.116Z","etag":null,"topics":["dashboard","data-analysis","eda","exploratory-data-analysis","python","r","visualization"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ajayarunachalam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-03-30T10:49:15.000Z","updated_at":"2022-10-10T14:47:30.000Z","dependencies_parsed_at":"2023-03-15T12:45:56.652Z","dependency_job_id":null,"html_url":"https://github.com/ajayarunachalam/EDA","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajayarunachalam%2FEDA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajayarunachalam%2FEDA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajayarunachalam%2FEDA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajayarunachalam%2FEDA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners
/ajayarunachalam","download_url":"https://codeload.github.com/ajayarunachalam/EDA/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248150771,"owners_count":21055980,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dashboard","data-analysis","eda","exploratory-data-analysis","python","r","visualization"],"created_at":"2024-11-06T22:07:30.268Z","updated_at":"2025-04-10T03:31:00.354Z","avatar_url":"https://github.com/ajayarunachalam.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# EDA\n\nLike many aspiring data scientists, I started my journey mostly doing Exploratory Data Analysis (EDA). \nLet us be clear about one thing: before we talk about AI, or in fact any machine learning, data analysis plays a very important role in the entire data science workflow. In fact, it takes up most of the time in that workflow.\n\nEDA is the initial, and an important, phase of the workflow. It gives a first look at the data and helps generate relevant hypotheses and decide on next steps. However, the EDA process can often be a hassle.\n\nThere are many libraries \u0026 packages for Python \u0026 R with which most basic data analysis can be done.\n\nHere I would like to share 2 interesting ones which I feel are the most suitable for quick analysis. This is my personal opinion. I am not saying that the others are not good. 
I acknowledge the authors for their wonderful contribution to the community.\n\nPython EDA package:-\n--------------------\npandas-profiling\n\nIt generates profile reports from a pandas DataFrame. \n\n----------------------------------------------------------\nFor each column, descriptive statistics appropriate to the column type are presented in an interactive HTML report, along with the following features:-\n\nEssentials: type, unique values, missing values\nQuantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range\n\nDescriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness\n\nMost frequent values\n\nHistogram\n\nCorrelations: highlighting of highly correlated variables, Spearman and Pearson matrices\n\nInstallation:-\n-------------\n\npip install pandas-profiling\n or \nconda install pandas-profiling\n\nR EDA package:-\n---------------\nDataExplorer\n\nCreates a report covering the entire EDA, rendered as an R Markdown HTML document.\n\nInstallation:-\n--------------\n\ninstall.packages(\"DataExplorer\")\n\n--------------------------------------------------------\n\nNow let us go through a simple exercise showing how to use each of these packages.\n\nDataset: IRIS \n(This dataset consists of 3 different types of iris flowers, namely Setosa, Versicolour, and Virginica)\n\nNote:- For the sake of simplicity \u0026 clear understanding I stuck to this dataset, but all the steps below will work for any dataset.\n\nPython code:-\n-------------\n\n# Import pandas \u0026 pandas_profiling libraries\n\nimport pandas as pd\n\nimport pandas_profiling\n\n# For IRIS dataset import the datasets from sklearn\nfrom sklearn import datasets\n\n# import data \niris = datasets.load_iris()\n\n# check type \ntype(iris)\n\n# Converting to pandas dataframe\niris_data = pd.DataFrame(iris.data, columns=iris.feature_names)\n\n# labeling the 
target\niris_data['target'] = iris['target']\n\n# get the number of rows and columns\niris_data.shape\n\n# list the columns\niris_data.columns\n\n# rename columns\niris_data.columns=['sepal_length', 'sepal_width', 'petal_length','petal_width', 'class']\n\niris_data.columns\n\n# see a few records\niris_data.head()\n\n# check the distribution of the class \niris_data['class'].value_counts()\n\n# To generate an inline report without saving the object\npandas_profiling.ProfileReport(iris_data)\n\n# To retrieve the list of variables which are rejected due to high correlation\n\nprofile = pandas_profiling.ProfileReport(iris_data)\n\n# Rejected variables\nrejected_variables = profile.get_rejected_variables(threshold=0.9)\n\nrejected_variables\n\n# To generate an HTML report file\nprofile = pandas_profiling.ProfileReport(iris_data)\n\n# output to html file\nprofile.to_file(outputfile=\"../profiling_iris.html\")\n\n\nR code:-\n--------\n### title: \"Create report performing EDA rendered as Rmarkdown html document with DataExplorer package\"###\n\n# Clear workspace and memory\nrm(list = ls()); gc(reset = TRUE)\n\nsetwd('C:\\\\Users\\\\Desktop\\\\WORKING\\\\EDA') # change this path to your working directory #\n\n# load the built-in iris dataset\n\ndata(iris)\n\n# set seed for reproducibility \nset.seed(250)\n\n# Let us begin our Exploratory Data Analysis by loading the library:\nlibrary(DataExplorer)\n\n# Variables\n# The very first thing that you'd want to do in your EDA is check the structure and dimensions of the input dataset.\nplot_str(iris)\n\n# Looking for Missing Values\nplot_missing(iris)\n\n# Histogram of Continuous Variables\nplot_histogram(iris)\n\n# Density plot\nplot_density(iris)\n\n# Colorful Correlation Plot\nplot_correlation(iris, type = 'all')\n\n# plot Categorical Variables - Barplots!\nplot_bar(iris) \n\n# Finally, create a consolidated report with create_report()\ncreate_report(iris) #comment this if you're not rendering this entire rmarkdown\n\nConclusion\n----------\nWe 
see that both packages aim to automate most of the data handling and visualization, so that users can focus on studying the insights.\n\n\nAcknowledgment\n--------------\n[1] Author: Jos Polfliet\nhttps://pypi.python.org/pypi/pandas-profiling\n\n[2] Author: Boxuan Cui\nhttps://cran.r-project.org/web/packages/DataExplorer/index.html\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fajayarunachalam%2Feda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fajayarunachalam%2Feda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fajayarunachalam%2Feda/lists"}