{"id":13738353,"url":"https://github.com/kianweelee/Edator","last_synced_at":"2025-05-08T16:33:24.469Z","repository":{"id":57425583,"uuid":"264381262","full_name":"kianweelee/Edator","owner":"kianweelee","description":" A python package that performs exploratory data analysis for users. Additionally, it generates 3 types of output files (cleaned CSV, plots and a text report).","archived":false,"fork":false,"pushed_at":"2020-09-11T13:12:06.000Z","size":356,"stargazers_count":76,"open_issues_count":0,"forks_count":9,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-05-01T14:45:49.026Z","etag":null,"topics":["data-analysis","data-science","exploratory-data-analysis"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kianweelee.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-05-16T07:16:01.000Z","updated_at":"2025-04-28T05:03:10.000Z","dependencies_parsed_at":"2022-08-29T22:00:34.549Z","dependency_job_id":null,"html_url":"https://github.com/kianweelee/Edator","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kianweelee%2FEdator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kianweelee%2FEdator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kianweelee%2FEdator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kianweelee%2FEdator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kianweelee","download_url"
:"https://codeload.github.com/kianweelee/Edator/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253105473,"owners_count":21855039,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-science","exploratory-data-analysis"],"created_at":"2024-08-03T03:02:19.721Z","updated_at":"2025-05-08T16:33:24.108Z","avatar_url":"https://github.com/kianweelee.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"![](https://raw.githubusercontent.com/kianweelee/Edator/master/Image/eau%20de%20parfum.png)\n# Edator\n\n[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)\n[![CodeFactor](https://www.codefactor.io/repository/github/kianweelee/edator/badge)](https://www.codefactor.io/repository/github/kianweelee/edator)\n[![GitHub license](https://img.shields.io/github/license/Naereen/StrapDown.js.svg)](https://github.com/Naereen/StrapDown.js/blob/master/LICENSE)\n![](https://img.shields.io/bitbucket/issues-raw/kianweelee/Edator)\n[![](https://img.shields.io/github/v/release/kianweelee/edator)](https://github.com/kianweelee/Edator/releases)\n![](https://img.shields.io/github/last-commit/kianweelee/edator)\n[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](https://github.com/kianweelee/Edator/pulls)\n[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/kianweelee/edator/issues)\n\nThis is a python package 
that performs exploratory data analysis for users. It takes in a csv file and generates 3 outputs: a text report containing a descriptive summary, a series of plots and a cleaned csv file.\n \n## Set up\n### Dependencies \n- Python 3.8.x\n- matplotlib==3.1.2\n- numpy==1.18.1\n- pandas==1.0.0\n- PySimpleGUI==4.19.0\n- scikit-learn==0.22.1\n- scipy==1.4.1\n- seaborn==0.10.0\n- statsmodels==0.11.1\n- more-itertools==8.3.0\n\n### How to set up? (**Important!**)\n1. Clone or download the package.\n2. Using a terminal, move to the package directory. \n   - Example for Mac OS users: \n   ```bash\n   $ cd Downloads/Edator\n   ```\n3. Install the required packages using:\n   ```bash\n   $ pip install -r requirements.txt\n   ```\n4. After that, change directory into the Script folder using:\n   ```bash\n   $ cd Script\n   ```\n5. Now, execute the main.py file with:\n   ```bash\n   $ python main.py\n   ```\n6. You should see the following:\n\n![](https://github.com/kianweelee/Edator/blob/master/Image/Screen%20Shot%202020-06-11%20at%208.32.55%20pm.png)\n\n7. Choose the format of the input file (csv or xls), the path to the file and the paths to export the plots, the report and the cleaned csv file to.\n\n8. Done!\n\n## The concept behind Edator\n\n### Dealing with NaN values and zeros\nI remove the affected rows only when the percentage of NaN values within a column is **less than 5%**. This applies to both numerical and categorical columns. For anything above 5%, I replace the NaN values in numerical columns with the median and the NaN values in categorical columns with the mode.\n\nDealing with zeros is much harder, as it is challenging to differentiate between a zero that is meaningful (has a purpose and should not be removed) and a zero that serves no purpose and can potentially add more noise to the dataset. Hence, I decided to inform the user about the percentage of zeros in the dataset.\n\n### Processing outliers\nI use Z-score to detect outliers. 
If a Z-score is 0, it indicates that the data point’s score is identical to the mean score. A Z-score of 1.0 indicates a value that is one standard deviation from the mean. Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative value indicating it is below the mean.\n\nIn most cases, a data point with a Z-score above 3 or below -3 is treated as an outlier and filtered out, and I have used this approach for all of my analysis.\n\n### Correlation\nFor correlation, I included:\n1. Pearson and Spearman correlation for numerical-numerical variables\n2. One Way ANOVA for numerical-categorical variables\n3. Chi-Square test for categorical-categorical variables\n\nUsing itertools.combinations, I identify every possible combination among numerical-numerical variables, numerical-categorical variables and categorical-categorical variables. I then apply the test that matches each pair type, as listed above.\n\n### Plots\nFor plots, I created:\n1. Scatterplot for numerical variables\n2. Countplot for categorical variables\n3. Boxplot for numerical-categorical variables\n\nSimilar to correlation, I used itertools.combinations to create every possible plot. I have also added the hue feature to the scatterplots, but only when the categorical variable has fewer than 5 unique values. For example, if hue = \"fruits\", there should be at most 4 types of fruits.\n\n### Upcoming changes for version 0.3\n1. Accept more input file formats beyond CSV and Excel.\n2. Based on user input, I will increase the variety of plots beyond scatterplots, barplots and boxplots.\n3. The generated report will be in HTML format. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkianweelee%2FEdator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkianweelee%2FEdator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkianweelee%2FEdator/lists"}