{"id":13407113,"url":"https://github.com/ayush1997/visualize_ML","last_synced_at":"2025-03-14T11:31:04.962Z","repository":{"id":62587585,"uuid":"64548922","full_name":"ayush1997/visualize_ML","owner":"ayush1997","description":"Python package for consolidated and extensive Univariate,Bivariate Data Analysis and Visualization catering to both categorical and continuous datasets.","archived":false,"fork":false,"pushed_at":"2016-09-28T17:34:24.000Z","size":210,"stargazers_count":197,"open_issues_count":0,"forks_count":29,"subscribers_count":14,"default_branch":"master","last_synced_at":"2024-10-01T07:11:23.427Z","etag":null,"topics":["data-analysis","machine-learning","matplotlib","python","statisics","visualization"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/visualize_ML/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ayush1997.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-07-30T14:49:57.000Z","updated_at":"2024-08-15T05:11:53.000Z","dependencies_parsed_at":"2022-11-03T22:10:17.933Z","dependency_job_id":null,"html_url":"https://github.com/ayush1997/visualize_ML","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ayush1997%2Fvisualize_ML","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ayush1997%2Fvisualize_ML/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ayush1997%2Fvisualize_ML/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ayush1997%2Fvisualize_ML/manifes
ts","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ayush1997","download_url":"https://codeload.github.com/ayush1997/visualize_ML/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243569347,"owners_count":20312410,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","machine-learning","matplotlib","python","statisics","visualization"],"created_at":"2024-07-30T20:00:21.536Z","updated_at":"2025-03-14T11:31:04.267Z","avatar_url":"https://github.com/ayush1997.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":["Data Exploration","General-Purpose Machine Learning"],"readme":"# visualize_ML\n\nvisualize_ML is a python package made to visualize some of the steps involved while dealing with a Machine Learning problem. 
It is built on libraries like matplotlib for visualization and sklearn and scipy for statistical computations.\n\n[![PyPI version](https://badge.fury.io/py/visualize_ML.svg)](https://badge.fury.io/py/visualize_ML)\n### Table of contents:\n* [Requirements](https://github.com/ayush1997/visualize_ML/#requirement)\n* [Install](https://github.com/ayush1997/visualize_ML/#install)\n* [Let's code](https://github.com/ayush1997/visualize_ML/#lets-code)\n\t* [explore module](https://github.com/ayush1997/visualize_ML/#-explore-module)\n\t* [relation module](https://github.com/ayush1997/visualize_ML/#-relation-module)\n* [Contribute](https://github.com/ayush1997/visualize_ML/#contribute)\n* [Tasks To Do](https://github.com/ayush1997/visualize_ML/#tasks-to-do)\n* [Licence](https://github.com/ayush1997/visualize_ML/#licence)\n* [Copyright](https://github.com/ayush1997/visualize_ML/#copyright)\n\n\n## Requirement\n\n* Python 2.x or Python 3.x\n\n## Install\nInstall the dependencies needed for matplotlib:\n\n\tsudo apt-get build-dep python-matplotlib\n\nInstall the package using pip:\n\n\tpip install visualize_ML\n\n\n## Let's Code\n\nWhen dealing with a Machine Learning problem, some of the initial steps are data exploration and analysis, followed by feature selection. Below are the modules for these tasks.\n\n### 1) Data Exploration\nAt this stage, we explore variables one by one using **Uni-variate Analysis**, which depends on whether the variable type is categorical or continuous. To deal with this we have the **explore** module.\n\n## \u003e\u003e\u003e explore module\n\tvisualize_ML.explore.plot(data_input,categorical_name=[],drop=[],PLOT_COLUMNS_SIZE=4,bin_size=20,\n\tbar_width=0.2,wspace=0.5,hspace=0.8)\n**Continuous Variables** : In case of continuous variables it plots the *Histogram* for every variable and gives descriptive statistics for them.\n\n**Categorical Variables** : In case of categorical variables with 2 or more classes it plots the *Bar chart* for every variable and gives 
descriptive statistics for them.\n\nParameters | Type | Description\n-------------------- | -------------|------------------------------------------------------------------------\ndata_input | DataFrame | The input DataFrame with all data. (Right now the input can only be a DataFrame.)\ncategorical_name | list (default=[ ]) | Names of all categorical variable columns with more than 2 classes, to distinguish them from the continuous variables. An empty list implies that there are no categorical features with more than 2 classes.\ndrop | list (default=[ ]) | Names of columns to be dropped.\nPLOT_COLUMNS_SIZE | int (default=4) | Number of plots to display vertically in the display window. The row size is adjusted accordingly.\nbin_size | int (default=\"auto\") | Number of bins for the histogram displayed in the categorical vs categorical category.\nwspace | float32 (default = 0.5) | Horizontal padding between subplots in the display window.\nhspace | float32 (default = 0.8) | Vertical padding between subplots in the display window.\n\n\n**Code Snippet**\n```python\n# The data set is taken from the famous Titanic data (Kaggle)\n\nimport pandas as pd\nfrom visualize_ML import explore\n\ndf = pd.read_csv(\"dataset/train.csv\")\nexplore.plot(df, [\"Survived\", \"Pclass\", \"Sex\", \"SibSp\", \"Ticket\", \"Embarked\"], drop=[\"PassengerId\", \"Name\"])\n```\n![Alt text](https://github.com/ayush1997/visualize_ML/blob/master/images/explore1.png?raw=true \"Optional Title\")\n\nSee the [dataset](https://www.kaggle.com/c/titanic/data).\n\n**Note:** While plotting, all rows with **NaN** values and all columns with **Character** values are removed (except if the values are True and False); only numeric data is plotted.\n\n### 2) Feature Selection\nThis is one of the more challenging tasks in an ML problem. Here we have to do **Bi-variate Analysis** to find out the relationship between two variables. 
Here, we look for association and disassociation between variables at a pre-defined significance level.\n\nThe **relation** module helps in visualizing the analysis done on various combinations of variables and in seeing the relation between them.\n\n## \u003e\u003e\u003e relation module\n\tvisualize_ML.relation.plot(data_input,target_name=\"\",categorical_name=[],drop=[],bin_size=10)\n\n**Continuous vs Continuous variables:** To do the Bi-variate analysis, *scatter plots* are made, as their pattern indicates the relationship between variables.\nTo indicate the strength of the relationship between them, we use the correlation between them.\n\nThe graph displays the correlation coefficient along with other information.\n\n\tCorrelation = Covariance(X,Y) / SQRT( Var(X)*Var(Y))\n\n* -1: perfect negative linear correlation\n* +1: perfect positive linear correlation\n* 0: no correlation\n\n**Categorical vs Categorical variables**: *Stacked Column Charts* are made to visualize the relation. The **Chi square test** is used to derive the statistical significance of the relationship between the variables. It returns the *probability* for the computed chi-square distribution with the degrees of freedom. For more information on the Chi Test see [this](http://www.stat.yale.edu/Courses/1997-98/101/chisq.htm)\n\nProbability of 0: It indicates that both categorical variables are dependent.\n\nProbability of 1: It shows that both variables are independent.\n\nThe graph displays the *p_value* along with other information. If it is less than **0.05**, it states that the variables are dependent.\n\n**Categorical vs Continuous variables:** To explore the relation between categorical and continuous variables, box plots are drawn at each level of the categorical variables. If levels are small in number, it will not show the statistical significance.\nThe **ANOVA test** is used to derive the statistical significance of the relationship between the variables.\n\nThe graph displays the *p_value* along with other information. 
If it is less than **0.05**, it states that the variables are dependent.\n\nFor more information on the ANOVA test see [this](https://onlinecourses.science.psu.edu/stat200/book/export/html/66)\n\nParameters | Type | Description\n-------------------- | -------------|--------------------------------------------------------------------\ndata_input | DataFrame | The input DataFrame with all data. (Right now the input can only be a DataFrame.)\ntarget_name | String | The name of the target column.\ncategorical_name | list (default=[ ]) | Names of all categorical variable columns with more than 2 classes, to distinguish them from the continuous variables. An empty list implies that there are no categorical features with more than 2 classes.\ndrop | list (default=[ ]) | Names of columns to be dropped.\nPLOT_COLUMNS_SIZE | int (default=4) | Number of plots to display vertically in the display window. The row size is adjusted accordingly.\nbin_size | int (default=\"auto\") | Number of bins for the histogram displayed in the categorical vs categorical category.\nwspace | float32 (default = 0.5) | Horizontal padding between subplots in the display window.\nhspace | float32 (default = 0.8) | Vertical padding between subplots in the display window.\n\n**Code Snippet**\n```python\n# The data set is taken from the famous Titanic data (Kaggle)\nimport pandas as pd\nfrom visualize_ML import relation\n\ndf = pd.read_csv(\"dataset/train.csv\")\nrelation.plot(df, \"Survived\", [\"Survived\", \"Pclass\", \"Sex\", \"SibSp\", \"Ticket\", \"Embarked\"], drop=[\"PassengerId\", \"Name\"], bin_size=10)\n\n```\n\n![Alt text](https://github.com/ayush1997/visualize_ML/blob/master/images/relation1.png?raw=true \"Optional Title\")\n\nSee the [dataset](https://www.kaggle.com/c/titanic/data).\n\n**Note:** While plotting, all rows with **NaN** values and all columns with **non-numeric** values are removed; only numeric data is plotted. Only a categorical target variable with string values is allowed.\n\n## Contribute\nIf you 
want to contribute and add new features, feel free to send a pull request [here](https://github.com/ayush1997/visualize_ML).\n\nThis project is still under development, so to report any bugs or request new features, head over to the Issues page.\n\n## Tasks To Do\n- [ ] Make input compatible with other formats like NumPy.\n- [ ] Visualize best fit lines and decision boundaries for various models to make the **Parameter Tuning** task easier.\n\n\tand many others!\n\n## Licence\nLicensed under [The MIT License (MIT)](https://github.com/ayush1997/visualize_ML/blob/master/LICENSE.txt).\n\n## Copyright\nayush1997(c) 2016\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fayush1997%2Fvisualize_ML","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fayush1997%2Fvisualize_ML","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fayush1997%2Fvisualize_ML/lists"}