{"id":14958348,"url":"https://github.com/jbeno/datawaza","last_synced_at":"2025-07-24T05:32:30.576Z","repository":{"id":218327580,"uuid":"681079267","full_name":"jbeno/datawaza","owner":"jbeno","description":"Data science tools for exploration, visualization, and model iteration.","archived":false,"fork":false,"pushed_at":"2024-06-02T01:42:56.000Z","size":37179,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-20T20:16:42.648Z","etag":null,"topics":["data-science","dataviz","machine-learning","matplotlib","pandas","scikit-learn","seaborn"],"latest_commit_sha":null,"homepage":"http://datawaza.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jbeno.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-21T08:08:44.000Z","updated_at":"2024-08-27T06:04:48.000Z","dependencies_parsed_at":"2024-03-20T01:30:41.890Z","dependency_job_id":"d2100be5-cb40-497a-b1ff-93f3aacd550c","html_url":"https://github.com/jbeno/datawaza","commit_stats":{"total_commits":45,"total_committers":1,"mean_commits":45.0,"dds":0.0,"last_synced_commit":"0e2931dd66ea5973b7f3b1a24da82281d8be2f40"},"previous_names":["jbeno/datawaza"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/jbeno/datawaza","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jbeno%2Fdatawaza","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jbeno%2Fdatawaza/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/jbeno%2Fdatawaza/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jbeno%2Fdatawaza/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jbeno","download_url":"https://codeload.github.com/jbeno/datawaza/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jbeno%2Fdatawaza/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266796832,"owners_count":23985483,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-24T02:00:09.469Z","response_time":99,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","dataviz","machine-learning","matplotlib","pandas","scikit-learn","seaborn"],"created_at":"2024-09-24T13:16:49.150Z","updated_at":"2025-07-24T05:32:30.555Z","avatar_url":"https://github.com/jbeno.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cbr /\u003e\n\u003cimg src=\"https://www.datawaza.com/en/latest/_static/datawaza_logo_name_trans.svg\" alt=\"datawaza_logo_name_trans.svg\" width=\"300\"/\u003e\n\n--------------------------------------\n[![PyPI Version](https://img.shields.io/pypi/v/datawaza)](https://pypi.org/project/datawaza/)\n[![License](https://img.shields.io/github/license/jbeno/datawaza)](https://github.com/jbeno/datawaza/blob/main/LICENSE)\n[![Last 
Commit](https://img.shields.io/github/last-commit/jbeno/datawaza)](https://github.com/jbeno/datawaza)\n[![Documentation Status](https://readthedocs.org/projects/datawaza/badge/?version=latest)](https://www.datawaza.com/en/latest/?badge=latest)\n[![Coverage Status](https://coveralls.io/repos/github/jbeno/datawaza/badge.svg?branch=main)](https://coveralls.io/github/jbeno/datawaza?branch=main)\n[![Python Version](https://img.shields.io/pypi/pyversions/datawaza)]()\n\nDatawaza streamlines common Data Science tasks. It's a collection of tools for data exploration, visualization, data cleaning, pipeline creation, hyper-parameter searching, model iteration, and evaluation. It builds upon core libraries like [Pandas](https://pandas.pydata.org/), [Matplotlib](https://matplotlib.org/), [Seaborn](https://seaborn.pydata.org/), and [Scikit-Learn](https://scikit-learn.org/stable/).\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://www.datawaza.com/en/latest/explore.html#datawaza.explore.plot_charts\"\u003e\u003cimg src=\"https://www.datawaza.com/en/latest/_static/plot_charts.png\" width=\"30%\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.datawaza.com/en/latest/explore.html#datawaza.explore.plot_map_ca\"\u003e\u003cimg src=\"https://www.datawaza.com/en/latest/_static/plot_map_ca.png\" width=\"30%\" style=\"margin:0 1%;\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.datawaza.com/en/latest/model.html#datawaza.model.compare_models\"\u003e\u003cimg src=\"https://www.datawaza.com/en/latest/_static/compare_models_2.png\" width=\"30%\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://www.datawaza.com/en/latest/explore.html#datawaza.explore.plot_corr\"\u003e\u003cimg src=\"https://www.datawaza.com/en/latest/_static/plot_corr.png\" width=\"30%\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.datawaza.com/en/latest/model.html#datawaza.model.plot_train_history\"\u003e\u003cimg 
src=\"https://www.datawaza.com/en/latest/_static/plot_train_history.png\" width=\"30%\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.datawaza.com/en/latest/model.html#datawaza.model.iterate_model\"\u003e\u003cimg src=\"https://www.datawaza.com/en/latest/_static/iterate_model_1.png\" width=\"30%\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://www.datawaza.com/en/latest/model.html#datawaza.model.iterate_model\"\u003e\u003cimg src=\"https://www.datawaza.com/en/latest/_static/iterate_model_2.png\" width=\"30%\" style=\"margin:0 1%;\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.datawaza.com/en/latest/model.html#datawaza.model.plot_results\"\u003e\u003cimg src=\"https://www.datawaza.com/en/latest/_static/plot_results.png\" width=\"30%\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.datawaza.com/en/latest/explore.html#datawaza.explore.print_ascii_image\"\u003e\u003cimg src=\"https://www.datawaza.com/en/latest/_static/print_ascii_image.png\" width=\"30%\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nInstallation\n------------\n\nThe latest release can be found on [PyPI](https://pypi.org/project/datawaza/). Install Datawaza with pip:\n\n    pip install datawaza\n\nSee the [Change Log](CHANGELOG.md) for a history of changes.\n\nDependencies\n------------\n\nDatawaza supports Python 3.9 through 3.12. Python 3.8 is not supported because Cartopy, a dependency of `plot_map_ca`, does not support it.\n\nInstallation requires NumPy, Pandas, Matplotlib, Seaborn, Plotly, Scikit-Learn, SciPy, Cartopy, GeoPandas, StatsModels, TensorFlow, Keras, SciKeras (if using KerasClassifier as a model), PyTorch, and a few other supporting packages. 
See [requirements.txt](https://github.com/jbeno/datawaza/blob/main/requirements.txt) for the full list.\n\nDocumentation\n-------------\n\nOnline documentation is available at [Datawaza.com](https://datawaza.com).\n\nThe [User Guide](https://www.datawaza.com/en/latest/userguide.html) is a Jupyter notebook that walks through how to use the Datawaza functions. It's probably the best place to start. There is also an API reference for the major modules: [Clean](https://www.datawaza.com/en/latest/clean.html), [Explore](https://www.datawaza.com/en/latest/explore.html), [Model](https://www.datawaza.com/en/latest/model.html), and [Tools](https://www.datawaza.com/en/latest/tools.html).\n\nDevelopment\n-----------\n\nThe [Datawaza repo](https://github.com/jbeno/datawaza) is on GitHub.\n\nPlease report any bugs you encounter to the [Issue Tracker](https://github.com/jbeno/datawaza/issues). Contributions and ideas for enhancements are welcome!\n\nWhat is Waza?\n-------------\n\nWaza (技) means \"technique\" in Japanese. In martial arts like Aikido, it is combined with other words to form terms like \"suwari-waza\" (sitting techniques) or \"kaeshi-waza\" (reversal techniques). So we've paired it with \"data\" to represent Data Science techniques: データ技 \"data-waza\".\n\nOrigin Story\n-------------\n\nMost of these functions were created while I was pursuing a [Professional Certificate](https://em-executive.berkeley.edu/professional-certificate-machine-learning-artificial-intelligence) in Machine Learning \u0026 Artificial Intelligence from U.C. Berkeley. With each assignment, I tried to simplify repetitive tasks and streamline my workflow. These functions served me well at the time, so perhaps they will be of value to others.\n\nQuick Start\n-----------\n\nThe [User Guide](https://www.datawaza.com/en/latest/userguide.html) will show you how to use Datawaza's functions in depth. 
Assuming you already have data loaded, here are some examples of what it can do:\n\n    \u003e\u003e\u003e import datawaza as dw\n    \nShow the unique values of each variable below the threshold of n = 12:\n\n    \u003e\u003e\u003e dw.get_unique(df, 12, count=True, percent=True)\n\n    CATEGORICAL: Variables with unique values equal to or below: 12\n    \n    job has 12 unique values:\n    \n        admin.              10422   25.3%\n        blue-collar         9254    22.47%\n        technician          6743    16.37%\n        services            3969    9.64%\n        management          2924    7.1%\n        retired             1720    4.18%\n        entrepreneur        1456    3.54%\n        self-employed       1421    3.45%\n        housemaid           1060    2.57%\n        unemployed          1014    2.46%\n        student             875     2.12%\n        unknown             330     0.8%\n    \n    marital has 4 unique values:\n    \n        married        24928   60.52%\n        single         11568   28.09%\n        divorced       4612    11.2%\n        unknown        80      0.19%\n\nPlot bar charts of categorical variables:\n\n    \u003e\u003e\u003e dw.plot_charts(df, plot_type='cat', cat_cols=cat_columns, rotation=90)\n\n![plot_charts output](https://www.datawaza.com/en/latest/_static/plot_charts_output.png)\n\nGet the top positive and negative correlations with the target variable, and save to lists:\n\n    \u003e\u003e\u003e pos_features, neg_features = dw.get_corr(df_enc, n=10, var='subscribed_enc', return_arrays=True)\n\n    Top 10 positive correlations:\n                  Variable 1      Variable 2  Correlation\n    0               duration  subscribed_enc         0.41\n    1       poutcome_success  subscribed_enc         0.32\n    2   previously_contacted  subscribed_enc         0.32\n    3                  pdays  subscribed_enc         0.27\n    4               previous  subscribed_enc         0.23\n    5              month_mar  subscribed_enc 
        0.14\n    6              month_oct  subscribed_enc         0.14\n    7              month_sep  subscribed_enc         0.12\n    8           no_default_1  subscribed_enc         0.10\n    9            job_student  subscribed_enc         0.09\n    \n    Top 10 negative correlations:\n                  Variable 1      Variable 2  Correlation\n    0            nr.employed  subscribed_enc        -0.35\n    1              euribor3m  subscribed_enc        -0.31\n    2           emp.var.rate  subscribed_enc        -0.30\n    3   poutcome_nonexistent  subscribed_enc        -0.19\n    4      contact_telephone  subscribed_enc        -0.14\n    5         cons.price.idx  subscribed_enc        -0.14\n    6              month_may  subscribed_enc        -0.11\n    7               campaign  subscribed_enc        -0.07\n    8        job_blue-collar  subscribed_enc        -0.07\n    9     education_basic.9y  subscribed_enc        -0.05\n\nPlot a chart showing the top correlations with the target variable:\n\n    \u003e\u003e\u003e dw.plot_corr(df_enc, 'subscribed_enc', n=16, size=(12,6), rotation=90)\n\n![plot_corr output](https://www.datawaza.com/en/latest/_static/plot_corr_output.png)\n\nRun a regression model iteration, which dynamically assembles a pipeline and evaluates the model, including\ncharts of residuals, predicted vs. actual, and coefficients:\n\n    \u003e\u003e\u003e results_df, iteration_6 = dw.iterate_model(X2_train, X2_test, y2_train, y2_test,\n    ...     transformers=['ohe', 'log', 'poly3'], model='linreg',\n    ...     iteration='6', note='X2. Test size: 0.25, Pipeline: OHE \u003e Log \u003e Poly3 \u003e LinReg',\n    ...     plot=True, lowess=True, coef=True, perm=True, vif=True, decimal=2,\n    ...     
save=True, save_df=results_df, config=my_config)\n\n![iterate_model output 1 of 3](https://www.datawaza.com/en/latest/_static/iterate_model_output_1.png)\n![iterate_model output 2 of 3](https://www.datawaza.com/en/latest/_static/iterate_model_output_2.png)\n![iterate_model output 3 of 3](https://www.datawaza.com/en/latest/_static/iterate_model_output_3.png)\n\nCompare train/test scores across model iterations, and select the best result:\n\n    \u003e\u003e\u003e dw.plot_results(results_df, metrics=['Train MAE', 'Test MAE'], y_label='Mean Absolute Error',\n    ...     select_metric='Test MAE', select_criteria='min', decimal=0)\n\n![plot_results output](https://www.datawaza.com/en/latest/_static/plot_results_output.png)\n\nDefine a configuration file to compare multiple binary classification models:\n\n    \u003e\u003e\u003e # Set some variables referenced in the config\n    \u003e\u003e\u003e random_state = 42\n    \u003e\u003e\u003e class_weight = None\n    \u003e\u003e\u003e max_iter = 10000\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e # Set column lists referenced in the config\n    \u003e\u003e\u003e num_columns = list(X.columns)\n    \u003e\u003e\u003e cat_columns = []\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e # Create a custom configuration file with 3 models and grid search params\n    \u003e\u003e\u003e my_config = {\n    ...     'models' : {\n    ...         'logreg': LogisticRegression(max_iter=max_iter,\n    ...                   random_state=random_state, class_weight=class_weight),\n    ...         'knn_class': KNeighborsClassifier(),\n    ...         'tree_class': DecisionTreeClassifier(random_state=random_state,\n    ...                       class_weight=class_weight)\n    ...     },\n    ...     'imputers': {\n    ...         'simple_imputer': SimpleImputer()\n    ...     },\n    ...     'transformers': {\n    ...         'ohe': (OneHotEncoder(drop='if_binary', handle_unknown='ignore'),\n    ...                     cat_columns)\n    ...    
 },\n    ...     'scalers': {\n    ...         'stand': StandardScaler()\n    ...     },\n    ...     'selectors': {\n    ...         'sfs_logreg': SequentialFeatureSelector(LogisticRegression(\n    ...                       max_iter=max_iter, random_state=random_state,\n    ...                       class_weight=class_weight))\n    ...     },\n    ...     'params' : {\n    ...         'logreg': {\n    ...             'logreg__C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],\n    ...             'logreg__solver': ['newton-cg', 'lbfgs', 'saga']\n    ...         },\n    ...         'knn_class': {\n    ...             'knn_class__n_neighbors': [3, 5, 10, 15, 20, 25],\n    ...             'knn_class__weights': ['uniform', 'distance'],\n    ...             'knn_class__metric': ['euclidean', 'manhattan']\n    ...         },\n    ...         'tree_class': {\n    ...             'tree_class__max_depth': [3, 5, 7],\n    ...             'tree_class__min_samples_split': [5, 10, 15],\n    ...             'tree_class__criterion': ['gini', 'entropy'],\n    ...             'tree_class__min_samples_leaf': [2, 4, 6]\n    ...         },\n    ...     },\n    ...     'cv': {\n    ...         'kfold_5': KFold(n_splits=5, shuffle=True, random_state=42)\n    ...     },\n    ...     'no_scale': ['tree_class'],\n    ...     'no_poly': ['knn_class', 'tree_class']\n    ... }\n\nRun a binary classification on 7 models, dynamically assembling the pipeline and\nperforming a grid search of the hyper-parameters, all based on the configuration\nfile defined above:\n\n    \u003e\u003e\u003e results_df = compare_models(\n    ...\n    ...     # Data split and sampling\n    ...     x=X, y=y, test_size=0.25, stratify=None, under_sample=None,\n    ...     over_sample=None, svm_knn_resample=None,\n    ...\n    ...     # Models and pipeline steps\n    ...     imputer=None, transformers=None, scaler='stand', selector=None,\n    ...     models=['logreg', 'knn_class', 'svm_proba', 'tree_class',\n    ...     
'forest_class', 'xgb_class', 'keras_class'], svm_proba=True,\n    ...\n    ...     # Grid search\n    ...     search_type='random', scorer='accuracy', grid_cv='kfold_5', verbose=1,\n    ...\n    ...     # Model evaluation and charts\n    ...     model_eval=True, plot_perf=True, plot_curve=True, fig_size=(12,6),\n    ...     legend_loc='lower left', rotation=45, threshold=0.5,\n    ...     class_map=class_map, pos_label=1, title='Breast Cancer',\n    ...\n    ...     # Config, preferences and notes\n    ...     config=my_config, class_weight=None, random_state=42, decimal=4,\n    ...     n_jobs=None, notes='Test Size=0.25, Threshold=0.50'\n    ... )\n\n![compare_models output 1 of 5](https://www.datawaza.com/en/latest/_static/compare_models_output_1.png)\n![compare_models output 2 of 5](https://www.datawaza.com/en/latest/_static/compare_models_output_2.png)\n![compare_models output 3 of 5](https://www.datawaza.com/en/latest/_static/compare_models_output_3.png)\n![compare_models output 4 of 5](https://www.datawaza.com/en/latest/_static/compare_models_output_4.png)\n![compare_models output 5 of 5](https://www.datawaza.com/en/latest/_static/compare_models_output_5.png)\n\nThis is just a sample of Datawaza's tools. Download [userguide.ipynb](https://github.com/jbeno/datawaza/blob/main/docs/userguide.ipynb) and explore the full breadth of the library in your Jupyter environment.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjbeno%2Fdatawaza","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjbeno%2Fdatawaza","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjbeno%2Fdatawaza/lists"}