{"id":13473733,"url":"https://github.com/eugeneyan/testing-ml","last_synced_at":"2025-04-09T16:20:09.277Z","repository":{"id":54604645,"uuid":"291571705","full_name":"eugeneyan/testing-ml","owner":"eugeneyan","description":"🔍 Minimal examples of machine learning tests for implementation, behaviour, and performance.","archived":false,"fork":false,"pushed_at":"2022-09-21T15:29:15.000Z","size":174,"stargazers_count":262,"open_issues_count":1,"forks_count":50,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-09T16:20:05.634Z","etag":null,"topics":["machine-learning","model-evaluation","testing"],"latest_commit_sha":null,"homepage":"https://eugeneyan.com/writing/testing-ml/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eugeneyan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-08-30T23:46:54.000Z","updated_at":"2025-03-22T22:00:22.000Z","dependencies_parsed_at":"2022-08-13T21:10:33.345Z","dependency_job_id":null,"html_url":"https://github.com/eugeneyan/testing-ml","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eugeneyan%2Ftesting-ml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eugeneyan%2Ftesting-ml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eugeneyan%2Ftesting-ml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eugeneyan%2Ftesting-ml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eugeneyan","download_url":"https://codeloa
d.github.com/eugeneyan/testing-ml/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248065281,"owners_count":21041872,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","model-evaluation","testing"],"created_at":"2024-07-31T16:01:06.316Z","updated_at":"2025-04-09T16:20:09.254Z","avatar_url":"https://github.com/eugeneyan.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# testing-ml\n\n![Tests](https://github.com/eugeneyan/testing-ml/workflows/Tests/badge.svg?branch=master) [![codecov](https://codecov.io/gh/eugeneyan/testing-ml/branch/master/graph/badge.svg)](https://codecov.io/gh/eugeneyan/testing-ml) [![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/eugeneyan/testing-ml/pulls)\n\nHow to test machine learning code. In this example, we'll test a `numpy` implementation of `DecisionTree` and `RandomForest` via:  \n- [Pre-train tests](#pre-train-tests-to-ensure-correct-implementation) to ensure correct implementation  \n- [Post-train tests](#post-train-tests-to-ensure-expected-learned-behaviour) to ensure expected learned behaviour  \n- [Evaluation](#evaluation-to-ensure-satisfactory-model-performance) to ensure satisfactory model performance \n\n![](https://raw.githubusercontent.com/eugeneyan/testing-ml/master/testing-ml-flow.png)\n\nAccompanying article: [How to Test Machine Learning Code and Systems](https://eugeneyan.com/writing/testing-ml/). 
Inspired by [@jeremyjordan](https://twitter.com/jeremyjordan)'s [Effective Testing for Machine Learning Systems](https://www.jeremyjordan.me/testing-ml/).\n\n## Quick Start\n```\n# Clone and setup environment\ngit clone https://github.com/eugeneyan/testing-ml.git\ncd testing-ml\nmake setup\n\n# Run test suite\nmake check\n```\n\n## Standard software habits\n- Unit test [fixture reuse](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/fixtures.py), [exceptions testing](https://github.com/eugeneyan/testing-ml/blob/master/tests/data_prep/test_continuous.py#L44) with [`pytest`](https://docs.pytest.org/en/latest/)\n- [Code coverage](https://github.com/eugeneyan/testing-ml/blob/master/Makefile#L17) with [`Coverage.py`](https://coverage.readthedocs.io/en/coverage-5.2.1/) and [`pytest-cov`](https://pytest-cov.readthedocs.io/en/latest/)\n- [Linting](https://github.com/eugeneyan/testing-ml/blob/master/Makefile#L23) to ensure code consistency with [`pylint`](https://www.pylint.org)\n- [Type checks](https://github.com/eugeneyan/testing-ml/blob/master/Makefile#L20) to verify type correctness with [`mypy`](http://mypy-lang.org)\n\nMore details here: [How to Set Up a Python Project For Automation and Collaboration](https://eugeneyan.com/writing/setting-up-python-project-for-automation-and-collaboration/) ([GitHub repo](https://github.com/eugeneyan/python-collab-template))\n\n\n## Pre-train tests to ensure correct implementation\n\n- Test implementation of [Gini impurity](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_decision_tree_1pre.py#L8) and [gain](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_decision_tree_1pre.py#L17)\n\n```\ndef test_gini_impurity():\n    assert round(gini_impurity([1, 1, 1, 1, 1, 1, 1, 1]), 3) == 0\n    assert round(gini_impurity([1, 1, 1, 1, 1, 1, 0, 0]), 3) == 0.375\n    assert round(gini_impurity([1, 1, 1, 1, 0, 0, 0, 0]), 3) == 0.500\n\n\ndef test_gini_gain():\n    assert round(gini_gain([1, 1, 
1, 1, 0, 0, 0, 0], [[1, 1, 1, 1], [0, 0, 0, 0]]), 3) == 0.5\n    assert round(gini_gain([1, 1, 1, 1, 0, 0, 0, 0], [[1, 1, 1, 0], [0, 0, 0, 1]]), 3) == 0.125\n    assert round(gini_gain([1, 1, 1, 1, 0, 0, 0, 0], [[1, 1, 0, 0], [0, 0, 1, 1]]), 3) == 0.0\n```\n\n- Test [output shape](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_decision_tree_1pre.py#L25)\n\n```\ndef test_dt_output_shape(dummy_titanic):\n    X_train, y_train, X_test, y_test = dummy_titanic\n    dt = DecisionTree()\n    dt.fit(X_train, y_train)\n    pred_train, pred_test = dt.predict(X_train), dt.predict(X_test)\n\n    assert pred_train.shape == (X_train.shape[0],), 'DecisionTree output should be same as training labels.'\n    assert pred_test.shape == (X_test.shape[0],), 'DecisionTree output should be same as testing labels.'\n```\n\n- Test [data leak](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_decision_tree_1pre.py#L103) between train and test set\n\n```\ndef test_data_leak_in_test_data(dummy_titanic_df):\n    train, test = dummy_titanic_df\n\n    concat_df = pd.concat([train, test])\n    concat_df.drop_duplicates(inplace=True)\n\n    assert concat_df.shape[0] == train.shape[0] + test.shape[0]\n```\n\n- Test [output range](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_decision_tree_1pre.py#L44)\n\n```\ndef test_dt_output_range(dummy_titanic):\n    X_train, y_train, X_test, y_test = dummy_titanic\n    dt = DecisionTree()\n    dt.fit(X_train, y_train)\n    pred_train, pred_test = dt.predict(X_train), dt.predict(X_test)\n\n    assert (pred_train \u003c= 1).all() \u0026 (pred_train \u003e= 0).all(), 'Decision tree output should range from 0 to 1 inclusive'\n    assert (pred_test \u003c= 1).all() \u0026 (pred_test \u003e= 0).all(), 'Decision tree output should range from 0 to 1 inclusive'\n```\n\n- Test model able to [overfit on perfectly separable 
data](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_decision_tree_1pre.py#L63)\n\n```\ndef test_dt_overfit(dummy_feats_and_labels):\n    feats, labels = dummy_feats_and_labels\n    dt = DecisionTree()\n    dt.fit(feats, labels)\n    pred = np.round(dt.predict(feats))\n\n    assert np.array_equal(labels, pred), 'DecisionTree should fit data perfectly and prediction should == labels.'\n```\n\n- Test additional tree depth [increases training accuracy and AUC ROC](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_decision_tree_1pre.py#L85)\n\n```\ndef test_dt_increase_acc(dummy_titanic):\n    X_train, y_train, _, _ = dummy_titanic\n\n    acc_list, auc_list = [], []\n    for depth in range(1, 10):\n        dt = DecisionTree(depth_limit=depth)\n        dt.fit(X_train, y_train)\n        pred = dt.predict(X_train)\n        pred_binary = np.round(pred)\n        acc_list.append(accuracy_score(y_train, pred_binary))\n        auc_list.append(roc_auc_score(y_train, pred))\n\n    assert sorted(acc_list) == acc_list, 'Accuracy should increase as tree depth increases.'\n    assert sorted(auc_list) == auc_list, 'AUC ROC should increase as tree depth increases.'\n```\n\n- Test additional trees in `RandomForest` [improves validation accuracy and AUC ROC](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_random_forest_1pre.py#L18)\n\n```\ndef test_dt_increase_acc(dummy_titanic):\n    X_train, y_train, X_test, y_test = dummy_titanic\n\n    acc_list, auc_list = [], []\n    for num_trees in [1, 3, 7, 15]:\n        rf = RandomForest(num_trees=num_trees, depth_limit=7, col_subsampling=0.7, row_subsampling=0.7)\n        rf.fit(X_train, y_train)\n        pred = rf.predict(X_test)\n        pred_binary = np.round(pred)\n        acc_list.append(accuracy_score(y_test, pred_binary))\n        auc_list.append(roc_auc_score(y_test, pred))\n\n    assert sorted(acc_list) == acc_list, 'Accuracy should increase as number of trees increases.'\n    
assert sorted(auc_list) == auc_list, 'AUC ROC should increase as number of trees increases.'\n```\n\n- Test `RandomForest` [outperforms](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_random_forest_1pre.py#L36) `DecisionTree` given the same tree depth\n\n```\ndef test_rf_better_than_dt(dummy_titanic):\n    X_train, y_train, X_test, y_test = dummy_titanic\n\n    dt = DecisionTree(depth_limit=10)\n    dt.fit(X_train, y_train)\n\n    rf = RandomForest(depth_limit=10, num_trees=7, col_subsampling=0.8, row_subsampling=0.8)\n    rf.fit(X_train, y_train)\n\n    pred_test_dt = dt.predict(X_test)\n    pred_test_binary_dt = np.round(pred_test_dt)\n    acc_test_dt = accuracy_score(y_test, pred_test_binary_dt)\n    auc_test_dt = roc_auc_score(y_test, pred_test_dt)\n\n    pred_test_rf = rf.predict(X_test)\n    pred_test_binary_rf = np.round(pred_test_rf)\n    acc_test_rf = accuracy_score(y_test, pred_test_binary_rf)\n    auc_test_rf = roc_auc_score(y_test, pred_test_rf)\n\n    assert acc_test_rf \u003e acc_test_dt, 'RandomForest should have higher accuracy than DecisionTree on test set.'\n    assert auc_test_rf \u003e auc_test_dt, 'RandomForest should have higher AUC ROC than DecisionTree on test set.'\n```\n\n## Post-train tests to ensure expected learned behaviour\n- Test [invariance](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_decision_tree_2post.py#L8) (e.g., ticket number should not affect survival probability)\n\n```\ndef test_dt_invariance(dummy_titanic_dt, dummy_passengers):\n    model = dummy_titanic_dt\n    _, p2 = dummy_passengers\n\n    # Get original survival probability of passenger 2\n    test_df = pd.DataFrame.from_dict([p2], orient='columns')\n    X, y = get_feats_and_labels(prep_df(test_df))\n    p2_prob = model.predict(X)[0]  # 1.0\n\n    # Change ticket number from 'PC 17599' to 'A/5 21171'\n    p2_ticket = p2.copy()\n    p2_ticket['ticket'] = 'A/5 21171'\n    test_df = pd.DataFrame.from_dict([p2_ticket], 
orient='columns')\n    X, y = get_feats_and_labels(prep_df(test_df))\n    p2_ticket_prob = model.predict(X)[0]  # 1.0\n\n    assert p2_prob == p2_ticket_prob\n```\n\n- Test [directional expectation](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_decision_tree_2post.py#L69) (e.g., females should have higher survival probability than males)\n\n```\ndef test_dt_directional_expectation(dummy_titanic_dt, dummy_passengers):\n    model = dummy_titanic_dt\n    _, p2 = dummy_passengers\n\n    # Get original survival probability of passenger 2\n    test_df = pd.DataFrame.from_dict([p2], orient='columns')\n    X, y = get_feats_and_labels(prep_df(test_df))\n    p2_prob = model.predict(X)[0]  # 1.0\n\n    # Change gender from female to male\n    p2_male = p2.copy()\n    p2_male['Name'] = ' Mr. John'\n    p2_male['Sex'] = 'male'\n    test_df = pd.DataFrame.from_dict([p2_male], orient='columns')\n    X, y = get_feats_and_labels(prep_df(test_df))\n    p2_male_prob = model.predict(X)[0]  # 0.56\n\n    # Change class from 1 to 3\n    p2_class = p2.copy()\n    p2_class['Pclass'] = 3\n    test_df = pd.DataFrame.from_dict([p2_class], orient='columns')\n    X, y = get_feats_and_labels(prep_df(test_df))\n    p2_class_prob = model.predict(X)[0]  # 0.0\n\n    assert p2_prob \u003e p2_male_prob, 'Changing gender from female to male should decrease survival probability.'\n    assert p2_prob \u003e p2_class_prob, 'Changing class from 1 to 3 should decrease survival probability.'\n```\n\t\n## Evaluation to ensure satisfactory model performance\n\n- Evaluation on [accuracy and AUC ROC](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_decision_tree_3eval.py#L10)\n\n```\ndef test_dt_evaluation(dummy_titanic_dt, dummy_titanic):\n    model = dummy_titanic_dt\n    X_train, y_train, X_test, y_test = dummy_titanic\n    pred_test = model.predict(X_test)\n    pred_test_binary = np.round(pred_test)\n    acc_test = accuracy_score(y_test, pred_test_binary)\n    
auc_test = roc_auc_score(y_test, pred_test)\n\n    assert acc_test \u003e 0.82, 'Accuracy on test should be \u003e 0.82'\n    assert auc_test \u003e 0.84, 'AUC ROC on test should be \u003e 0.84'\n```\n\n- Evaluation on [training](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_decision_tree_3eval.py#L31) and [inference](https://github.com/eugeneyan/testing-ml/blob/master/tests/tree/test_decision_tree_3eval.py#L41) times\n\n```\ndef test_dt_training_time(dummy_titanic):\n    X_train, y_train, X_test, y_test = dummy_titanic\n\n    # Standardize to use depth = 10\n    dt = DecisionTree(depth_limit=10)\n    latency_array = np.array([train_with_time(dt, X_train, y_train)[1] for i in range(100)])\n    time_p95 = np.quantile(latency_array, 0.95)\n    assert time_p95 \u003c 1.0, 'Training time at 95th percentile should be \u003c 1.0 sec'\n\n\ndef test_dt_serving_latency(dummy_titanic):\n    X_train, y_train, X_test, y_test = dummy_titanic\n\n    # Standardize to use depth = 10\n    dt = DecisionTree(depth_limit=10)\n    dt.fit(X_train, y_train)\n\n    latency_array = np.array([predict_with_time(dt, X_test)[1] for i in range(500)])\n    latency_p99 = np.quantile(latency_array, 0.99)\n    assert latency_p99 \u003c 0.004, 'Serving latency at 99th percentile should be \u003c 0.004 sec'\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feugeneyan%2Ftesting-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feugeneyan%2Ftesting-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feugeneyan%2Ftesting-ml/lists"}