{"id":15646054,"url":"https://github.com/gokumohandas/testing-ml","last_synced_at":"2025-06-14T11:35:39.003Z","repository":{"id":54723810,"uuid":"520281400","full_name":"GokuMohandas/testing-ml","owner":"GokuMohandas","description":"Learn how to create reliable ML systems by testing code, data and models.","archived":false,"fork":false,"pushed_at":"2022-09-12T11:58:43.000Z","size":29,"stargazers_count":86,"open_issues_count":0,"forks_count":13,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-30T07:36:10.724Z","etag":null,"topics":["great-expectations","machine-learning","mlops","pytest","testing"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GokuMohandas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-08-01T22:41:32.000Z","updated_at":"2025-01-18T07:33:14.000Z","dependencies_parsed_at":"2022-08-14T00:50:38.917Z","dependency_job_id":null,"html_url":"https://github.com/GokuMohandas/testing-ml","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/GokuMohandas/testing-ml","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GokuMohandas%2Ftesting-ml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GokuMohandas%2Ftesting-ml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GokuMohandas%2Ftesting-ml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GokuMohandas%2Ftesting-ml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GokuMohandas
","download_url":"https://codeload.github.com/GokuMohandas/testing-ml/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GokuMohandas%2Ftesting-ml/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259809769,"owners_count":22914936,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["great-expectations","machine-learning","mlops","pytest","testing"],"created_at":"2024-10-03T12:11:10.955Z","updated_at":"2025-06-14T11:35:38.935Z","avatar_url":"https://github.com/GokuMohandas.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Testing ML\n\nLearn how to create reliable ML systems by testing code, data and models.\n\n\u003cdiv align=\"left\"\u003e\n    \u003ca target=\"_blank\" href=\"https://madewithml.com\"\u003e\u003cimg src=\"https://img.shields.io/badge/Subscribe-40K-brightgreen\"\u003e\u003c/a\u003e\u0026nbsp;\n    \u003ca target=\"_blank\" href=\"https://github.com/GokuMohandas/Made-With-ML\"\u003e\u003cimg src=\"https://img.shields.io/github/stars/GokuMohandas/Made-With-ML.svg?style=social\u0026label=Star\"\u003e\u003c/a\u003e\u0026nbsp;\n    \u003ca target=\"_blank\" href=\"https://www.linkedin.com/in/goku\"\u003e\u003cimg src=\"https://img.shields.io/badge/style--5eba00.svg?label=LinkedIn\u0026logo=linkedin\u0026style=social\"\u003e\u003c/a\u003e\u0026nbsp;\n    \u003ca target=\"_blank\" href=\"https://twitter.com/GokuMohandas\"\u003e\u003cimg 
src=\"https://img.shields.io/twitter/follow/GokuMohandas.svg?label=Follow\u0026style=social\"\u003e\u003c/a\u003e\n    \u003cbr\u003e\n\u003c/div\u003e\n\n\u003cbr\u003e\n\n👉 \u0026nbsp;This repository contains the [interactive notebook](https://colab.research.google.com/github/GokuMohandas/testing-ml/blob/main/testing.ipynb) that complements the [testing lesson](https://madewithml.com/courses/mlops/testing/), which is a part of the [MLOps course](https://github.com/GokuMohandas/mlops-course). If you haven't already, be sure to check out the [lesson](https://madewithml.com/courses/mlops/testing/) because all the concepts are covered extensively and tied to software engineering best practices for building ML systems.\n\n\u003cdiv align=\"left\"\u003e\n\u003ca target=\"_blank\" href=\"https://madewithml.com/courses/mlops/testing/\"\u003e\u003cimg src=\"https://img.shields.io/badge/📖 Read-lesson-9cf\"\u003e\u003c/a\u003e\u0026nbsp;\n\u003ca href=\"https://github.com/GokuMohandas/testing-ml/blob/main/testing.ipynb\" role=\"button\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=\u0026amp;message=View%20On%20GitHub\u0026amp;color=586069\u0026amp;logo=github\u0026amp;labelColor=2f363d\"\u003e\u003c/a\u003e\u0026nbsp;\n\u003ca href=\"https://colab.research.google.com/github/GokuMohandas/testing-ml/blob/main/testing.ipynb\"\u003e\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\n\u003cbr\u003e\n\n- [Data](#data)\n    - [Expectations](#expectations)\n    - [Production](#production)\n- [Models](#models)\n    - [Training](#training)\n    - [Behavioral](#behavioral)\n    - [Adversarial](#adversarial)\n    - [Inference](#inference)\n\n## Data\n\nTools such as [pytest](https://madewithml.com/courses/mlops/testing/#pytest) allow us to test the functions that interact with our data but not the validity of the data itself. 
We're going to use the [great expectations](https://github.com/great-expectations/great_expectations) library to create expectations as to what our data should look like in a standardized way.\n\n```bash\n!pip install great-expectations==0.15.15 -q\n```\n\n```python\nimport great_expectations as ge\nimport json\nimport pandas as pd\nfrom urllib.request import urlopen\n```\n\n```python\n# Load labeled projects\nprojects = pd.read_csv(\"https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/projects.csv\")\ntags = pd.read_csv(\"https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/tags.csv\")\ndf = ge.dataset.PandasDataset(pd.merge(projects, tags, on=\"id\"))\nprint (f\"{len(df)} projects\")\ndf.head(5)\n```\n\n\u003cdiv class=\"output_subarea output_html rendered_html output_result\" dir=\"auto\"\u003e\u003cdiv\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003eid\u003c/th\u003e\n      \u003cth\u003ecreated_on\u003c/th\u003e\n      \u003cth\u003etitle\u003c/th\u003e\n      \u003cth\u003edescription\u003c/th\u003e\n      \u003cth\u003etag\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003e6\u003c/td\u003e\n      \u003ctd\u003e2020-02-20 06:43:18\u003c/td\u003e\n      \u003ctd\u003eComparison between YOLO and RCNN on real world...\u003c/td\u003e\n      \u003ctd\u003eBringing theory to experiment is cool. 
We can ...\u003c/td\u003e\n      \u003ctd\u003ecomputer-vision\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003e7\u003c/td\u003e\n      \u003ctd\u003e2020-02-20 06:47:21\u003c/td\u003e\n      \u003ctd\u003eShow, Infer \u0026amp; Tell: Contextual Inference for C...\u003c/td\u003e\n      \u003ctd\u003eThe beauty of the work lies in the way it arch...\u003c/td\u003e\n      \u003ctd\u003ecomputer-vision\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003e9\u003c/td\u003e\n      \u003ctd\u003e2020-02-24 16:24:45\u003c/td\u003e\n      \u003ctd\u003eAwesome Graph Classification\u003c/td\u003e\n      \u003ctd\u003eA collection of important graph embedding, cla...\u003c/td\u003e\n      \u003ctd\u003egraph-learning\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003e15\u003c/td\u003e\n      \u003ctd\u003e2020-02-28 23:55:26\u003c/td\u003e\n      \u003ctd\u003eAwesome Monte Carlo Tree Search\u003c/td\u003e\n      \u003ctd\u003eA curated list of Monte Carlo tree search papers...\u003c/td\u003e\n      \u003ctd\u003ereinforcement-learning\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003e19\u003c/td\u003e\n      \u003ctd\u003e2020-03-03 13:54:31\u003c/td\u003e\n      \u003ctd\u003eDiffusion to Vector\u003c/td\u003e\n      \u003ctd\u003eReference implementation of Diffusion2Vec (Com...\u003c/td\u003e\n      \u003ctd\u003egraph-learning\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\u003c/div\u003e\n\n### Expectations\n\nWhen it comes to creating expectations as to what our data should look like, we want to think about our entire dataset and all the features (columns) within it.\n\n```python\n# Presence of specific features\ndf.expect_table_columns_to_match_ordered_list(\n    
column_list=[\"id\", \"created_on\", \"title\", \"description\", \"tag\"]\n)\n```\n\n```python\n# Unique combinations of features (detect data leaks!)\ndf.expect_compound_columns_to_be_unique(column_list=[\"title\", \"description\"])\n```\n\n```python\n# Missing values\ndf.expect_column_values_to_not_be_null(column=\"tag\")\n```\n\n```python\n# Unique values\ndf.expect_column_values_to_be_unique(column=\"id\")\n```\n\n```python\n# Type adherence\ndf.expect_column_values_to_be_of_type(column=\"title\", type_=\"str\")\n```\n\n```python\n# List (categorical) / range (continuous) of allowed values\ntags = [\"computer-vision\", \"graph-learning\", \"reinforcement-learning\",\n        \"natural-language-processing\", \"mlops\", \"time-series\"]\ndf.expect_column_values_to_be_in_set(column=\"tag\", value_set=tags)\n```\n\nThese are just a few of the different expectations that we can create. Be sure to explore all the [expectations](https://greatexpectations.io/expectations/), including [custom expectations](https://docs.greatexpectations.io/docs/guides/expectations/creating_custom_expectations/overview/). Here are some other popular expectations that don't pertain to our specific dataset but are widely applicable:\n\n- feature value relationships with other feature values → `expect_column_pair_values_a_to_be_greater_than_b`\n- row count (exact or range) of samples → `expect_table_row_count_to_be_between`\n- value statistics (mean, std, median, max, min, sum, etc.) 
→ `expect_column_mean_to_be_between`\n\n### Production\n\nThe advantage of using a library such as Great Expectations, as opposed to isolated assert statements, is that we can:\n\n- reduce redundant efforts for creating tests across data modalities\n- automatically create testing [checkpoints](https://madewithml.com/courses/mlops/testing#checkpoints) to execute as our dataset grows\n- automatically generate [documentation](https://madewithml.com/courses/mlops/testing#documentation) on expectations and report on runs\n- easily connect with backend data sources such as local file systems, S3, databases, etc.\n\n```python\n# Run all tests on our DataFrame at once\nexpectation_suite = df.get_expectation_suite(discard_failed_expectations=False)\ndf.validate(expectation_suite=expectation_suite, only_return_failures=True)\n```\n\n```json\n\"success\": true,\n\"evaluation_parameters\": {},\n\"results\": [],\n\"statistics\": {\n    \"evaluated_expectations\": 6,\n    \"successful_expectations\": 6,\n    \"unsuccessful_expectations\": 0,\n    \"success_percent\": 100.0\n}\n```\n\nMany of these expectations will be executed when the data is extracted, loaded and transformed during our [DataOps workflows](https://madewithml.com/courses/mlops/orchestration#dataops). Typically, the data will be extracted from a source ([database](https://madewithml.com/courses/mlops/data-stack#database), [API](https://madewithml.com/courses/mlops/api), etc.) and loaded into a data system (ex. [data warehouse](https://madewithml.com/courses/mlops/data-stack#data-warehouse)) before being transformed there (ex. using [dbt](https://www.getdbt.com/)) for downstream applications. 
Throughout these tasks, Great Expectations checkpoint validations can be run to ensure the validity of the data and the changes applied to it.\n\n\u003cimg width=\"700\" src=\"https://madewithml.com/static/images/mlops/testing/production.png\" alt=\"ETL pipelines in production\"\u003e\n\n\n## Models\n\nOnce we've tested our data, we can use it for downstream applications such as training machine learning models. It's important that we also test these model artifacts to ensure reliable behavior in our application.\n\n### Training\n\nUnlike traditional software, ML models can run to completion without raising any exceptions or errors and still produce incorrect results. We want to catch errors quickly to save on time and compute.\n\n- Check shapes and values of model output\n```python\nassert model(inputs).shape == torch.Size([len(inputs), num_classes])\n```\n- Check for decreasing loss after one batch of training\n```python\nassert epoch_loss \u003c prev_epoch_loss\n```\n- Overfit on a batch\n```python\naccuracy = train(model, inputs=batches[0])\nassert accuracy == pytest.approx(0.95, abs=0.05) # 0.95 ± 0.05\n```\n- Train to completion (tests early stopping, saving, etc.)\n```python\ntrain(model)\nassert learning_rate \u003e= min_learning_rate\nassert artifacts\n```\n- On different devices\n```python\nassert train(model, device=torch.device(\"cpu\"))\nassert train(model, device=torch.device(\"cuda\"))\n```\n\n### Behavioral\n\nBehavioral testing is the process of testing input data and expected outputs while treating the model as a black box (model-agnostic evaluation). 
A landmark paper on this topic is [Beyond Accuracy: Behavioral Testing of NLP Models with CheckList](https://arxiv.org/abs/2005.04118) which breaks down behavioral testing into three types of tests:\n\n- `invariance`: Changes should not affect outputs.\n```python\n# INVariance via verb injection (changes should not affect outputs)\ntokens = [\"revolutionized\", \"disrupted\"]\ntexts = [f\"Transformers applied to NLP have {token} the ML field.\" for token in tokens]\npredict.predict(texts=texts, artifacts=artifacts)\n```\n\u003cpre class=\"output\"\u003e\n['natural-language-processing', 'natural-language-processing']\n\u003c/pre\u003e\n- `directional`: Change should affect outputs.\n```python\n# DIRectional expectations (changes with known outputs)\ntokens = [\"text classification\", \"image classification\"]\ntexts = [f\"ML applied to {token}.\" for token in tokens]\npredict.predict(texts=texts, artifacts=artifacts)\n```\n\u003cpre class=\"output\"\u003e\n['natural-language-processing', 'computer-vision']\n\u003c/pre\u003e\n- `minimum functionality`: Simple combination of inputs and expected outputs.\n```python\n# Minimum Functionality Tests (simple input/output pairs)\ntokens = [\"natural language processing\", \"mlops\"]\ntexts = [f\"{token} is the next big wave in machine learning.\" for token in tokens]\npredict.predict(texts=texts, artifacts=artifacts)\n```\n\u003cpre class=\"output\"\u003e\n['natural-language-processing', 'mlops']\n\u003c/pre\u003e\n\n### Adversarial\n\nBehavioral testing can be extended to adversarial testing where we test to see how the model would perform under edge cases, bias, noise, etc.\n\n```python\ntexts = [\n    \"CNNs for text classification.\",  # CNNs are typically seen in computer-vision projects\n    \"This should not produce any relevant topics.\"  # should predict `other` label\n]\npredict.predict(texts=texts, artifacts=artifacts)\n```\n\u003cpre class=\"output\"\u003e\n    ['natural-language-processing', 
'other']\n\u003c/pre\u003e\n\n### Inference\n\nWhen our model is deployed, most users will be using it for inference (directly / indirectly), so it's very important that we test all aspects of it.\n\n#### Loading artifacts\n\nThis is the first time we're not loading our components from memory, so we want to ensure that the required artifacts (model weights, encoders, config, etc.) can all be loaded.\n\n```python\nartifacts = main.load_artifacts(run_id=run_id)\nassert isinstance(artifacts[\"label_encoder\"], data.LabelEncoder)\n...\n```\n\n#### Prediction\n\nOnce we have our artifacts loaded, we're ready to test our prediction pipelines. We should test samples with just one input, as well as a batch of inputs (ex. padding can have unintended consequences sometimes).\n\n```python\n# test our API call directly\ndata = {\n    \"texts\": [\n        {\"text\": \"Transfer learning with transformers for text classification.\"},\n        {\"text\": \"Generative adversarial networks in both PyTorch and TensorFlow.\"},\n    ]\n}\nresponse = client.post(\"/predict\", json=data)\nassert response.status_code == HTTPStatus.OK\nassert response.request.method == \"POST\"\nassert len(response.json()[\"data\"][\"predictions\"]) == len(data[\"texts\"])\n...\n```\n\n## Learn more\n\nWhile these are the foundational concepts for testing ML systems, there are a lot of software best practices for testing that we cannot show in an isolated repository. Learn a lot more about comprehensively testing code, data and models for ML systems in our [testing lesson](https://madewithml.com/courses/mlops/testing/).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgokumohandas%2Ftesting-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgokumohandas%2Ftesting-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgokumohandas%2Ftesting-ml/lists"}