# Kata: Clean Machine Learning From Dirty Code

## First, open the Kata in Google Colab (or else download it)

You can clone this project and launch jupyter-notebook, or use the files in Google Colab here:

- https://drive.google.com/drive/u/0/folders/12uzcNKU7n0EUyFzgitSt1wSaSvV4qJbs

You may want to do `File > Save a copy in Drive...` in Colab to edit your own copy of the file.

___

# Kata 1: Refactor Dirty ML Code into Pipeline

Let's convert dirty machine learning code into clean code using a [Pipeline](https://stackoverflow.com/a/60303302/2476920) - which is the [Pipe and Filter Design Pattern](https://docs.microsoft.com/en-us/azure/architecture/patterns/pipes-and-filters) applied to Machine Learning.

At first, you may still wonder *why* using this design pattern is good.
You'll realize just how good it is in the 2nd [Clean Machine Learning Kata](https://github.com/Neuraxio/Kata-Clean-Machine-Learning-From-Dirty-Code), when you do AutoML. Pipelines will give you the ability to easily manage hyperparameters and the hyperparameter space, on a per-step basis. You'll also have a good code structure for training, saving, reloading, and deploying using any library you want, without hitting a wall when it comes to serializing your whole trained pipeline for deployment in production.

## The Dataset

It'll be downloaded automatically for you in the code below.

We're using a Human Activity Recognition (HAR) dataset captured using smartphones. The [dataset](https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones) can be found on the UCI Machine Learning Repository.

### The task

Classify the type of movement amongst six categories from the phones' sensor data:
- WALKING,
- WALKING_UPSTAIRS,
- WALKING_DOWNSTAIRS,
- SITTING,
- STANDING,
- LAYING.

### Video dataset overview

Follow this link to see a video of the 6 activities recorded in the experiment with one of the participants:

<p align="center">
  <a href="http://www.youtube.com/watch?feature=player_embedded&v=XOEN9W05_4A" target="_blank"><img src="http://img.youtube.com/vi/XOEN9W05_4A/0.jpg" alt="Video of the experiment" width="400" height="300" border="10" /></a>
  <a href="https://youtu.be/XOEN9W05_4A"><center>[Watch video]</center></a>
</p>

### Details about the input data

The dataset's description goes like this:

> The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window).
> The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used.

Reference:
> Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium, 24-26 April 2013.

That said, we will use the almost-raw data: only the gravity effect has been filtered out of the accelerometer signal as a preprocessing step, yielding another 3D feature that is fed as an input to help learning. If you ever wanted to extract the gravity component yourself, you could take the following [Butterworth Low-Pass Filter (LPF)](https://github.com/guillaume-chevalier/filtering-stft-and-laplace-transform) and edit it to use a cutoff frequency of 0.3 Hz, which is a good frequency for activity recognition from body sensors.

Here is what the 3D data cube looks like. We'll have a train and a test data cube, and might create validation data cubes as well:

![](time-series-data.jpg)

So we have 3D data of shape `[batch_size, time_steps, features]`. If this is still unclear to you, you may want to [learn more about the 3D shape of time series data](https://www.quora.com/What-do-samples-features-time-steps-mean-in-LSTM/answer/Guillaume-Chevalier-2).
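To make the `[batch_size, time_steps, features]` convention concrete, here is a tiny sketch with made-up sizes (4 windows of 128 time steps from 9 sensor channels, mimicking the HAR windows):

```python
import numpy as np

# A toy data cube in the [batch_size, time_steps, features] layout.
X = np.zeros((4, 128, 9))

print(X[0].shape)        # (128, 9): one window, every time step and sensor
print(X[:, 0, :].shape)  # (4, 9):   the first time step of every window
print(X[:, :, 0].shape)  # (4, 128): one sensor channel across all windows
```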
## Loading the Dataset


```python
import os
import urllib.request

def download_import(filename):
    with open(filename, "wb") as f:
        # Downloading like this is needed because Colab operates from a Google
        # Drive folder that is only "shared with you".
        url = 'https://raw.githubusercontent.com/Neuraxio/Kata-Clean-Machine-Learning-From-Dirty-Code/master/{}'.format(filename)
        f.write(urllib.request.urlopen(url).read())

try:
    import google.colab
    download_import("data_loading.py")
    !mkdir data;
    download_import("data/download_dataset.py")
    print("Downloaded .py files: dataset loaders.")
except ImportError:
    print("No dynamic .py file download needed: not in a Colab.")

DATA_PATH = "data/"
!pwd && ls
os.chdir(DATA_PATH)
!pwd && ls
!python download_dataset.py
!pwd && ls
os.chdir("..")
!pwd && ls
DATASET_PATH = DATA_PATH + "UCI HAR Dataset/"
print("\n" + "Dataset is now located at: " + DATASET_PATH)
```


```python
# Install neuraxle if needed:
try:
    import neuraxle
    assert neuraxle.__version__ == '0.3.4'
except (ImportError, AssertionError):
    !pip install neuraxle==0.3.4
```


```python
# Finally, load the dataset!
from data_loading import load_all_data
X_train, y_train, X_test, y_test = load_all_data()
print("Dataset loaded!")
```

## Let's now define and execute our ugly code:

You don't need to change the functions just below; we'll rewrite them properly in the next section.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def get_fft_peak_infos(real_fft, time_bins_axis=-2):
    """
    Extract the indices of the bins with maximal amplitude, and the corresponding amplitude values.

    :param real_fft: real magnitudes of an FFT. It could be of shape [N, bins, features].
    :param time_bins_axis: axis of the frequency bins (e.g.: the time axis before the FFT).
    :return: two arrays without the bins axis. One is an int array, the other a float array. Shape: ([N, features], [N, features])
    """
    peak_bin = np.argmax(real_fft, axis=time_bins_axis)
    peak_bin_val = np.max(real_fft, axis=time_bins_axis)
    return peak_bin, peak_bin_val


def fft_magnitudes(data_inputs, time_axis=-2):
    """
    Apply a Fast Fourier Transform operation to analyze frequencies, and return real magnitudes.
    The bins past the half (past the Nyquist frequency) are discarded, which results in shorter time series.

    :param data_inputs: ND array of dimension at least 1. For instance, this could be of shape [N, time_axis, features]
    :param time_axis: axis along which the time series evolve
    :return: real magnitudes of the data_inputs. For instance, this could be of shape [N, (time_axis / 2) + 1, features],
             so here, we have `bins = (time_axis / 2) + 1`.
    """
    fft = np.fft.rfft(data_inputs, axis=time_axis)
    real_fft = np.absolute(fft)
    return real_fft


def get_fft_features(x_data):
    """
    Will featurize data with an FFT.

    :param x_data: 3D time series of shape [batch_size, time_steps, sensors]
    :return: featurized time series with FFT of shape [batch_size, features]
    """
    real_fft = fft_magnitudes(x_data)
    flattened_fft = real_fft.reshape(real_fft.shape[0], -1)
    peak_bin, peak_bin_val = get_fft_peak_infos(real_fft)
    return flattened_fft, peak_bin, peak_bin_val


def featurize_data(x_data):
    """
    Will convert 3D time series of shape [batch_size, time_steps, sensors] to shape [batch_size, features]
    to prepare the data for machine learning.

    :param x_data: 3D time series of shape [batch_size, time_steps, sensors]
    :return: featurized time series of shape [batch_size, features]
    """
    print("Input shape before feature union:", x_data.shape)

    flattened_fft, peak_bin, peak_bin_val = get_fft_features(x_data)
    mean = np.mean(x_data, axis=-2)
    median = np.median(x_data, axis=-2)
    minimum = np.min(x_data, axis=-2)  # renamed to avoid shadowing the `min`/`max` builtins
    maximum = np.max(x_data, axis=-2)

    featurized_data = np.concatenate([
        flattened_fft,
        peak_bin,
        peak_bin_val,
        mean,
        median,
        minimum,
        maximum,
    ], axis=-1)

    print("Shape after feature union, before classification:", featurized_data.shape)
    return featurized_data
```

Let's now use the ugly code to do ugly machine learning with it.

Fit:


```python
# Shape: [batch_size, time_steps, sensor_features]
X_train_featurized = featurize_data(X_train)
# Shape: [batch_size, remade_features]

classifier = DecisionTreeClassifier()
classifier.fit(X_train_featurized, y_train)
```
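As a side note, you can sanity-check the featurization shapes on random data without the real dataset. The numpy calls below mirror `featurize_data` inline; 128 time steps and 9 sensor channels mimic the HAR windows (illustrative sizes only):

```python
import numpy as np

# Replay the featurization on random data to verify the output width.
batch_size, time_steps, sensors = 32, 128, 9
x = np.random.rand(batch_size, time_steps, sensors)

real_fft = np.abs(np.fft.rfft(x, axis=-2))               # (32, 65, 9): 65 = 128 // 2 + 1 bins
flattened_fft = real_fft.reshape(real_fft.shape[0], -1)  # (32, 585)
peak_bin = np.argmax(real_fft, axis=-2)                  # (32, 9)
peak_bin_val = np.max(real_fft, axis=-2)                 # (32, 9)
stats = [np.mean(x, axis=-2), np.median(x, axis=-2),
         np.min(x, axis=-2), np.max(x, axis=-2)]         # four (32, 9) arrays

featurized = np.concatenate([flattened_fft, peak_bin, peak_bin_val] + stats, axis=-1)
print(featurized.shape)  # (32, 639): 585 FFT magnitudes + 6 * 9 summary features
```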
Predict:


```python
# Shape: [batch_size, time_steps, sensor_features]
X_test_featurized = featurize_data(X_test)
# Shape: [batch_size, remade_features]

y_pred = classifier.predict(X_test_featurized)
print("Shape at output after classification:", y_pred.shape)
# Shape: [batch_size]
```

Eval:


```python
accuracy = accuracy_score(y_pred=y_pred, y_true=y_test)
print("Accuracy of ugly pipeline code:", accuracy)
```

## Cleaning Up: Define Pipeline Steps and a Pipeline

The kata is to fill in the classes below and to use them properly in the pipeline thereafter.

There are some missing classes as well that you should define yourself.


```python
from neuraxle.base import BaseStep, NonFittableMixin
from neuraxle.steps.numpy import NumpyConcatenateInnerFeatures, NumpyShapePrinter, NumpyFlattenDatum


class NumpyFFT(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Featurize time series data with an FFT.

        :param data_inputs: time series data of 3D shape: [batch_size, time_steps, sensors_readings]
        :return: complex FFT bins of 3D shape: [batch_size, (time_steps // 2) + 1, sensors_readings]
        """
        transformed_data = np.fft.rfft(data_inputs, axis=-2)
        return transformed_data


class FFTPeakBinWithValue(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will compute the peak FFT bins (int) and their magnitude values (float), then concatenate them.

        :param data_inputs: real magnitudes of an FFT. It could be of shape [batch_size, bins, features].
        :return: two arrays without the bins axis, concatenated on the feature axis. Shape: [batch_size, 2 * features]
        """
        time_bins_axis = -2
        peak_bin = np.argmax(data_inputs, axis=time_bins_axis)
        peak_bin_val = np.max(data_inputs, axis=time_bins_axis)

        # Notice that here another FeatureUnion could have been used with a joiner:
        transformed = np.concatenate([peak_bin, peak_bin_val], axis=-1)

        return transformed


class NumpyMedian(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will featurize data with a median.

        :param data_inputs: 3D time series of shape [batch_size, time_steps, sensors]
        :return: featurized time series of shape [batch_size, features]
        """
        return np.median(data_inputs, axis=-2)


class NumpyMean(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will featurize data with a mean.

        :param data_inputs: 3D time series of shape [batch_size, time_steps, sensors]
        :return: featurized time series of shape [batch_size, features]
        """
        raise NotImplementedError("TODO")
        return ...
```
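Outside of any framework, the pipe-and-filter idea these steps implement boils down to objects with a `transform` method, plus a container that folds data through them in order. Here is a plain-Python sketch of that idea (the `Step` and `MiniPipeline` names are invented for illustration; this is not Neuraxle's actual API):

```python
import numpy as np

class Step:
    """A transform-only filter: data in, data out."""
    def __init__(self, fn):
        self.fn = fn

    def transform(self, data_inputs):
        return self.fn(data_inputs)

class MiniPipeline:
    """The pipe: feeds each step's output into the next step."""
    def __init__(self, steps):
        self.steps = steps

    def transform(self, data_inputs):
        for step in self.steps:
            data_inputs = step.transform(data_inputs)
        return data_inputs

pipeline = MiniPipeline([
    Step(lambda x: np.fft.rfft(x, axis=-2)),  # FFT over the time axis
    Step(np.abs),                             # real magnitudes
    Step(lambda x: np.max(x, axis=-2)),       # peak magnitude per sensor
])
print(pipeline.transform(np.ones((4, 128, 9))).shape)  # (4, 9)
```

Swapping, reordering, or nesting steps then becomes a matter of editing one list, which is exactly what the Neuraxle `Pipeline` below gives you, with fitting, hyperparameters, and serialization on top.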
Let's now create the Pipeline with the code:


```python
from neuraxle.base import Identity
from neuraxle.pipeline import Pipeline
from neuraxle.steps.flow import TrainOnlyWrapper
from neuraxle.union import FeatureUnion

pipeline = Pipeline([
    # ToNumpy(),  # Cast type in case it was a list.
    # For debugging, do this print at train-time only:
    TrainOnlyWrapper(NumpyShapePrinter(custom_message="Input shape before feature union")),
    # Shape: [batch_size, time_steps, sensor_features]
    FeatureUnion([
        # TODO in kata 1: Fill in the classes in this FeatureUnion here and make them work.
        #      Note that you may comment out some of those feature classes
        #      temporarily and reactivate them one by one.
        Pipeline([
            NumpyFFT(),
            NumpyAbs(),  # Do `np.abs` here.
            FeatureUnion([
                NumpyFlattenDatum(),  # Reshape from 3D to flat 2D: flatten the data except on the batch axis.
                FFTPeakBinWithValue()  # Extract 2D features from the 3D FFT bins.
            ], joiner=NumpyConcatenateInnerFeatures())
        ]),
        NumpyMean(),
        NumpyMedian(),
        NumpyMin(),
        NumpyMax()
    ], joiner=NumpyConcatenateInnerFeatures()),  # The joiner will join like this: np.concatenate([...], axis=-1)
    # TODO, optional: Add some feature selection right here for the motivated ones:
    #      https://scikit-learn.org/stable/modules/feature_selection.html
    TrainOnlyWrapper(NumpyShapePrinter(custom_message="Shape after feature union, before classification")),
    # Shape: [batch_size, remade_features]
    # TODO: use an `Inherently multiclass` classifier here from:
    #       https://scikit-learn.org/stable/modules/multiclass.html
    YourClassifier(),
    TrainOnlyWrapper(NumpyShapePrinter(custom_message="Shape at output after classification")),
    # Shape: [batch_size]
    Identity()
])
```

## Test Your Code: Make the Tests Pass

The 3rd test is the real deal.


```python
def _test_is_pipeline(pipeline):
    assert isinstance(pipeline, Pipeline)


def _test_has_all_data_preprocessors(pipeline):
    assert "DecisionTreeClassifier" in pipeline
    assert "FeatureUnion" in pipeline
    assert "Pipeline" in pipeline["FeatureUnion"]
    assert "NumpyMean" in pipeline["FeatureUnion"]
    assert "NumpyMedian" in pipeline["FeatureUnion"]
    assert "NumpyMin" in pipeline["FeatureUnion"]
    assert "NumpyMax" in pipeline["FeatureUnion"]


def _test_pipeline_works_and_has_ok_score(pipeline):
    pipeline = pipeline.fit(X_train, y_train)

    y_pred = pipeline.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    print("Test accuracy score:", accuracy)
    assert accuracy > 0.7


if __name__ == '__main__':
    tests = [_test_is_pipeline, _test_has_all_data_preprocessors, _test_pipeline_works_and_has_ok_score]
    for t in tests:
        try:
            t(pipeline)
            print("==> Test '{}(pipeline)' succeeded!".format(t.__name__))
        except Exception:
            print("==> Test '{}(pipeline)' failed:".format(t.__name__))
            import traceback
            print(traceback.format_exc())
```
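For comparison, scikit-learn ships its own `Pipeline` and `FeatureUnion`, which the Neuraxle classes above mirror. Here is a cut-down sketch of the feature union with just two of the statistics, using `FunctionTransformer` as the stateless step (illustrative only, not the kata's intended solution; note that `FunctionTransformer` must be left at its default `validate=False` to accept 3D arrays):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Two stateless feature extractors joined on the feature axis, assuming
# 3D inputs of shape [batch_size, time_steps, sensors].
union = FeatureUnion([
    ("mean", FunctionTransformer(lambda x: np.mean(x, axis=-2))),
    ("max", FunctionTransformer(lambda x: np.max(x, axis=-2))),
])

X = np.random.rand(8, 128, 9)
print(union.fit_transform(X).shape)  # (8, 18): 9 means + 9 maxes per window
```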
## Good job!

Your code should now be clean after making the tests pass.

## You're ready for [Kata 2](https://github.com/Neuraxio/Kata-Clean-Machine-Learning-From-Dirty-Code#kata-clean-machine-learning-from-dirty-code).

You should now be ready for the 2nd [Clean Machine Learning Kata](https://github.com/Neuraxio/Kata-Clean-Machine-Learning-From-Dirty-Code#kata-clean-machine-learning-from-dirty-code). Note that the solutions are available in the repository linked above as well. You may use the links to the Google Colab files to try to solve the Katas.
___

## Recommended additional readings and learning resources:

- For more info on clean machine learning, you may want to read [How to Code Neat Machine Learning Pipelines](https://www.neuraxio.com/en/blog/neuraxle/2019/10/26/neat-machine-learning-pipelines.html).
- For higher performance, you could use an [LSTM Recurrent Neural Network](https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition) and refactor it into a neat pipeline like the one you've created here, now by [using TensorFlow in your ML pipeline](https://github.com/Neuraxio/Neuraxle-TensorFlow).
- You may also want to request [more training and coaching for your ML or time series processing projects](https://www.neuraxio.com/en/time-series-solution) from us if you need it.