{"id":13574433,"url":"https://github.com/oneapi-src/powerline-fault-detection","last_synced_at":"2025-04-04T15:30:58.295Z","repository":{"id":66145931,"uuid":"574715816","full_name":"oneapi-src/powerline-fault-detection","owner":"oneapi-src","description":"AI Starter Kit for detecting faulty signals in power line voltage using Intel® Extension for Scikit-learn*","archived":true,"fork":false,"pushed_at":"2024-02-16T21:12:52.000Z","size":410,"stargazers_count":3,"open_issues_count":0,"forks_count":4,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-11-05T09:44:38.776Z","etag":null,"topics":["machine-learning","scikit-learn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oneapi-src.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-05T23:18:33.000Z","updated_at":"2024-04-08T18:33:52.000Z","dependencies_parsed_at":"2024-02-16T22:42:19.192Z","dependency_job_id":null,"html_url":"https://github.com/oneapi-src/powerline-fault-detection","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oneapi-src%2Fpowerline-fault-detection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oneapi-src%2Fpowerline-fault-detection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oneapi-src%2Fpowerline-fault-detection/releases","manifests_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/oneapi-src%2Fpowerline-fault-detection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oneapi-src","download_url":"https://codeload.github.com/oneapi-src/powerline-fault-detection/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247202619,"owners_count":20900818,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","scikit-learn"],"created_at":"2024-08-01T15:00:51.591Z","updated_at":"2025-04-04T15:30:53.276Z","avatar_url":"https://github.com/oneapi-src.png","language":"Python","readme":"PROJECT NOT UNDER ACTIVE MANAGEMENT\n\nThis project will no longer be maintained by Intel.\n\nIntel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.  \n\nIntel no longer accepts patches to this project.\n\nIf you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.  
\n\nContact: webadmin@linux.intel.com\n# Powerline Fault Detection\r\n\r\n## Introduction\r\nThis reference kit implements an end-to-end (E2E) workflow with the help of [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html) and [Intel® Extension for Scikit-Learn*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html) to detect faulty signals from measured power line voltage. With this solution, the detected signals can be addressed early enough to avoid permanent and expensive damage.\r\n\r\nCheck out more workflow examples in the [Developer Catalog](https://developer.intel.com/aireferenceimplementations).\r\n\r\n## Solution Technical Overview\r\n\r\nFaults in overhead electric transmission lines can lead to a destructive phenomenon called partial discharge (PD). If left unaddressed, partial discharges can eventually destroy equipment. Overhead electric transmission lines run for hundreds of thousands of miles in the United States alone, so manually inspecting the lines for damage that doesn't cause an immediate outage is expensive, and undetected faulty lines can eventually develop partial discharges. Using Machine Learning (ML) to proactively detect partial discharges can reduce maintenance costs and prevent power outages and fires.\r\n\r\nThe purpose of this workflow is to process and analyze the signals from a 3-phase power supply system, as used in power lines, to predict whether or not a signal has a partial discharge (which means it is faulty). Using SciPy* and NumPy* calculations, we first extract features from synthetic signals generated with Python* packages and open-source libraries, then feed those features into a supervised ML pipeline that uses Random Forest Classification (RFC) to infer the signal status. 
Intel® Extension for Scikit-Learn* is used to optimize this pipeline for better performance.\r\n\r\nThe solution contained in this repo uses the following Intel® packages:\r\n\r\n* ***Intel® Distribution for Python****\r\n\r\n\tThe [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html) provides:\r\n\r\n  * Scalable performance using all available CPU cores on laptops, desktops, and powerful servers.\r\n  * Support for the latest CPU instructions.\r\n  * Near-native performance through acceleration of core numerical and machine learning packages with libraries like the Intel® oneAPI Math Kernel Library (oneMKL) and Intel® oneAPI Data Analytics Library.\r\n  * Productivity tools for compiling Python* code into optimized instructions.\r\n  * Essential Python* bindings for easing integration of Intel® native tools with your Python* project.\r\n\r\n* ***Intel® Extension for Scikit-Learn\****\r\n\r\n    Designed for data scientists, [Intel® Extension for Scikit-Learn*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html) is a seamless way to speed up your Scikit-Learn* applications for machine learning on real-world problems. This extension package dynamically patches scikit-learn estimators to use Intel® oneAPI Data Analytics Library (oneDAL) as the underlying solver, achieving out-of-the-box speedups for your machine learning algorithms.\r\n\r\nFor more details, visit [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html), [Intel® Extension for Scikit-Learn*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html), and the [Powerline Fault Detection](https://github.com/oneapi-src/powerline-fault-detection) GitHub repository.\r\n\r\n## Solution Technical Details\r\nFor data processing, we use the Python* libraries SciPy* and NumPy*. 
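The estimator patching described in the package overview above is usually a two-line change applied before importing scikit-learn estimators. A minimal sketch, assuming the `scikit-learn-intelex` package that provides `sklearnex` is installed:

```python
# Sketch of Intel® Extension for Scikit-Learn* patching (assumes the
# scikit-learn-intelex package is installed). patch_sklearn() must run
# before the scikit-learn estimators are imported.
from sklearnex import patch_sklearn

patch_sklearn()  # supported estimators now use oneDAL-backed solvers

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, random_state=0)
```

A matching `unpatch_sklearn()` entry point exists to restore stock scikit-learn behavior when needed.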
The signals (generated using periodic functions and Gaussian noise) are passed as NumPy* arrays. Properties such as signal-to-noise ratio, peak features, etc., are used to extract the features that feed ML model building. SciPy* and NumPy* are built to handle these heavy vector calculations on the arrays.\r\n\r\nThe features are then passed into a Random Forest model, which is trained and used to predict the faulty signals. Random Forest is an ensemble algorithm, which provides the benefits of increased performance and a reduced chance of overfitting. This helps make the model applicable to the general problem rather than only this specific dataset.\r\n\r\nThis section provides key implementation details on the proposed reference solution for the target use case. It is organized as follows:\r\n\r\n1. Proposed reference end-to-end architecture.\r\n2. Setting up the Intel® environment.\r\n3. Executing the optimized reference architecture pipeline (data loading and processing, supervised ML training/hyperparameter tuning and inference) using [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html) and [Intel® Extension for Scikit-Learn*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html).\r\n\r\n### Proposed Architecture using Intel® Packages\r\n\r\nAs mentioned before, the raw signal timeseries data and metadata are synthetically generated using a combination of TimeSynth* and other NumPy*/SciPy* functions. The data generation is executed as part of the pipeline. After this step, we use NumPy* and SciPy* to denoise these signals for greater accuracy before extracting features to be passed on to training. The supervised ML model training is performed using Random Forest Classification, chosen for its ensemble-learning advantages: increased performance and a lower likelihood of overfitting. 
We also use hyperparameter tuning with cross validation to optimize the model configuration and increase accuracy for inference. The model can then be used for streaming or batch inference. This entire pipeline is executed by running one script, `run_benchmarks.py`.\r\n\r\n![e2e-flow-intel](assets/e2e-flow-intel.png)\r\n\r\nEach step in the pipeline is optimized using Intel® technologies, and the ML optimizations from Intel® Extension for Scikit-Learn* are enabled for faster performance.\r\n\r\n### Dataset\r\n\r\nThe dataset used for this workflow is synthetically generated. By default, it is a set of 9600 signals' timeseries data and metadata. The synthetic data is then passed to the pipeline that is the focus of this reference kit.\r\n\r\n\u003eNote: Generating synthetic data is a part of this reference kit's pipeline execution, but it is not a part of the analysis and benchmarking. It is just the prerequisite step in the pipeline to have the data.\r\n\r\nEach signal contains 10,000 floating point values depicting one cycle of a 50 Hertz (Hz) 3-phase power line's measured voltage. Signal metadata includes the **signal ID**, **group ID** and **phase**, since this is a 3-phase power system.\r\nThe dependent variable that we are trying to predict is whether or not the signal has PD; this is the **target** variable. A **target** value of 0 is negative, meaning the signal does not have partial discharge and is functioning normally, whereas a value of 1 is positive, meaning partial discharge is present and the signal is deemed faulty.\r\n\r\nFor reference, the table below shows what some rows from the signal dataframe may look like. 
Note that this is column-based, so the signal ID is the column name and the values in one column represent one signal's voltage measured at each time increment.\r\n\r\n| **0**                | **1**          | **2**             | **3**           | **...**\r\n| :---                 | :---           | :---              | :---            | :---\r\n| -3.366481932         | -13.24926079   | 10.85013064       | 1.733765166     | ...\r\n| -0.557262024         | -11.95544088   | 13.95043041       | 1.093923156     | ...\r\n| -3.600549611         | -11.76913344   | 10.82880894       | -0.157187225    | ...\r\n| ...                  | ...            | ...               | ...             | ...\r\n\r\nFor reference, the table below shows what some rows from the metadata dataframe may look like. Note that here the signal IDs are row-based, but the same ID is used to map to the relevant column in the signal dataframe. This means that the timeseries signal above in column \"0\" of the signal dataframe maps to the row with sig_id 0 below in the metadata dataframe.\r\n\r\n| **Sig_id**           | **Group_id**   | **Phase**        | **Target**\r\n| :---                 | :---           | :---             | :---\r\n| 0                    | 0              | 0                | 0\r\n| 1                    | 0              | 1                | 0\r\n| 2                    | 0              | 2                | 0\r\n| ...                  | ...            | ...              | ...      
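For illustration, one such signal can be sketched with NumPy* alone. This is a hypothetical generator: the kit itself uses TimeSynth* and richer noise models, and the amplitude and noise level below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_samples = 10_000  # floating point values per signal, as described above
t = np.linspace(0.0, 0.02, n_samples, endpoint=False)  # one 20 ms cycle at 50 Hz

clean = 15.0 * np.sin(2.0 * np.pi * 50.0 * t)  # idealized phase voltage (illustrative)
noise = rng.normal(0.0, 1.0, n_samples)        # additive Gaussian noise
signal = clean + noise                         # one column of the signal dataframe

# Matching metadata row, keyed by the same signal ID
metadata = {"sig_id": 0, "group_id": 0, "phase": 0, "target": 0}
```

In the real dataset, each such array becomes one column of the signal dataframe, and the dictionary becomes the corresponding row of the metadata dataframe.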
\r\n\r\n## Validated Hardware Details\r\nThere are workflow-specific hardware and software setup requirements to run this use case.\r\n\r\n| Recommended Hardware\r\n| ----------------------------\r\n| CPU: 2nd Gen Intel® Xeon® Platinum 8280 CPU @ 2.70GHz or higher\r\n| RAM: 187 GB\r\n| Recommended Free Disk Space: 20 GB or more\r\n\r\nOperating System: Ubuntu* 22.04 LTS.\r\n\r\n## How it Works\r\n\r\n### Data Processing\r\nThe raw synthetic data is passed into functions defined in `data_processing.py` and called from `run_benchmarks.py` during code execution. The signals are individually processed to filter out the noise and extract relevant features. These extracted features are then combined into a dataframe which is passed down to the ML pipeline for training and inference.\r\n\r\n### Model Training/Hyperparameter Tuning + Inference\r\nOnce the features are generated, the data is ready to be used in model training. Random Forest Classification is used in this instance, for the advantages listed in previous sections, as a supervised binary classification algorithm. The model is trained on 16 numeric features for each signal, with hyperparameter tuning and cross-validation integrated. The input for both training and inference is the features dataset in the same format, so we use a Scikit-Learn* function to split the data into train and test inputs with a 7:3 ratio. The default dataset size is 9600, so training is performed on 6720 signals' feature data. An important step in the training function is hyperparameter tuning using grid search and cross validation. These techniques check combinations of hyperparameters for the model to be trained on and output the model that gives the most accurate results, which is then used for inference.\r\n\r\n### Batch/Streaming Prediction\r\n\r\nThe model then runs either batch inference or simulated real-time inference, depending on the runtime configuration. 
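The 7:3 split, grid search with cross validation, and batch prediction described above can be sketched with plain scikit-learn as follows. The features, labels, and grid values are illustrative placeholders, not the kit's actual 16 features or search space:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy stand-in for the per-signal feature dataset (random values for illustration)
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(300, 16))    # 16 numeric features per signal
y = rng.integers(0, 2, size=300)  # 0 = not faulty, 1 = faulty

# 7:3 train/test split, as in the reference kit
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Grid search with cross validation; the winning estimator is kept for inference
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_leaf_nodes": [8, 16]},
    cv=3,
)
grid.fit(X_train, y_train)

preds = grid.best_estimator_.predict(X_test)  # batch inference on the held-out 30%
```

In the kit itself, Intel® Extension for Scikit-Learn* patching accelerates these same estimator calls without further code changes.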
As the test input ratio was 30%, batch inference is run on 2880 signals' feature data. Real-time inference is simulated and benchmarked by randomly sampling one signal from the test dataset 1000 times, saving each individual prediction as well as taking the average time over the 1000 iterations as our real-time benchmark. These functions are defined in `train_and_predict.py` and called from `run_benchmarks.py`.\r\n\r\n### Expected input and output for each step of pipeline\r\n\r\n#### Data Processing\r\n\r\n| **Expected Input**     | **Expected Output**                   | **Comment**\r\n| :---                   | :---                                  | :---\r\n| Signal timeseries data as arrays of 10,000 floating point values, i.e. [0.1, 1.2, …, 0.8, -0.2], and metadata, i.e. [ {signal_id: 0, group_id: 0, phase: 1, target: 1}… {…} ] | Feature dataset to be used in training the model, i.e. {'noise_ratio': 0.001, 'num_pos_peaks': 147, 'num_neg_peaks': 148, 'true_peaks_sd': 2700, … } | Of the 16 numeric features, 1 is pre-processed and the rest are results after the signal gets filtered for noise\r\n\r\n#### Training/Hyperparameter tuning\r\n\r\n| **Expected Input**     | **Expected Output**                   | **Comment**\r\n| :---                   | :---                                  | :---\r\n| Signal features dataset | Trained RFC model which is the best estimator of the given parameters after hyperparameter tuning | The model is tuned for the following parameters: n_estimators, max_leaf_nodes, and max_estimators.\r\n\r\n\r\n#### Batch Prediction\r\n\r\n| **Expected Input**     | **Expected Output**                   | **Comment**\r\n| :---                   | :---                                  | :---\r\n| Trained model and test data input | Array of prediction classes of whether a signal is faulty or not based on presence of partial discharge (0 for negative target = not faulty, and 1 for 
positive target = faulty), i.e. {sig_id: target}, i.e. {… 56:1, 57:1, 58:0…} | The array is used to calculate accuracy and f1-scores for the model. \r\n\r\n#### Streaming Prediction\r\n\r\n| **Expected Input**     | **Expected Output**                   | **Comment**\r\n| :---                   | :---                                  | :---\r\n| Trained model and test data input, which is a single row randomly sampled multiple times to simulate real-time inference  | Prediction class for individually inferenced signals (0 for negative target and 1 for positive), i.e. {sig_id: target}, i.e. {… 130:1, 2:1, 488:0…} | The primary objective of running streaming inference is to benchmark the time taken per prediction. The average time for a single prediction (over 1000 trials) is written to the log file. Each prediction is still saved and returned in the same format as batch prediction, but note that this is **not** the same as batch inference.\r\n\r\n\r\n## Get Started\r\nStart by **defining an environment variable** that will store the workspace path. This can be an existing directory or one to be created in further steps. 
This ENVVAR will be used for all the commands executed using absolute paths.\r\n\r\n[//]: # (capture: baremetal)\r\n```bash\r\nexport WORKSPACE=$PWD/powerline-fault-detection\r\n```\r\n\r\nDefine `DATA_DIR` and `OUTPUT_DIR`.\r\n\r\n[//]: # (capture: baremetal)\r\n```bash\r\nexport DATA_DIR=$WORKSPACE/data\r\nexport OUTPUT_DIR=$WORKSPACE/output\r\n```\r\n\r\n### Download the Workflow Repository\r\nCreate a working directory for the workflow and clone the [Main\r\nRepository](https://github.com/oneapi-src/powerline-fault-detection) into your working\r\ndirectory.\r\n\r\n[//]: # (capture: baremetal)\r\n```\r\nmkdir -p $WORKSPACE \u0026\u0026 cd $WORKSPACE\r\n```\r\n\r\n```bash\r\ngit clone https://github.com/oneapi-src/powerline-fault-detection $WORKSPACE\r\n```\r\n### Set Up Conda\r\nTo learn more, please visit [install anaconda on Linux](https://docs.anaconda.com/free/anaconda/install/linux/).\r\n\r\n```bash\r\nwget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\r\nbash Miniconda3-latest-Linux-x86_64.sh\r\n```\r\n### Set Up Environment\r\nInstall libmamba and set it as the default solver. 
Run the following commands:\r\n\r\n```bash\r\nconda install -n base conda-libmamba-solver -y\r\nconda config --set solver libmamba\r\n```\r\n\r\nThe [env/intel_env.yml](./env/intel_env.yml) file contains all dependencies needed to create the Intel® environment.\r\n\r\n| **Packages required in YAML file**| **Version**\r\n| :---                              | :--\r\n| python                            | 3.10\r\n| intelpython3_full                 | 2024.0\r\n| matplotlib                        | 3.8.0\r\n| pandas                            | 2.1.3\r\n| pip                               | 23.3.1\r\n| timesynth                         | [e50cdb9](https://github.com/TimeSynth/TimeSynth/commit/e50cdb9015d415adf46a4eae161a087c5c378564)\r\n\r\nExecute the next command to create the conda environment:\r\n\r\n```bash\r\nconda env create -f $WORKSPACE/env/intel_env.yml\r\n```\r\n\r\nDuring this setup, the `fault_detection_intel` conda environment will be created with the dependencies listed in the YAML configuration. Use the following command to activate the environment created above:\r\n\r\n```bash\r\nconda activate fault_detection_intel\r\n```\r\n\r\n## Supported Runtime Environment\r\nYou can execute this reference pipeline using the following environments:\r\n* Bare Metal\r\n\r\n### Run Using Bare Metal\r\nFollow these instructions to set up and run this workflow on your own development system.\r\n\r\n#### Set Up System Software\r\nOur examples use the ``conda`` package and environment on your local computer. 
If you don't already have ``conda`` installed, go to [Set up conda](#set-up-conda) or see the [Conda* Linux installation instructions](https://docs.conda.io/projects/conda/en/stable/user-guide/install/linux.html).\r\n\r\n#### Run Workflow\r\nOnce the `fault_detection_intel` environment is created and activated, we can run the next steps.\r\n\r\nThe `run_benchmarks.py` script **generates and processes the data**, **performs feature engineering**, **trains a Random Forest model with hyperparameter tuning**, and **performs inference using the trained model**. It also reports the time taken by the relevant technologies at each step in the pipeline.\r\n\r\n\u003e Before running the script, ensure that the appropriate conda environment is activated.\r\n\r\nThe `run_benchmarks.py` script takes the following arguments:\r\n\r\n```shell\r\nusage: run_benchmarks.py [-l LOGFILE] [-s] [-n DATASET_LEN] \r\n    [--data_dir DATA_DIR] -o OUTPUT_DIR\r\n\r\noptions:\r\n  -l LOGFILE, --logfile LOGFILE\r\n                        log file to output benchmarking results to (default: None)\r\n  -s, --streaming       run streaming inference (default: False)\r\n  -n DATASET_LEN, --dataset_len DATASET_LEN\r\n                        number of signals to generate, ideally a multiple of 3 (default: 9600)\r\n  --data_dir DATA_DIR   save synthetic data generated to (default: None)\r\n  -o OUTPUT_DIR, --output_dir OUTPUT_DIR\r\n                        save outputs to (default: None)\r\n```\r\n\r\nFor example, to run inference for 960, 3200, 9600 and 11200 signals, pass the following parameters to the `run_benchmarks.py` script, respectively:\r\n\r\n[//]: # (capture: baremetal)\r\n```shell\r\npython $WORKSPACE/src/run_benchmarks.py -l $OUTPUT_DIR/logs/log_960.txt -n 960 \\\r\n    -o $OUTPUT_DIR/960 --data_dir $DATA_DIR/960\r\n```\r\n\r\n[//]: # (capture: baremetal)\r\n```shell\r\npython $WORKSPACE/src/run_benchmarks.py -l $OUTPUT_DIR/logs/log_3200.txt -n 3200 \\\r\n    -o $OUTPUT_DIR/3200 --data_dir 
$DATA_DIR/3200\r\n```\r\n\r\n[//]: # (capture: baremetal)\r\n```shell\r\npython $WORKSPACE/src/run_benchmarks.py -l $OUTPUT_DIR/logs/log_9600.txt -n 9600 \\\r\n    -o $OUTPUT_DIR/9600 --data_dir $DATA_DIR/9600\r\n```\r\n\r\n[//]: # (capture: baremetal)\r\n```shell\r\npython $WORKSPACE/src/run_benchmarks.py -l $OUTPUT_DIR/logs/log_11200.txt -n 11200 \\\r\n    -o $OUTPUT_DIR/11200 --data_dir $DATA_DIR/11200\r\n```\r\n\r\nFor streaming inference, we only need to run one command to get the benchmark, since it is simulated by randomly sampling one signal from the test dataset 1000 times, saving each individual prediction and taking the average time over the 1000 iterations.\r\n\r\nTo run real-time inference with 960 generated signals, run the command:\r\n\r\n[//]: # (capture: baremetal)\r\n```shell\r\npython $WORKSPACE/src/run_benchmarks.py -s -l $OUTPUT_DIR/logs/log_960_streaming.txt -n 960 \\\r\n    -o $OUTPUT_DIR/960_streaming --data_dir $DATA_DIR/960_streaming\r\n```\r\n\r\n#### Clean Up Bare Metal\r\nFollow these steps to restore your `$WORKSPACE` directory to its initial state. Please note that all downloaded or created dataset files, the conda environment, and logs created by the workflow will be deleted. 
Before executing the next steps, back up your important files.\r\n\r\n```shell\r\n# activate base environment\r\nconda activate base\r\n# delete conda environment created\r\nconda env remove -n fault_detection_intel\r\n```\r\n\r\n```shell\r\n# remove all outputs generated\r\nrm -rf $OUTPUT_DIR $DATA_DIR\r\n```\r\n\r\n### Expected Output\r\nThe sample outputs below are generated by the commands executed in the [Run Workflow](#run-workflow) section:\r\n\r\nFor `dataset_len=960`:\r\n\r\n```text\r\nINFO:__main__:Beginning data generation\r\nINFO:__main__:Data generated\r\nINFO:__main__:Beginning data processing\r\nINFO:__main__:Data processed\r\nINFO:__main__:Total numpy time in data processing: 7.306015968322754\r\nINFO:__main__:Total scipy time in data processing: 18.405130624771118\r\nINFO:__main__:Beginning model training and inference\r\nINFO:__main__:Training and inferencing complete\r\nINFO:__main__:Dataset splitting time: 0.0023314952850341797\r\nINFO:__main__:Training time: 118.98462414741516\r\nINFO:__main__:Batch inference time: 0.19353175163269043\r\nINFO:__main__:Batch inference accuracy: 0.9548611111111112\r\nINFO:__main__:Batch inference macro F1: 0.705637235631732\r\n```\r\n\r\nFor `dataset_len=3200`:\r\n\r\n```text\r\nINFO:__main__:Beginning data generation\r\nINFO:__main__:Data generated\r\nINFO:__main__:Beginning data processing\r\nINFO:__main__:Data processed\r\nINFO:__main__:Total numpy time in data processing: 26.141985654830933\r\nINFO:__main__:Total scipy time in data processing: 62.58973574638367\r\nINFO:__main__:Beginning model training and inference\r\nINFO:sklearnex:sklearn.utils.validation._assert_all_finite: running accelerated version on CPU\r\nINFO:__main__:Training and inferencing complete\r\nINFO:__main__:Dataset splitting time: 0.003203153610229492\r\nINFO:__main__:Training time: 133.78736805915833\r\nINFO:__main__:Batch inference time: 0.038265228271484375\r\nINFO:__main__:Batch inference accuracy: 0.965625\r\nINFO:__main__:Batch inference 
macro F1: 0.8395128648068126\r\n```\r\n\r\nFor `dataset_len=9600`:\r\n\r\n```text\r\nINFO:__main__:Beginning data generation\r\nINFO:__main__:Data generated\r\nINFO:__main__:Beginning data processing\r\nINFO:__main__:Data processed\r\nINFO:__main__:Total numpy time in data processing: 85.39899587631226\r\nINFO:__main__:Total scipy time in data processing: 193.02739572525024\r\nINFO:__main__:Beginning model training and inference\r\nINFO:sklearnex:sklearn.utils.validation._assert_all_finite: running accelerated version on CPU\r\n...\r\nINFO:sklearnex:sklearn.utils.validation._assert_all_finite: running accelerated version on CPU\r\nINFO:__main__:Training and inferencing complete\r\nINFO:__main__:Dataset splitting time: 0.04280877113342285\r\nINFO:__main__:Training time: 173.00040245056152\r\nINFO:__main__:Batch inference time: 0.131119966506958\r\nINFO:__main__:Batch inference accuracy: 0.9690972222222223\r\nINFO:__main__:Batch inference macro F1: 0.842022642681621\r\n```\r\nFor `dataset_len=11200`:\r\n\r\n```text\r\nINFO:__main__:Beginning data generation\r\nINFO:__main__:Data generated\r\nINFO:__main__:Beginning data processing\r\nINFO:__main__:Data processed\r\nINFO:__main__:Total numpy time in data processing: 99.85584139823914\r\nINFO:__main__:Total scipy time in data processing: 225.84478378295898\r\nINFO:__main__:Beginning model training and inference\r\nINFO:sklearnex:sklearn.utils.validation._assert_all_finite: running accelerated version on CPU\r\n...\r\nINFO:sklearnex:sklearn.utils.validation._assert_all_finite: running accelerated version on CPU\r\nINFO:__main__:Training and inferencing complete\r\nINFO:__main__:Dataset splitting time: 0.00579071044921875\r\nINFO:__main__:Training time: 178.87619042396545\r\nINFO:__main__:Batch inference time: 0.1290571689605713\r\nINFO:__main__:Batch inference accuracy: 0.9642857142857143\r\nINFO:__main__:Batch inference macro F1: 0.8161711067666539\r\n```\r\n\r\nFor streaming inference (adding `-s` flag) and 
`dataset_len=960`\r\n\r\n```text\r\nINFO:__main__:Beginning data generation\r\nINFO:__main__:Data generated\r\nINFO:__main__:Beginning data processing\r\nINFO:__main__:Data processed\r\nINFO:__main__:Total numpy time in data processing: 7.338321685791016\r\nINFO:__main__:Total scipy time in data processing: 18.340595960617065\r\nINFO:__main__:Beginning model training and inference\r\nINFO:__main__:Training and inferencing complete\r\nINFO:__main__:Dataset splitting time: 0.0023450851440429688\r\nINFO:__main__:Training time: 122.25705790519714\r\nINFO:__main__:Real-time inference time: 0.025083412408828734\r\n```\r\n\r\nA successful execution of this workflow should produce the following output files in `$OUTPUT_DIR`:\r\n\r\n```text\r\n├── 11200\r\n│   └── predictions.csv\r\n├── 3200\r\n│   └── predictions.csv\r\n├── 960\r\n│   └── predictions.csv\r\n├── 9600\r\n│   └── predictions.csv\r\n├── 960_streaming\r\n│   └── predictions.csv\r\n└── logs\r\n    ├── log_11200.txt\r\n    ├── log_3200.txt\r\n    ├── log_9600.txt\r\n    ├── log_960_streaming.txt\r\n    └── log_960.txt\r\n```\r\n\r\nThe executed commands generate synthetic data and save it to `$DATA_DIR`:\r\n\r\n```text\r\n├── 11200\r\n│   ├── metadata.csv\r\n│   └── signal_data.csv\r\n├── 3200\r\n│   ├── metadata.csv\r\n│   └── signal_data.csv\r\n├── 960\r\n│   ├── metadata.csv\r\n│   └── signal_data.csv\r\n├── 9600\r\n│   ├── metadata.csv\r\n│   └── signal_data.csv\r\n└── 960_streaming\r\n    ├── metadata.csv\r\n    └── signal_data.csv\r\n```\r\n\r\n## Summary and Next Steps\r\nIn this reference kit, signals from a 3-phase power supply system were processed and analyzed to build a supervised machine learning pipeline using Random Forest Classification to infer the signal status.\r\n\r\nThe E2E workflow performed the following tasks:\r\n1. Training a Random Forest Classification model with hyperparameter tuning built-in.\r\n2. Predicting outcomes over batch data using the trained Random Forest Classification model.\r\n3. 
Testing four dataset sizes: 960, 3200, 9600, and 11200 signals.\r\n4. Repeating the inference exercise but for streaming data (run with the default size of 960 signals).\r\n\r\nThe proposed architecture using Intel® technologies and the ML optimizations from Intel® Extension for Scikit-Learn* provides faster performance on this fault detection task.\r\n\r\nFault detection with signal processing is a computationally expensive process, but with optimizations and cost-efficient data collection, detecting partial discharge early will ultimately save money and resources by avoiding complete outages and failures. The maintenance costs, as well as the need for extensive power line repair, will be lower if the problem is caught early enough.\r\n\r\n## Learn More\r\nFor more information or to read about other relevant workflow examples, see these guides and software resources:\r\n\r\n- [Intel® AI Analytics Toolkit (AI Kit)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html)\r\n- [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html)\r\n- [Intel® Optimization for TensorFlow*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/optimization-for-tensorflow.html)\r\n- [Intel® Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html)\r\n\r\n## Support\r\nIf you have questions or issues about this use case, need help with troubleshooting, or want to report a bug or submit enhancement requests, please submit a GitHub issue.\r\n\r\n## Appendix\r\n\\*Names and brands that may be claimed as the property of others. 
[Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html).\r\n\r\n### Acknowledgments\r\nThe following are open-source codebases that helped with the foundation of this workflow:\r\n- https://gist.github.com/suhaskv/4b40f1b8c88c9f38abe7d583997bb9f6#file-get_all_peaks-py\r\n- https://www.kaggle.com/code/xhlulu/exploring-signal-processing-with-scipy/notebook\r\n\r\n### Notes\r\n\r\n**Please see this dataset's applicable license for terms and conditions. Intel® does not own the rights to this data set and does not confer any rights to it.**\r\n\r\n### Disclaimers\r\nTo the extent that any public or non-Intel datasets or models are referenced by or accessed using tools or code on this site those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.\r\n\r\nIntel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.\r\n","funding_links":[],"categories":["Table of Contents"],"sub_categories":["AI - Frameworks and Toolkits"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foneapi-src%2Fpowerline-fault-detection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foneapi-src%2Fpowerline-fault-detection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foneapi-src%2Fpowerline-fault-detection/lists"}