{"id":13574443,"url":"https://github.com/oneapi-src/predictive-asset-health-analytics","last_synced_at":"2025-04-04T15:31:00.120Z","repository":{"id":45002285,"uuid":"506429945","full_name":"oneapi-src/predictive-asset-health-analytics","owner":"oneapi-src","description":"AI Starter Kit for Predictive Asset Maintenance using Intel® optimized version of XGBoost","archived":true,"fork":false,"pushed_at":"2024-02-01T23:51:28.000Z","size":528,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-05T09:44:36.601Z","etag":null,"topics":["machine-learning","scikit-learn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oneapi-src.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-22T23:00:29.000Z","updated_at":"2024-04-08T18:31:35.000Z","dependencies_parsed_at":"2024-02-13T00:49:36.383Z","dependency_job_id":null,"html_url":"https://github.com/oneapi-src/predictive-asset-health-analytics","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oneapi-src%2Fpredictive-asset-health-analytics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oneapi-src%2Fpredictive-asset-health-analytics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oneapi-src%2Fpredictive-asset-health-analytics/releases","manifests_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oneapi-src%2Fpredictive-asset-health-analytics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oneapi-src","download_url":"https://codeload.github.com/oneapi-src/predictive-asset-health-analytics/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247202631,"owners_count":20900820,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","scikit-learn"],"created_at":"2024-08-01T15:00:51.672Z","updated_at":"2025-04-04T15:30:55.110Z","avatar_url":"https://github.com/oneapi-src.png","language":"Python","funding_links":[],"categories":["Table of Contents"],"sub_categories":["AI - Frameworks and Toolkits"],"readme":"PROJECT NOT UNDER ACTIVE MANAGEMENT\n\nThis project will no longer be maintained by Intel.\n\nIntel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.  \n\nIntel no longer accepts patches to this project.\n\nIf you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.  
\n\nContact: webadmin@linux.intel.com\n# Predictive Asset Health Analytics\n\n## Introduction\nCreate an end-to-end predictive asset maintenance solution to predict defects and anomalies before they happen with XGBoost* from [Intel® oneAPI AI Analytics Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html) (oneAPI). Check out more workflow examples in the [Developer Catalog](https://developer.intel.com/aireferenceimplementations).\n\n## **Table of Contents**\n\n- [Solution Technical Overview](#solution-technical-overview)\n- [Validated Hardware Details](#validated-hardware-details)\n- [How it Works](#how-it-works)\n- [Get Started](#get-started)\n  - [Download the Workflow Repository](#download-the-workflow-repository)\n  - [Set Up Conda](#set-up-conda)\n  - [Set Up Environment](#set-up-environment)\n- [Ways to run this reference use case](#ways-to-run-this-reference-use-case)\n  - [Run Using Bare Metal](#run-using-bare-metal)\n  - [Run Using Jupyter Notebook](#run-using-jupyter-notebook)\n- [Expected Output](#expected-output)\n- [Summary and Next Steps](#summary-and-next-steps)\n- [Learn More](#learn-more)\n- [Support](#support)\n- [Appendix](#appendix)\n\n\n## Solution Technical Overview\n\nPredictive asset maintenance is a method that uses data analysis tools to predict defects and anomalies before they happen. Solutions of huge scale typically require operating across multiple hardware architectures. Accelerating training for the ever-increasing size of datasets and machine learning models is a major challenge when adopting AI (Artificial Intelligence).\n\nIn an industrial scenario, it is important to shorten the MLOps (Machine Learning Operations) time for developing and deploying new models, which can be challenging because datasets keep growing over time. The XGBoost* classifier with the HIST tree method addresses this problem by improving the overall training, tuning, and validation time. 
A model that serves large batch-processing workloads requires fast prediction with little loss of accuracy; daal4py helps the XGBoost* machine learning model meet these criteria.\n\nThe solution contained in this repo uses the following Intel® packages:\n\n* ***Intel® Distribution for Python****\n\n  The [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html#gs.52te4z) provides:\n\n    * Scalable performance using all available CPU cores on laptops, desktops, and powerful servers\n    * Support for the latest CPU instructions\n    * Near-native performance through acceleration of core numerical and machine learning packages with libraries like the Intel® oneAPI Math Kernel Library (oneMKL) and Intel® oneAPI Data Analytics Library\n    * Productivity tools for compiling Python code into optimized instructions\n    * Essential Python bindings for easing integration of Intel® native tools with your Python* project\n\n* ***Intel® Distribution of Modin****\n\n    Modin* is a drop-in replacement for pandas, enabling data scientists to scale to distributed DataFrame processing without having to change API code. 
[Intel® Distribution of Modin*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html) adds optimizations to further accelerate processing on Intel hardware.\n\nFor more details, visit [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html#gs.52te4z), [Intel® Distribution of Modin*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html), the [Predictive Asset Health Analytics](https://github.com/oneapi-src/predictive-asset-health-analytics) GitHub repository, the [XGBoost* documentation webpage](https://xgboost.readthedocs.io/en/stable/) and the [daal4py documentation webpage](https://intelpython.github.io/daal4py/).\n\n## Validated Hardware Details \n\n[Intel® oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html#gs.52tat6) is used to achieve quick results even when the data for a model is huge. It provides the capability to reuse code written in different languages so that hardware utilization is optimized to provide these results.\n\n| Recommended Hardware\n| ----------------------------\n| CPU: Intel® 2nd Gen Xeon® Platinum 8280 CPU @ 2.70GHz or higher\n| RAM: 187 GB\n| Recommended Free Disk Space: 20 GB or more\n\nCode was tested on Ubuntu\\* 22.04 LTS.\n\n## How it Works\n\nThis reference kit generates a dataset of a given row count for a predictive asset maintenance analytics use case and stores it in `.pkl` format. 
The data is split into two subsets: the first trains the XGBoost* model and the second is used to test the model's prediction capabilities.\n\nThe diagram below presents the stages that compose the end-to-end workflow.\n\n![Use_case_flow](assets/predictive_asset_maintenance_e2e_flow.png)\n\n\n## Get Started\nStart by defining environment variables that store the workspace, data, and output paths. These directories will be created in later steps, and all commands will refer to them using absolute paths.\n\n[//]: # (capture: baremetal)\n```bash\nexport WORKSPACE=$PWD/predictive-health-analytics\nexport DATA_DIR=$WORKSPACE/data\nexport OUTPUT_DIR=$WORKSPACE/output\n```\n### Download the Workflow Repository\nCreate a working directory for the workflow and clone the [Main\nRepository](https://github.com/oneapi-src/predictive-asset-health-analytics) into your working\ndirectory.\n\n[//]: # (capture: baremetal)\n```bash\nmkdir -p $WORKSPACE \u0026\u0026 cd $WORKSPACE\n```\n\n```\ngit clone https://github.com/oneapi-src/predictive-asset-health-analytics.git $WORKSPACE\n```\n\n[//]: # (capture: baremetal)\n```bash\nmkdir -p $DATA_DIR $OUTPUT_DIR/logs\n```\n### Set Up Conda\nTo learn more, please visit [install anaconda on Linux](https://docs.anaconda.com/free/anaconda/install/linux/). \n```bash\nwget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh\n```\n\n### Set Up Environment\nThe conda yaml dependencies are kept in `$WORKSPACE/env/intel_env.yml`.\n\n| **Packages required in YAML file:**                 | **Version:**\n| :---                          | :--\n| `python`  | 3.10\n| `intelpython3_full`  | 2024.0.0\n| `modin-all`  | 0.24.1\n\nRun the following command to set up the Intel® Distribution for Python* inside a conda environment:\n```bash\nconda env create -f $WORKSPACE/env/intel_env.yml --no-default-packages\n```\n\nEnvironment setup is required only once. 
This step does not clean up an existing environment with the same name, so make sure no conda environment with that name exists. During this setup a new conda environment will be created with the dependencies listed in the YAML configuration.\n\nOnce the environment is created, activate it with the following conda command:\n```bash\nconda activate predictive_maintenance_intel\n```\n\n## Ways to run this reference use case\nYou can execute the reference pipelines using the following environments:\n* Bare Metal\n* Jupyter Notebook\n\n---\n\n### Run Using Bare Metal\n\n#### Set Up System Software\nOur examples use the `conda` package and environment on your local computer. If you don't already have `conda` installed or the `conda` environment created, go to [Set Up Conda*](#set-up-conda) or see the [Conda* Linux installation instructions](https://docs.conda.io/projects/conda/en/stable/user-guide/install/linux.html).\n\n\n#### Run Workflow\nRun the bash script below, located in ```$WORKSPACE```, to create the test dataset and train the model using pandas/modin. \n```sh\nbash $WORKSPACE/run_dataset.sh\n```\n| **Option** | **Values**\n| :--        | :--\n| Dataset Size | `25K to 10M`\n| Hyperparameter tuning | `notuning` - Training without hyperparameter tuning\u003cbr\u003e`hyperparametertuning` - Training with hyperparameter tuning\n| Number of CPU cores | Based on the total number of cores available on the execution environment\n\nThis stage invokes two Python scripts to generate the test dataset with the chosen size and to train the model with the selected data package library. The data generation process will create a folder with the name of the active conda environment; all the dataset and log files will be captured. 
The dataset file will be saved in pickle format and will be reused in later test runs in the same environment for the same dataset size.\n\nAn example option selection for pandas with a 1M dataset size is shown below:\n\n```\n        0. 25000\n        1. 50000\n        2. 100000\n        3. 200000\n        4. 400000\n        5. 800000\n        6. 1000000\n        7. 2000000\n        8. 4000000\n        9. 8000000\n        10. 10000000\nSelect dataset size: 6\n        0. notuning\n        1. hyperparametertuning\nSelect tuning option: 0\nNumber of CPU cores to be used for the training: 8\n```\n\nLog files will be generated in the following locations:\n```bash\n$OUTPUT_DIR/logs/logfile_pandas_\u003cdataset_size\u003e_\u003ctimestamp\u003e.log\n$OUTPUT_DIR/logs/logfile_train_predict_\u003cdataset_size\u003e_\u003ctimestamp\u003e.log\n```\nThe test data pickle file will be generated in the following location:\n```bash\n$DATA_DIR/data_\u003cdataset_size\u003e.pkl\n```\nAlternatively, you can run the `generate_data_pandas.py` and `train_predict_pam.py` scripts, described below, instead of `run_dataset.sh`; running each Python script independently provides more options to experiment with. 
`generate_data_pandas.py` creates the dataset, and `train_predict_pam.py` runs training and prediction with the previously generated dataset.\n\nThe dataset generation script uses the following optional arguments:\n\n```bash\nusage: src/generate_data_pandas.py [-h] [-s SIZE] [-f FILE]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -s SIZE, --size SIZE  data size which is number of rows\n  -f FILE, --file FILE  output pkl file name\n  -d, --debug           Changes logging level from INFO to DEBUG\n```\n\nFor example, the command below generates a dataset of 25k rows and saves the log file.\n\n[//]: # (capture: baremetal)\n```bash\nexport DATASIZE=25000\nexport OF=$OUTPUT_DIR/logs/logfile_pandas_${DATASIZE}_$(date +%Y%m%d%H%M%S).log \npython $WORKSPACE/src/generate_data_pandas.py -s ${DATASIZE} -f $DATA_DIR/dataset_${DATASIZE}.pkl 2\u003e\u00261 | tee $OF\necho \"Logfile saved: $OF\"\n```\nTraining and prediction, along with hyperparameter tuning, can also be executed independently with the following arguments:\n```bash\nusage: src/train_predict_pam.py [-h] [-f FILE] [-p PACKAGE] [-t TUNING] [-cv CROSS_VALIDATION] [-patch PATCH_SKLEARN]\n                            -ncpu NUM_CPU\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -f FILE, --file FILE  input pkl file name\n  -p PACKAGE, --package PACKAGE\n                        data package to be used (pandas, modin)\n  -t TUNING, --tuning TUNING\n                        hyper parameter tuning (0/1)\n  -cv CROSS_VALIDATION, --cross-validation CROSS_VALIDATION\n                        cross validation iteration\n  -ncpu NUM_CPU, --num-cpu NUM_CPU\n                        number of cpu cores, default 4.\n  -d, --debug           \n                        changes logging level from INFO to DEBUG\n```\nFor example, the command below takes the 25k dataset pkl file generated in the previous example and performs the training and 
prediction using the XGBoost* classifier algorithm.\n\n[//]: # (capture: baremetal)\n```bash\nexport PACKAGE=\"pandas\"\nexport TUNING=0\nexport NCPU=20\nexport CROSS_VAL=4\nexport OF=$OUTPUT_DIR/logs/logfile_train_predict_${DATASIZE}_$(date +%Y%m%d%H%M%S).log \npython $WORKSPACE/src/train_predict_pam.py -f $DATA_DIR/dataset_${DATASIZE}.pkl -t $TUNING -ncpu $NCPU -p $PACKAGE -cv $CROSS_VAL 2\u003e\u00261  | tee -a $OF \necho \"Logfile saved: $OF\"\n```\n\n#### XGBoost* with oneDAL Python Wrapper (daal4py) model\nTo further improve prediction time, the trained XGBoost* machine learning model can be converted to a daal4py model. daal4py speeds up execution of the XGBoost* machine learning algorithm on the underlying hardware by utilizing the Intel® oneAPI Data Analytics Library (oneDAL).\n\nThe previously generated pkl file is used as input for this Python script. \n```bash\nusage: src/daal_xgb_model.py [-h] [-f FILE]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -f FILE, --file FILE  input pkl file name\n  -d, --debug           changes logging level from INFO to DEBUG\n```\nRun the following command to train the model with the given dataset, convert it to daal4py format, and measure the prediction time performance.\n\n[//]: # (capture: baremetal)\n```bash\npython $WORKSPACE/src/daal_xgb_model.py -f  $DATA_DIR/dataset_${DATASIZE}.pkl\n```\n#### Clean Up Bare Metal\nBefore proceeding to the cleaning process, it is strongly recommended to make a backup of any data that you want to keep. 
To clean the previously downloaded and generated data, run the following commands:\n```bash\nconda deactivate #Run line if predictive_maintenance_intel is active\nconda env remove -n predictive_maintenance_intel\nrm -rf $OUTPUT_DIR $DATA_DIR $WORKSPACE\n```\n\n---\n### Run Using Jupyter Notebook\nBefore continuing, complete the steps described in [Get Started](#get-started).\n\n#### Create and activate conda environment\nTo run `PredictiveMaintenance.ipynb`, a [conda environment](#set-up-environment) must be created, activated, and extended with JupyterLab:\n```bash\nconda activate predictive_maintenance_intel\nconda install -c intel -c conda-forge nb_conda_kernels jupyterlab -y\n```\nRun the following commands from the project root directory. The environment variables must be set in the same terminal that will run Jupyter Notebook.\n```bash\ncd $WORKSPACE\njupyter lab\n```\nOpen Jupyter Notebook in a web browser, select `PredictiveMaintenance.ipynb` and select `conda env:predictive_maintenance_intel` as the Jupyter kernel. 
Now you can follow the notebook's instructions step by step.\n\n#### Clean Up Jupyter Notebook\nTo clean Jupyter Notebook follow the instructions described in [Clean Up Bare Metal](#clean-up-bare-metal).\n\n## Expected Output\nA successful execution of ```generate_data_pandas.py``` should return similar results as shown below:\n\n```\nINFO:__main__:Generating data with the size 25000\nINFO:__main__:changing Tele_Attatched into an object variable\nINFO:__main__:Generating our target variable Asset_Label\nINFO:__main__:Creating correlation between our variables and our target variable\nINFO:__main__:When age is 60-70 and over 95 change Asset_Label to 1\nINFO:__main__:When elevation is between 500-1500 change Asset_Label to 1\nINFO:__main__:When Manufacturer is A, E, or H change Asset_Label to have  95% 0's\nINFO:__main__:When Species is C2 or C5 change Asset_Label to have 90% to 0's\nINFO:__main__:When District is NE or W change Asset_Label to have 90% to 0's\nINFO:__main__:When District is Untreated change Asset_Label to have 70% to 1's\nINFO:__main__:When Age is greater than 90 and Elevaation is less than 1200              and Original_treatment is Oil change Asset_Label to have 90% to 1's\nINFO:__main__:=====\u003e Time taken 0.049012 secs for data generation for the size of (25000, 34)\nINFO:__main__:Saving the data to /localdisk/aagalleg/frameworks.ai.platform.sample-apps.predictive-health-analytics/predictive-health-analytics/data/dataset_25000.pkl ...\nINFO:__main__:DONE\n```\n\nA successful execution of ```train_predict_pam.py``` should return similar results as shown below:\n\n```\nINFO:__main__:=====\u003e Total Time:\n6.791231 secs for data size (800000, 34)\nINFO:__main__:=====\u003e Training Time 3.459683 secs\nINFO:__main__:=====\u003e Prediction Time 0.281359 secs\nINFO:__main__:=====\u003e XGBoost accuracy score 0.921640\nINFO:__main__:DONE\n```\n\nA successful execution of ```daal_xgb_model.py``` should return similar results as shown 
below:\n\n```\nINFO:__main__:Reading the dataset from ./data/data_800000.pkl...\nINFO:root:sklearn.model_selection.train_test_split: running accelerated version on CPU\nINFO:root:sklearn.model_selection.train_test_split: running accelerated version on CPU\nINFO:__main__:XGBoost training time (seconds): 74.001453\nINFO:__main__:XGBoost inference time (seconds): 0.054897\nINFO:__main__:DAAL conversion time (seconds): 0.366412\nINFO:__main__:DAAL inference time (seconds): 0.017998\nINFO:__main__:XGBoost errors count: 15622\nINFO:__main__:XGBoost accuracy: 0.921890\nINFO:__main__:Daal4py errors count: 15622\nINFO:__main__:Daal4py accuracy: 0.921890\nINFO:__main__:XGBoost Prediction Time: 0.054897\nINFO:__main__:daal4py Prediction Time: 0.017998\nINFO:__main__:daal4py time improvement relative to XGBoost: 0.672158\nINFO:__main__:Accuracy Difference 0.000000\n```\n\n## Summary and Next Steps\n\nPredictive asset maintenance solutions of huge scale typically require acceleration in training and prediction for the ever-increasing size of datasets without changing the existing computing resources in order to make their solutions feasible and economically attractive for Utility customers. 
This reference kit implementation provides a performance-optimized guide for utility asset maintenance that can be easily scaled to similar use cases.\n\n\n## Learn More\nFor more information about predictive asset maintenance or to read about other relevant workflow examples, see these guides and software resources:\n\n- [Intel® AI Analytics Toolkit (AI Kit)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html)\n- [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html#gs.52te4z)\n- [Intel® Distribution of Modin*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html)\n- [XGBoost Documentation](https://xgboost.readthedocs.io/en/stable/)\n- [Fast, Scalable and Easy Machine Learning With DAAL4PY](https://intelpython.github.io/daal4py/)\n\n## Support\n\nThe End-to-end Predictive Asset Health Analytics team tracks both bugs and\nenhancement requests using [GitHub\nissues](https://github.com/oneapi-src/predictive-asset-health-analytics/issues).\nBefore submitting a suggestion or bug report, search the existing [GitHub\nissues](https://github.com/oneapi-src/predictive-asset-health-analytics/issues) to\nsee if your issue has already been reported.\n\n## Appendix\n\n\\*Names and brands that may be claimed as the property of others. [Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html).\n\n### Disclaimers\n\nTo the extent that any public or non-Intel datasets or models are referenced by or accessed using tools or code on this site those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. 
By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.\n\nIntel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foneapi-src%2Fpredictive-asset-health-analytics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foneapi-src%2Fpredictive-asset-health-analytics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foneapi-src%2Fpredictive-asset-health-analytics/lists"}