{"id":15686997,"url":"https://github.com/vkuznet/mlaas4hep","last_synced_at":"2025-05-07T18:09:19.235Z","repository":{"id":33246669,"uuid":"156857396","full_name":"vkuznet/MLaaS4HEP","owner":"vkuznet","description":"Machine Learning as a Service for HEP","archived":false,"fork":false,"pushed_at":"2022-05-10T14:43:01.000Z","size":4742,"stargazers_count":9,"open_issues_count":1,"forks_count":10,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-05-07T18:07:31.687Z","etag":null,"topics":["hep","machine-learning","ml","ml-model","pytorch","root-cern","tensorflow"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vkuznet.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-11-09T12:10:42.000Z","updated_at":"2023-02-12T23:19:31.000Z","dependencies_parsed_at":"2022-08-08T21:00:06.108Z","dependency_job_id":null,"html_url":"https://github.com/vkuznet/MLaaS4HEP","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vkuznet%2FMLaaS4HEP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vkuznet%2FMLaaS4HEP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vkuznet%2FMLaaS4HEP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vkuznet%2FMLaaS4HEP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vkuznet","download_url":"https://codeload.github.com/vkuznet/MLaaS4HEP/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https:
//github.com","kind":"github","repositories_count":252931535,"owners_count":21827111,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hep","machine-learning","ml","ml-model","pytorch","root-cern","tensorflow"],"created_at":"2024-10-03T17:42:24.638Z","updated_at":"2025-05-07T18:09:19.173Z","avatar_url":"https://github.com/vkuznet.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"### Machine Learning as a Service for HEP\n\n[![Build Status](https://travis-ci.org/vkuznet/MLaaS4HEP.svg?branch=master)](https://travis-ci.org/vkuznet/MLaaS4HEP)\n[![License:MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/vkuznet/LICENSE)\n[![DOI](https://zenodo.org/badge/156857396.svg)](https://zenodo.org/badge/latestdoi/156857396)\n[![Tweet](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/intent/tweet?text=Machine%20Learning%20as%20a%20service%20for%20HEP%20community\u0026url=https://github.com/vkuznet/MLaaS4HEP\u0026hashtags=python,ml)\n\nMLaaS for HEP is a set of Python-based modules that read HEP data and\nstream it to the ML framework of the user's choice for training. 
It consists of three independent layers:\n- data streaming layer to handle remote data,\n  see [reader.py](https://github.com/vkuznet/MLaaS4HEP/blob/master/src/python/MLaaS4HEP/reader.py)\n- data training layer to train an ML model on given HEP data,\n  see [workflow.py](https://github.com/vkuznet/MLaaS4HEP/blob/master/src/python/MLaaS4HEP/workflow.py)\n- data inference layer,\n  see [tfaas_client.py](https://github.com/vkuznet/TFaaS/blob/master/src/python/tfaas_client.py)\n\nThe general architecture of MLaaS4HEP looks like this:\n![MLaaS4HEP-architecture](https://github.com/vkuznet/MLaaS4HEP/blob/master/images/MLaaS4HEP_arch_gen.png)\nEven though this architecture was originally developed for dealing with\nHEP ROOT files, we have extended it to other data formats. So far the following\ndata formats are supported: JSON, CSV, Parquet, ROOT. The first three support\nreading files from the local file system or HDFS, while the latter (ROOT) format allows\nreading ROOT files from the local file system or remote files via the xrootd protocol.\n\nPre-trained models can be easily uploaded to the\n[TFaaS](https://github.com/vkuznet/TFaaS) inference server, which serves them to clients.\n\n### Dependencies\nMLaaS4HEP relies on third-party libraries to support reading different\ndata formats. The main ones are:\n- [pyarrow](https://arrow.apache.org) for reading data from HDFS file system\n- [uproot](https://github.com/scikit-hep/uproot) for reading ROOT files\n- [numpy](https://www.numpy.org), [pandas](https://pandas.pydata.org) for data representation\n- [modin](https://github.com/modin-project/modin) for fast pandas support\n- [numba](https://numba.pydata.org) for speeding up individual functions\n\nFor ML modeling you may use your favorite framework, e.g. 
Keras, TensorFlow,\nscikit-learn, PyTorch, etc.\nWe therefore suggest using [anaconda](https://anaconda.org) to install the dependencies:\n```\n# to install pyarrow, uproot, numba, scikit-learn\nconda install -c conda-forge pyarrow uproot numba scikit-learn\n# to install pytorch\nconda install -c pytorch pytorch\n# to install TensorFlow, Keras, Numpy, Pandas\nconda install keras numpy pandas\n```\n\n### Installation\nThe easiest way to install and run\n[MLaaS4HEP](https://cloud.docker.com/u/veknet/repository/docker/veknet/mlaas4hep)\nand\n[TFaaS](https://cloud.docker.com/u/veknet/repository/docker/veknet/tfaas)\nis to use the pre-built docker images\n```\n# run MLaaS4HEP docker container\ndocker run veknet/mlaas4hep\n# run TFaaS docker container\ndocker run veknet/tfaas\n```\n\n### Reading ROOT files\nThe MLaaS4HEP python repository provides two base modules to read and manipulate\nHEP ROOT files. The `reader.py` module defines a DataReader class which is\nable to read either local or remote ROOT files (via xrootd). The `workflow.py`\nmodule provides a basic DataGenerator class which can be used with any ML\nframework to read HEP ROOT data in chunks. Both modules are based on the\n[uproot](https://github.com/scikit-hep/uproot) framework.\n\nBasic usage\n```\n# set up the proper environment, e.g. 
\n# export PYTHONPATH=/path/src/python # path to MLaaS4HEP python framework\n# export PATH=/path/bin:$PATH # path to MLaaS4HEP binaries\n\n# get help and option description\nreader --help\n\n# here is a concrete example of reading a local ROOT file:\nreader --fin=/opt/cms/data/Tau_Run2017F-31Mar2018-v1_NANOAOD.root --info --verbose=1 --nevts=2000\n\n# here is an example of reading a remote ROOT file:\nreader --fin=root://cms-xrd-global.cern.ch//store/data/Run2017F/Tau/NANOAOD/31Mar2018-v1/20000/6C6F7EAE-7880-E811-82C1-008CFA165F28.root --verbose=1 --nevts=2000 --info\n\n# both of the aforementioned commands produce the following output\nFirst pass: 2000 events, 35.4363200665 sec, shape (2316,) 648 branches: flat 232 jagged\nVMEM used: 960.479232 (MB) SWAP used: 0.0 (MB)\nNumber of events  : 1131872\n# flat branches   : 648\n...  # followed by a long list of ROOT branches found along with their dimensionality\nTrigObj_pt values in [5.03515625, 1999.75] range, dim=21\n```\n\nMore examples of using uproot can be found\n[here](https://github.com/jpivarski/jupyter-talks/blob/master/2017-10-13-lpc-testdrive/uproot-introduction-evaluated.ipynb)\nand\n[here](https://github.com/jpivarski/jupyter-talks/blob/master/2017-10-13-lpc-testdrive/nested-structures-evaluated.ipynb)\n\n### How to train ML model on HEP ROOT data\nHEP data are stored in the [ROOT](https://root.cern.ch/) data format.\nThe [DataReader](https://github.com/vkuznet/MLaaS4HEP/blob/master/src/python/MLaaS4HEP/reader.py#L542)\nclass provides access to ROOT files and various APIs to access the HEP data.\n\nA simple workflow example can be found in the\n[workflow.py](https://github.com/vkuznet/MLaaS4HEP/blob/master/src/python/MLaaS4HEP/workflow.py)\ncode, which executes a full HEP ML workflow, i.e. it can read remote files and train\nML models on HEP ROOT files.\n\n\nIf you clone the repo and set up your PYTHONPATH, you should be able to run it as\nsimply as\n\n```\n# set up the proper environment, e.g. 
\n# export PYTHONPATH=/path/src/python # path to MLaaS4HEP python framework\n# export PATH=/path/bin:$PATH # path to MLaaS4HEP binaries\n\nworkflow --help\n\n# run the code with a list of LFNs from files.txt and the labels file labels.txt\nworkflow --files=files.txt --labels=labels.txt\n\n# run pytorch example\nworkflow --files=files.txt --labels=labels.txt --model=ex_pytorch.py\n\n# run keras example\nworkflow --files=files.txt --labels=labels.txt --model=ex_keras.py\n\n# cat files.txt\n#dasgoclient -query=\"file dataset=/Tau/Run2018C-14Sep2018_ver3-v1/NANOAOD\"\n/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/069A01AD-A9D0-7C4E-8940-FA5990EDFFCE.root\n/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/577AF166-478C-1F40-8E10-044AA4BC0576.root\n/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/9A661A77-58AC-0245-A442-8093D48A6551.root\n/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/C226A004-077B-7E41-AFB3-6AFB38D1A63B.root\n/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/D1E05C97-DB14-3941-86E8-C510D602C0B9.root\n/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/6FA4CC7C-8982-DE4C-BEED-C90413312B35.root\n/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/282E0083-6B41-1F42-B665-973DF8805DE3.root\n\n# cat labels.txt\n1\n0\n1\n0\n1\n1\n1\n\n# run keras example and save our model into external file\nworkflow --files=files.txt --labels=labels.txt --model=ex_keras.py --fout=model.pb\n```\n\n`workflow.py` relies on two JSON files: one contains parameters for\nreading ROOT files and the other the specification of ROOT branches. The latter\nis generated by reading the ROOT file itself.\n\n### How to train a model using other data formats\nYou may use `workflow.py` with other data formats, e.g. CSV, JSON, Parquet,\nto train your model. The procedure is identical to that for HEP ROOT files.\n```\n# prepare your files.txt and labels.txt files, e.g. 
here we show an example\n# of using gzipped JSON files located on HDFS\ncat files.txt\nhdfs:///path/file1.json.gz\nhdfs:///path/file2.json.gz\n\n# optionally define your preprocessing function, see example in ex_preproc.py\n\n# run workflow with your set of files, labels, model and preprocessing function\n# and save it into model.pb file\nworkflow --files=files.txt --labels=labels.txt --model=ex_keras.py --preproc=ex_preproc.py --fout=model.pb\n```\n\nWe provide a more comprehensive example\n[here](doc/hdfs-example.md)\n\n### HEP resnet\nWe provide the full code of `hep_resnet.py`, a basic model based on the\n[ResNet](https://github.com/raghakot/keras-resnet) implementation.\nIt can classify images from HEP events, e.g.\n```\nhep_resnet.py --fdir=/path/hep_images --flabels=labels.csv --epochs=200 --mdir=models\n```\nHere we supply the input directory `/path/hep_images`, which contains HEP images\nin the `train` folder, along with the `labels.csv` file, which provides the labels.\nThe model runs for 200 epochs and saves the Keras/TF model into the `models` output\ndirectory.\n\n### TFaaS inference server\nWe provide the inference server in the separate\n[TFaaS](https://github.com/vkuznet/tfaas)\nrepository. It contains a full set of instructions on how to build and set it up.\n\n### TFaaS client\nTo access your ML model in the TFaaS inference server you only need to rely\non the HTTP protocol. Please see the [TFaaS](https://github.com/vkuznet/tfaas)\nrepository for more information.\n\nFor convenience we also provide a pure python\n[client](https://github.com/vkuznet/TFaaS/blob/master/src/python/tfaas_client.py)\nto perform all necessary actions against the TFaaS server. 
Here is a short\ndescription of the available APIs:\n\n```\n# set up url to point to your TFaaS server\nurl=http://localhost:8083\n\n# create upload json file, which should include\n# fully qualified model file name\n# fully qualified labels file name\n# model name you want to assign to your model file\n# fully qualified parameters json file name\n# For example, here is a sample of upload json file\n{\n    \"model\": \"/path/model_0228.pb\",\n    \"labels\": \"/path/labels.txt\",\n    \"name\": \"model_name\",\n    \"params\":\"/path/params.json\"\n}\n\n# upload given model to the server\ntfaas_client.py --url=$url --upload=upload.json\n\n# list existing models in TFaaS server\ntfaas_client.py --url=$url --models\n\n# delete given model in TFaaS server\ntfaas_client.py --url=$url --delete=model_name\n\n# prepare input json file for querying model predictions\n# here is an example of such file\n{\"keys\":[\"attribute1\", \"attribute2\"], \"values\": [1.0, -2.0]}\n\n# get predictions from TFaaS server\ntfaas_client.py --url=$url --predict=input.json\n\n# get image predictions from TFaaS server\n# here we refer to the ImageModel model uploaded to TFaaS\ntfaas_client.py --url=$url --image=/path/file.png --model=ImageModel\n```\n\n### Citation\nPlease use this publication for further citation:\n[DOI: 10.1007/s41781-021-00061-3](https://doi.org/10.1007/s41781-021-00061-3)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvkuznet%2Fmlaas4hep","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvkuznet%2Fmlaas4hep","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvkuznet%2Fmlaas4hep/lists"}