{"id":13604748,"url":"https://github.com/MachineLearningSystem/synergy","last_synced_at":"2025-04-12T02:31:37.291Z","repository":{"id":185461961,"uuid":"512959321","full_name":"MachineLearningSystem/synergy","owner":"MachineLearningSystem","description":null,"archived":false,"fork":true,"pushed_at":"2022-05-27T13:52:10.000Z","size":16107,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2024-10-20T18:34:57.518Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"msr-fiddle/synergy","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README-offline-profiler.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-07-12T01:29:23.000Z","updated_at":"2022-05-24T16:24:21.000Z","dependencies_parsed_at":null,"dependency_job_id":"57b055fd-713c-4b85-ad7e-6cf2ebd63ea8","html_url":"https://github.com/MachineLearningSystem/synergy","commit_stats":null,"previous_names":["machinelearningsystem/synergy"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fsynergy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fsynergy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fsynergy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fsynergy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":
"https://codeload.github.com/MachineLearningSystem/synergy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223489636,"owners_count":17153792,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:50.837Z","updated_at":"2025-04-12T02:31:37.286Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"## Looking Beyond GPUs for DNN Scheduling for Multi-Tenant GPU Clusters\n\n## Using the profiler\n\npython offline_profiler.py \u003clist of profiler options\u003e  \u003ctraining script\u003e  \u003cadditional training-script-specific args\u003e\n  \n  1.  If an argument appears in both the profiler options and the training script options, the profiler option takes precedence\n  2.  Profiler options\n      * --docker-img  : Docker container image name (full path, so that it can be downloaded if not present locally)\n      * --container-mnt: Mountpoint in the container where the dataset will be mounted. Default: /datadrive\n      * --num-gpus    : Number of GPUs used in training\n      * --nnodes      : Number of nodes used in training\n      * --master_addr : IP of the master node (same as torch.distributed.launch)\n      * --master_port : Free port on the master node (same as torch.distributed.launch)\n      * --cpu         : Max # CPUs to use. If not specified, uses all available CPUs on the server\n      * --memory      : Max memory (GB) to be used in profiling. 
If not specified, uses the maximum DRAM on the system\n      * -b            : Per-GPU batch size\n      * --max-iterations: Max iterations per profiling run\n      \n  3. The job script is expected to support the following options:\n      * --batch, -b   : Batch size per GPU\n      * --workers, -j : Number of CPU data workers\n      * --max_iterations: Return from training after this number of iterations\n      * --data        : Path to dataset\n\nThe list of all other supported options can be found using the following command:\n```\npython offline_profiler.py -h\n```\n\n### Run instructions\n\nSynergy docker\n```\n - git clone https://github.com/jayashreemohan29/Synergy-CoorDL.git\n - cd Synergy-CoorDL\n - git checkout iterator_chk\n - cd docker\n - CREATE_RUNNER=\"YES\" ./build.sh\n This will create a Docker image tagged nvidia/dali:py36_cu10.run\n\n```\n -  cd synergy-private/src\n -  cd profiler; ./prereq.sh; cd ..\n -  If you have issues using the Docker container, please ask us and we will provide a Docker tar (~8GB, hence not uploaded here). You can load the Docker image with dependencies installed: docker load -i dali_docker.tar\n -  python profiler/offline_profiler.py --job-name job-res18 --cpu 24 --num-gpus 4 -b 512 --docker-img=nvidia/dali:py36_cu10.run ../models/image_classification/pytorch-imagenet.py --dali --amp --dali_cpu --max-iterations 50 --workers 3 -b 512 --data '/datadrive/mnt4/jaya/datasets/imagenet/' | tee res18out.log\n - or execute ./run-cnns.sh to profile all image classification models. Please update the dataset path appropriately in the script.\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["GPU Cluster Management"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fsynergy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2Fsynergy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fsynergy/lists"}