{"id":18675459,"url":"https://github.com/manuparra/ht-condor","last_synced_at":"2025-07-31T04:02:08.880Z","repository":{"id":79159271,"uuid":"147478360","full_name":"manuparra/ht-condor","owner":"manuparra","description":"HT-Condor Deployment","archived":false,"fork":false,"pushed_at":"2018-10-10T08:35:56.000Z","size":546,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-12-27T20:33:55.250Z","etag":null,"topics":["cern","condor","high-availability","ht-condor","installation","setup"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/manuparra.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-05T07:32:33.000Z","updated_at":"2020-05-25T14:59:36.000Z","dependencies_parsed_at":"2023-02-28T01:00:56.974Z","dependency_job_id":null,"html_url":"https://github.com/manuparra/ht-condor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manuparra%2Fht-condor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manuparra%2Fht-condor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manuparra%2Fht-condor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manuparra%2Fht-condor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/manuparra","download_url":"https://codeload.github.com/manuparra/ht-condor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239520187,"owners_count":19652644,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cern","condor","high-availability","ht-condor","installation","setup"],"created_at":"2024-11-07T09:25:01.155Z","updated_at":"2025-02-18T17:42:35.469Z","avatar_url":"https://github.com/manuparra.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# HT-Condor Deployment @ CERN\n\n\nThis is a tutorial to install HT-Condor from scratch with an example node configuration, consisting of 1 master node, 2 scheduler nodes, 2 workers nodes. Later we will explain each role of each service or node. For this installation you need 5 working nodes, in this case the installation is based on CERN CentOS7, but it will work perfectly for CentOS7 or distributions based on RHEL. We will install HT-Condor 8.6.12.\n\n\nTOC:\n\n- [Requirements](#requirements)\n- [Some important concepts (please read before)](#some-important-concepts--please-read-before-)\n- [Schema of the deployment](#schema-of-the-deployment)\n- [Set-up](#set-up)\n  * [Master, Collector and Negotiator (Master Node)](#master--collector-and-negotiator--master-node-)\n  * [Schedulers (Scheduler 1 and 2 )](#schedulers--scheduler-1-and-2--)\n  * [Workers (Worker 1 and 2 )](#workers--worker-1-and-2--)\n- [Next steps](#next-steps)\n  * [Verify status of resources](#verify-status-of-resources)\n  * [Submiting a Job](#submiting-a-job)\n  * [Support for AFS](#support-for-afs)\n- [Full scripts Semi-Automated setup](#full-script)\n- [HT-Condor with Puppet](#ht--condor-with-puppet)\n- [References](#references)\n\n\n\n# Requirements\n\n- We need 5 nodes (Bare-Metal or Virtual Machine).\n- All nodes can be accessed with root account without a password (ssh password-less).\n- Provide each node with a hostname (like CondorCERN-X where X will be [1 to 5]).\n- Open port 9618 in the firewall of all nodes (HT-CONDOR default port is 9618).\n- We will use SELinux in all nodes; set the status to ``permissive``.\n- The Name and the Hostnames will be:\n  - Master Node : condortest-1.cern.ch\n  - Scheduler Node 1 : condortest-2.cern.ch\n  - Scheduler Node 2 : condortest-3.cern.ch\n  - Worker Node 1 : condortest-4.cern.ch\n  - Worker Node 2 : condortest-5.cern.ch\n\n\n# Some important concepts (please read before)\n\n- The most important daemon is the MASTER. This is the first daemon that is executed on each node where HTCondor runs. All the other HTCondor daemons are launched by the master, **after it has read the configuration files**.\n\n- The daemon that manages the pool is the **COLLECTOR**. As the name suggests, it collects the information from all the nodes running HTCondor and its services inside the pool. All the other daemons send their information to the collector, that can be queried to retrieve the status of the services. **There must be only one collector in each pool**.\n\n- Another daemon is responsible for instantiating and executing jobs on the **WORKERS** nodes. It is the **START** daemon, or **startd**. Each worker node has its own startd, that creates a starter daemon for each job that runs on the machine. The starter manages the input and output and monitors the job execution.\n\n- The daemon that collects the information on the jobs that must be executed, or in other words the daemon that receives the information on the submitted jobs, is called the **SCHEDULER** daemon, or **schedd**. More that one schedd can exist in the same pool. For example, one schedd can be used to manage dedicated resources and parallel jobs and one schedd for everything else. Schedd is the daemon that processes the submission details and manages the queue, but it does not decide where a job must run.\n\n- The daemon that assigns free worker nodes and waiting jobs is the **NEGOTIATOR**. This daemon checks the list of requirements of the queued jobs and looks for free matching resources, retrieving the information from the collector. After a job is associated to a free node, the communications start and the job data are sent to the startd of the machine that will execute it.\n\n\n# Schema of the deployment\n\nFor our installation we have used 5 nodes, with the following roles and services:\n\n- 1 Master Node containing: Collector , Negotiator\n- 2 Scheduler Nodes containing: Scheduler\n- 2 Worker Nodes containing: Worker\n\nThe global schema, containing services and directives (condor_config.local), is the next:\n\n![Esquema](/imgs/schema.png)\n\n# Set-up\n\n\nInstalling the repository and using the 8.6.12 version of HT-Condor.\n\n```\ncd /etc/yum.repos.d\ncurl -O http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel7.repo\ncurl -O  http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor\nrpm --import RPM-GPG-KEY-HTCondor\nyum  -y install condor-all-8.6.12\n/sbin/service condor start\nchkconfig condor on\n```\n\n\nFrom the master node (assuming you have password-less enabled in SSH):\n\nTemporary SELinux to 0 (``permissive`) (Skip this step if you have changed the SELinux mode to permissive):\n\n(Run from Master Node [condortest-1.cern.ch]):\n\n```\nssh condortest-1.cern.ch -o StrictHostKeyChecking=no \"setenforce 0\"\nssh condortest-2.cern.ch -o StrictHostKeyChecking=no \"setenforce 0\"\nssh condortest-3.cern.ch -o StrictHostKeyChecking=no \"setenforce 0\"\nssh condortest-4.cern.ch -o StrictHostKeyChecking=no \"setenforce 0\"\nssh condortest-5.cern.ch -o StrictHostKeyChecking=no \"setenforce 0\"\n\n``` \n\n\n## Master, Collector and Negotiator (Master Node)\n\nCreate the config file in Master Node (condortest-1.cern.ch) and restart the condor service (see the [condor_config.local](scripts/condortest-1/condor_config.local) file that will be created):\n\n(Run from Master Node [condortest-1.cern.ch]):\n\n```\nssh condortest-1.cern.ch -o StrictHostKeyChecking=no 'printf \"COLLECTOR_NAME = Collector N1\\nSCHEDD_HOST=condortest-2.cern.ch,condortest-3.cern.ch\\nDAEMON_LIST= MASTER, SCHEDD,COLLECTOR, NEGOTIATOR\\n\\nALLOW_NEGOTIATOR=condortest-1.cern.ch\\nALLOW_NEGOTIATOR_SCHEDD=condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch\\nALLOW_WRITE = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nALLOW_READ = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nHOSTALLOW_WRITE = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nHOSTALLOW_READ = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\n\" \u003e /etc/condor/condor_config.local'\n\nssh condortest-1 -o StrictHostKeyChecking=no '/sbin/service condor stop'\nssh condortest-1 -o StrictHostKeyChecking=no '/sbin/service condor start'\n```\n\n## Schedulers (Scheduler 1 and 2 )\n\nCreate the config file in Scheduler Nodes (condortest-2.cern.ch and condortest-2.cern.ch) and restart the condor service (See the condor_config.local for [Scheduler 1](scripts/condortest-2/condor_config.local) and [Scheduler 2](scripts/condortest-3/condor_config.local)):\n\n(Run from Master Node [condortest-1.cern.ch]):\n\n```\nssh condortest-2 -o StrictHostKeyChecking=no 'printf \"SCHEDD_NAME=Sched N2\\nCOLLECTOR_HOST=condortest-1.cern.ch\\nNEGOTIATOR_HOST=condortest-1.cern.ch\\nALLOW_NEGOTIATOR_SCHEDD=condortest-1.cern.ch\\nDAEMON_LIST= MASTER, SCHEDD\\n\\nALLOW_WRITE = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nALLOW_READ = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nHOSTALLOW_WRITE = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nHOSTALLOW_READ = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\n\" \u003e /etc/condor/condor_config.local'\nssh condortest-2 -o StrictHostKeyChecking=no '/sbin/service condor stop'\nssh condortest-2 -o StrictHostKeyChecking=no '/sbin/service condor start'\n```\n\n(Run from Master Node [condortest-1.cern.ch]):\n\n```\nssh condortest-3 -o StrictHostKeyChecking=no 'printf \"SCHEDD_NAME=Sched N3\\nCOLLECTOR_HOST=condortest-1.cern.ch\\nNEGOTIATOR_HOST=condortest-1.cern.ch\\nALLOW_NEGOTIATOR_SCHEDD=condortest-1.cern.ch\\nDAEMON_LIST= MASTER, SCHEDD\\n\\nALLOW_WRITE = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nALLOW_READ = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nHOSTALLOW_WRITE = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nHOSTALLOW_READ = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\n\" \u003e /etc/condor/condor_config.local'\nssh condortest-3 -o StrictHostKeyChecking=no '/sbin/service condor stop'\nssh condortest-3 -o StrictHostKeyChecking=no '/sbin/service condor start'\n```\n\n## Workers (Worker 1 and 2 )\n\nCreate the config file in Workers Nodes (condortest-4.cern.ch and condortest-5.cern.ch) and restart the condor service (See the condor_config.local for [Worker 1](scripts/condortest-4/condor_config.local) and [Worker 2](scripts/condortest-5/condor_config.local):\n\n(Run from Master Node [condortest-1.cern.ch]):\n\n```\nssh condortest-4 -o StrictHostKeyChecking=no 'printf \"COLLECTOR_HOST=condortest-1.cern.ch\\nNEGOTIATOR_HOST=condortest-1.cern.ch\\nSCHEDD_HOST=condortest-1.cern.ch,condortest-2.cern.ch,condortest-3.cern.ch\\nDAEMON_LIST= MASTER, STARTD\\n\\nALLOW_WRITE = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nALLOW_READ = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nHOSTALLOW_WRITE = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nHOSTALLOW_READ = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\n\" \u003e /etc/condor/condor_config.local'\nssh condortest-4 -o StrictHostKeyChecking=no '/sbin/service condor stop'\nssh condortest-4 -o StrictHostKeyChecking=no '/sbin/service condor start'\n```\n\n(Run from Master Node [condortest-1.cern.ch]):\n\n```\nssh condortest-5 -o StrictHostKeyChecking=no 'printf \"COLLECTOR_HOST=condortest-1.cern.ch\\nNEGOTIATOR_HOST=condortest-1.cern.ch\\nSCHEDD_HOST=condortest-1.cern.ch,condortest-2.cern.ch,condortest-3.cern.ch\\nDAEMON_LIST= MASTER, STARTD\\n\\nALLOW_WRITE = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nALLOW_READ = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nHOSTALLOW_WRITE = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\nHOSTALLOW_READ = condortest-1.cern.ch, condortest-2.cern.ch, condortest-3.cern.ch, condortest-4.cern.ch, condortest-5.cern.ch\\n\" \u003e /etc/condor/condor_config.local'\nssh condortest-5 -o StrictHostKeyChecking=no '/sbin/service condor stop'\nssh condortest-5 -o StrictHostKeyChecking=no '/sbin/service condor start'\n```\n\n\n## Next steps\n\nIn our configuration you will not be able to send jobs from the master. To send jobs you must use the nodes Schedulers 1 and 2. To do this connect to one of the nodes SCHEDULER (i.e. Scheduler node 1 [condortest-1.cern.ch])\n \n(Go to Scheduler 1 [condortest-1.cern.ch]))\n\n### Verify status of resources\n\nRun Resources Status (verify you have all resources (2 nodes, 2 cores = 4 resources))\n\n```\ncondor_status\n```\n\nit will returns\n\n```\nName                       OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime\n\nslot1@condortest-4.cern.ch LINUX      X86_64 Unclaimed Idle      0.020 1723  0+07:14:40\nslot2@condortest-4.cern.ch LINUX      X86_64 Unclaimed Idle      0.000 1723  0+07:15:07\nslot1@condortest-5.cern.ch LINUX      X86_64 Unclaimed Idle      0.000 1723  0+03:44:05\nslot2@condortest-5.cern.ch LINUX      X86_64 Unclaimed Idle      0.000 1723  0+03:44:05\n\n                     Machines Owner Claimed Unclaimed Matched Preempting  Drain\n\n        X86_64/LINUX        4     0       0         4       0          0      0\n\n               Total        4     0       0         4       0          0      0\n```\n\n### Submiting a Job\n\n\nLet's use an example Submit file and a simple program in python that executes a simple script that lasts 60 seconds. Copy both (submit file and executable) files to ``/tmp/``.\n\nSubmission file [hello.sub](examples/hello.sub) :\n\n```\n###################################\n#                                 #\n# Condor submit file for hello.py #\n# file name: hello.sub            #\n###################################\n\nexecutable      = /tmp/hello.py\nuniverse = vanilla\nshould_transfer_files   = YES\nwhen_to_transfer_output = ON_EXIT\n\noutput=/tmp/secondTestHello.$(Cluster).$(Process).out\nerror=/tmp/secondTestHello.$(Cluster).$(Process).error\nlog=/tmp/secondTestHello.$(Cluster).$(Process).log\n\nrequest_cpus = 1\nrequirements = (Arch == \"INTEL\" \u0026\u0026 OpSys == \"LINUX\") || (Arch == \"X86_64\" \u0026\u0026 OpSys ==\"LINUX\" )\n\nqueue 1\n```\n\n\nSubmission file [hello.py](examples/hello.py):\n\n\n```\n#!/usr/bin/python \nimport sys\nimport time\ni=1\nwhile i\u003c=60:\n        print i\n        i+=1\n        time.sleep(1)\nprint 2**8\n```\n\nSubmit de Job:\n\n\n```\ncondor_submit hello.sub\n```\n\n\n### Support for AFS\n\nIn order to enable AFS file system support in submissions, it is necessary to use the option ``-spool`` and ``condor_transfer_data all``, to allow writing in the file system to return the results to the corresponding scheduler node from a proxy. See [condor_transfer_data](http://research.cs.wisc.edu/htcondor/manual/v7.6.8/condor_transfer_data.html).\n\n\n```\ncondor_submit -spool hello.sub\n```\n\n```\ncondor_transfer_data --all\n```\n\n\n\n## Full script\n\n- Cloud-Init file, OpenStack initialization [Cloud-Init](openstack/deploy_fast.sh).\n- Auto deployment. To use from the master node [AutoDeploy](openstack/services_deploy.sh).\n\n\n## HT-Condor with Puppet\n\nTBC.\n\n\n## References\n\n- Daemons and more: http://personalpages.to.infn.it/~gariazzo/htcondor/concepts.html#daemons\n- CERN tutorial: https://indico.cern.ch/event/635217/\n- Quick start with HT-Condor: http://information-technology.web.cern.ch/services/fe/lxbatch/howto/quickstart-guide-htcondor\n- Variables in Condor: http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanuparra%2Fht-condor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmanuparra%2Fht-condor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanuparra%2Fht-condor/lists"}