# HTCondor - Useful commands & troubleshooting
## Getting Help

- The documentation for HTCondor is extensive. Answers to many general questions can be found easily by including `HTCondor` in your search terms.
- The Danforth Center Bioinformatics Core has a Slack channel: https://danforthcenter.slack.com/
    - A (free) account is required to join.
    - This is an excellent in-house resource for asking questions of other scientists doing bioinformatics.
## Basic commands
### View the current status of the Bioinformatics cluster

`condor_status`
In the example output below, five servers are currently configured to run jobs. A server that is completely idle appears as a single slot. A server running one or more jobs is divided into multiple slots: one per running job, plus one slot holding the remaining idle resources.
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@aerilon.ddps LINUX X86_64 Unclaimed Idle 0.000 257769 2+18:30:21
slot1@pallas.ddpsc LINUX X86_64 Unclaimed Idle 0.000 514276 2+18:46:43
slot1_10@pallas.dd LINUX X86_64 Claimed Busy 1.000 128 0+00:10:27
slot1_11@pallas.dd LINUX X86_64 Claimed Busy 1.000 128 0+00:10:27
slot1_1@pallas.ddp LINUX X86_64 Claimed Busy 47.330 128 0+02:50:26
slot1_2@pallas.ddp LINUX X86_64 Claimed Busy 1.000 128 0+00:10:28
slot1_3@pallas.ddp LINUX X86_64 Claimed Busy 1.000 128 0+00:10:28
slot1_4@pallas.ddp LINUX X86_64 Claimed Busy 1.000 128 0+00:10:28
slot1_5@pallas.ddp LINUX X86_64 Claimed Busy 1.000 128 0+00:10:28
slot1_6@pallas.ddp LINUX X86_64 Claimed Busy 1.000 128 0+00:10:28
slot1_7@pallas.ddp LINUX X86_64 Claimed Busy 1.000 128 0+00:10:27
slot1_8@pallas.ddp LINUX X86_64 Claimed Busy 1.000 128 0+00:10:27
slot1_9@pallas.ddp LINUX X86_64 Claimed Busy 1.000 128 0+00:10:27
slot1@pegasus.ddps LINUX X86_64 Unclaimed Idle 1.000 257769 2+18:40:56
slot1@tauron.ddpsc LINUX X86_64 Unclaimed Idle 1.000 245481 1+19:55:15
slot1_5@tauron.ddp LINUX X86_64 Claimed Busy 0.000 12288 1+14:44:40
slot1@thanatos.ddp LINUX X86_64 Unclaimed Idle 2.000 499558 6+20:46:23
Machines Owner Claimed Unclaimed Matched Preempting
X86_64/LINUX 17 0 12 5 0 0
Total 17 0 12 5 0 0
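The slot table is easy to summarize with standard shell tools. A minimal sketch, counting slots by state; the heredoc stands in for live output, and on the cluster you would pipe `condor_status` into the same `awk` command instead:

```shell
# Count slots by State (4th column of condor_status output).
# On the cluster: condor_status | awk '/^slot/ {n[$4]++} END {for (s in n) print s, n[s]}'
awk '/^slot/ {n[$4]++} END {for (s in n) print s, n[s]}' <<'EOF' | sort
slot1@aerilon.ddps LINUX X86_64 Unclaimed Idle 0.000 257769 2+18:30:21
slot1_1@pallas.ddp LINUX X86_64 Claimed Busy 1.000 128 0+02:50:26
slot1@pegasus.ddps LINUX X86_64 Unclaimed Idle 1.000 257769 2+18:40:56
EOF
```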
### Status of the HTCondor queue

`condor_q`
In the example below, one job is in the queue and is running. The ST column lists the state of each job: I = Idle, R = Running, H = Held, X = Removed, C = Completed, S = Suspended, < = Transferring input files, > = Transferring output files.
-- Schedd: stargate.bioinformatics.danforthcenter.org : <10.5.1.63:15151?...
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
30.0 nfahlgren 3/3 22:43 0+00:00:02 R 0 3.2 samtools view -b -
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
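Two standard `condor_q` variations are worth knowing: `condor_q -nobatch` lists one line per job, and `condor_q -hold` shows held jobs along with their hold reasons. The summary line is also easy to parse; for example, pulling out the running-job count:

```shell
# Extract the running-job count from the condor_q summary line.
# The echoed text stands in for live output; on the cluster, use:
#   condor_q | grep -o '[0-9]* running'
echo '1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended' \
    | grep -o '[0-9]* running'
```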
`condor_fullstat`

Provides a table of data for currently running jobs, compiled from `condor_q` and `condor_status`. Especially useful for monitoring actual job usage versus the requested resources.
ID Owner Host CPUs CPUs (%) Memory (GB) Memory (%) Disk (GB) Disk (%) Run Time Cmd
----- ------------- ------------------ ------ ---------- ------------- ------------ ----------- ---------- ----------- -------------
39768 clizarraga pacifica.ddpsc.org 4 0.2 0 0 0 100 28:00:00:41 (interactive)
40655 ebertolini scorpia.ddpsc.org 40 2.5 2.9 2.9 0 100 19:04:45:43 lncrna.sh
40706 abasaro tauron.ddpsc.org 1 106 11.9 79.5 0 100 15:19:22:04 varscan
41814 ebertolini pallas.ddpsc.org 20 0 0 0 0 100 00:05:38:37 (interactive)
41834 abasaro pacifica.ddpsc.org 10 10 0 0 0 100 00:03:09:26 (interactive)
41853 rparvathaneni pacifica.ddpsc.org 5 20 0 0 0 100 00:00:06:20 awk.sh
## Analyzing jobs that are on Hold or Idle

If, after submitting your job, the status listed by `condor_q` is "H" (held) or "I" (idle), your job is not running. There are a number of possible reasons, and running `condor_q -analyze [jobID]` will give you more details. In the first and third examples below, the job was removed (`condor_rm [jobID]`) and its settings adjusted. Some examples include:
- Missing/error with executable or other input file
[JobID]: Request is held.
Hold reason: Error from slot1@pallas.ddpsc.org: Failed to execute '/script.pl' with arguments Single_Astral/ Solanum,Olea: (errno=13: 'Permission denied')
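Here `errno=13 ('Permission denied')` usually means the script is not marked executable. A hedged sketch of the fix; the path is a placeholder standing in for the one in your hold message, and `[jobID]` is your job's ID:

```shell
chmod +x /path/to/script.pl   # give the script execute permission
condor_release [jobID]        # release the held job so it can be matched again
```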
- Requested resources currently unavailable. This job was listed as "I" and ended up being the next to run in the queue.
[jobID]: Run analysis summary. Of 34 machines,
24 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
10 match but are serving other users
0 are available to run your job
No successful match recorded.
Last failed match: Tue Apr 19 16:28:20 2016
Reason for last match failure: PREEMPTION_REQUIREMENTS == False
The Requirements expression for your job is:
( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
( TARGET.Cpus >= RequestCpus ) && ( ( TARGET.HasFileTransfer ) ||
( TARGET.FileSystemDomain == MY.FileSystemDomain ) )
Suggestions:
Condition Machines Matched Suggestion
--------- ---------------- ----------
1 ( TARGET.Memory >= 51200 ) 15
2 ( TARGET.Cpus >= 10 ) 29
3 ( TARGET.Arch == "X86_64" ) 34
4 ( TARGET.OpSys == "LINUX" ) 34
5 ( TARGET.Disk >= 42 ) 34
6 ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "ddpsc.org" ) ) 34
- Requested resources totally unavailable. This job was held ("H") due to lack of resources; HTCondor offered a suggestion so that the job could run.
[jobID]: Run analysis summary. Of 27 machines,
27 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
0 are available to run your job
No successful match recorded.
Last failed match: Wed Apr 20 11:29:19 2016
Reason for last match failure: no match found
WARNING: Be advised:
No resources matched request's constraints
The Requirements expression for your job is:
( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer )
Suggestions:
Condition Machines Matched **Suggestion**
--------- ---------------- ----------
1 ( TARGET.Memory >= 102400 ) 0 **MODIFY TO 54884**
2 ( TARGET.Cpus >= 20 ) 5
3 ( TARGET.Arch == "X86_64" ) 27
4 ( TARGET.OpSys == "LINUX" ) 27
5 ( TARGET.Disk >= 400000 ) 27
6 ( TARGET.HasFileTransfer ) 27
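One way to act on the analyzer's suggestion is to lower the request in the submit file and resubmit. A hypothetical fragment, using the suggested value (with no unit suffix, `request_memory` is interpreted in MB):

```
# Submit-file sketch: reduce the memory request so at least one
# machine can match the job. Adjust to your job's real needs.
request_memory = 54884
request_cpus   = 20
```

For a job already held in the queue, `condor_qedit [jobID] RequestMemory 54884` followed by `condor_release [jobID]` makes the same change in place.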
## HTCondor system usage by group/user

`condor_userprio`

In the example below, user nfahlgren from group jcarrington has used 0.02 weighted CPU-hours of computing time.
Last Priority Update: 3/3 23:00
Group Config Use Effective Priority Res Total Usage Time Since Requested
User Name Quota Surplus Priority Factor In Use (wghted-hrs) Last Usage Resources
------------------------ --------- ------- ------------ --------- ------ ------------ ---------- ----------
group_jcarrington 0.30 ByQuota 1000.00 0 0.02 <now> 1
nfahlgren@ddpsc.org 500.24 1000.00 0 0.02 <now>
------------------------ --------- ------- ------------ --------- ------ ------------ ---------- ----------
Number of users: 1 ByQuota 0 0.02 1+00:00
`condor_userprio` shows a limited amount of history by default. If you want to summarize your user and group usage over a longer period, use `condor_userprio -allusers`.
## Optimizing resource usage and job performance
Understanding a program's needs and its actual usage of system resources allows you to optimize your job requests. CPU usage is a good example: many programs have a `--threads` or `--CPU` option, but actual usage can vary, resulting in users over-requesting CPUs that then sit reserved but idle. Meanwhile, under-requesting scratch disk or RAM relative to the job's real usage can result in a server becoming overloaded and jobs being killed by the system.
To evaluate the requirements of a job, particularly a job that will be run repeatedly, a good approach is to run one instance of the job and monitor the resource usage. For scratch disk and RAM usage, the log file produced by HTCondor gives the information you need. At the end of the log file, once the job is complete, you will find the following:
(1) Normal termination (return value 0)
Usr 0 00:44:19, Sys 0 00:04:07 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:44:19, Sys 0 00:04:07 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
3943671552 - Run Bytes Sent By Job
639678720 - Run Bytes Received By Job
3943671552 - Total Bytes Sent By Job
639678720 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 8 8
Disk (KB) : 4475941 10485760 10716960
Memory (MB) : 166 500 500
In this example:

- The job finished with a return value of 0 (normal termination).
- 8 CPUs were requested (and "used").
- 10 GB of scratch disk was requested and 4.3 GB was used.
- 500 MB of RAM was requested and 166 MB was used.
- If this job were part of a set, one might want to maximize the number of jobs that can run simultaneously. Possible modifications to the submit file: request less scratch disk (6 GB) and less RAM (250 MB). It is still a good idea to include a buffer, since not all of the jobs will have identical usage.
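To pull that summary out of a finished job's log without scrolling, `grep` with a few lines of trailing context works well. A minimal sketch using sample log text; on the cluster you would run `grep -A 3 'Partitionable Resources' myjob.log`, where `myjob.log` is whatever log file your submit file names:

```shell
# -A 3 prints the matching line plus the three lines after it.
grep -A 3 'Partitionable Resources' <<'EOF'
Partitionable Resources : Usage Request Allocated
   Cpus : 8 8
   Disk (KB) : 4475941 10485760 10716960
   Memory (MB) : 166 500 500
EOF
```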
CPU Usage

The log file only reports the number of CPUs requested, because that is the usage as far as the system is concerned: those CPUs are checked out for your request and are thus unavailable to other users. To determine actual CPU usage, a within-system tool is `condor_fullstat`, which reports the CPU usage of a running program. It can be run repeatedly over the course of a job, with the user monitoring the actual percentage of CPUs used. For example:
For example:
ID Owner Host CPUs CPUs (%) Memory (GB) Memory (%) Disk (GB) Disk (%) Run Time Cmd
----- ---------- ------------------ ------ ---------- ------------- ------------ ----------- ---------- ----------- ---------
40655 xxxxxxxxxx scorpia.ddpsc.org 40 2.5 2.9 2.9 0 100 19:21:28:54 job.sh
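The "CPUs (%)" column is relative to the number of CPUs requested, so 2.5% of a 40-CPU request is a single busy CPU. A quick check of that arithmetic:

```shell
# CPUs actually in use = requested * (percent / 100)
awk 'BEGIN { printf "%.1f\n", 40 * 2.5 / 100 }'   # prints 1.0
```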
- Job 40655 requested 40 CPUs and has been running for more than 19 days. However, repeated checks of `condor_fullstat` show that the job only used all 40 CPUs during an early step, and has since been using 2.5% (about 1 CPU) for the majority of its run. If this program were re-run, it may be worth reducing the CPU request, or modifying the program itself.
- Another approach is searching the internet: many common bioinformatics programs have papers, blog posts, etc., that report CPU usage efficiency. `bwa` is a good example, where requests with 8 CPUs do not have 2x improved performance over 4 CPUs. Therefore, if you have, say, 50 `bwa` jobs to run, the fastest way to get them done is to request 4-8 CPUs each. More jobs can then run simultaneously, and even if each takes a bit longer, the whole batch finishes sooner, letting you continue your analyses.
- Intel White Paper for GATK Pipeline
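For the `bwa` case above, the submit file's CPU request and the program's thread option should agree. A hypothetical submit-file sketch (the reference `ref.fa` and read file `reads.fq` are placeholders, and the memory value is illustrative):

```
# One of many bwa jobs: a modest CPU request matched to bwa's -t option,
# so more jobs can run at once.
executable     = /usr/bin/bwa
arguments      = mem -t 4 ref.fa reads.fq
request_cpus   = 4
request_memory = 8192
output = bwa_$(Process).sam
error  = bwa_$(Process).err
log    = bwa_$(Process).log
queue
```

Here `output` captures `bwa mem`'s SAM output from stdout, and `$(Process)` gives each queued job its own files.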