Ramblings on IT and Physics: Slurm and health check

For torque we used to run every 30 minutes a cronjob that checked if the node is working properly and if not it disabled them. With slurm I finally took the time to look for a way to have slurm automatically do it. Discovered it was extremely easy. You just need to add two config lines:

HealthCheckProgram=/home/test/testNode.sh
HealthCheckInterval=300

Now slurm runs every 5 minutes the health check program and if it gets stuck it's killed within 60s. The script has to perform a check and if a check fails it's got to take care of fixing it or disabling the node. It's done fairly simply. For example we check the presence of /hdfs directory for access to storage and if not ask slurm to drain the node:

# Test HDFS
NHDFS=`ls /hdfs|wc -l`
if [ $NHDFS -eq 0 ]; then
scontrol update NodeName=$NODE State=drain Reason="HDFS mount lost"
fi

You can add pretty much any check you want. The result is that sinfo nicely shows the drained nodes with reasons:

[root@slurm-1 ~]# sinfo -lN
Tue Dec 4 16:39:01 2012
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
wn-v-[2036,..,5384] 8 main* allocated 32 2:16:1 65536 0 1 (null) none
wn-v-[2072,..,7039] 19 main* drained 24+ 2:12+:1 49152+ 0 1 (null) HDFS mount lost
wn-v-[2324,..,6428] 6 main* idle 32 2:16:1 65536 0 1 (null) none

As you can see nodes that have lost the mount are automatically disabled and drained to be taken care of by you at some point.

Ramblings on IT and Physics

Tuesday, December 4, 2012

Slurm and health check

No comments:

Post a Comment