With Torque we used to run a cron job every 30 minutes that checked whether each node was working properly and disabled it if not. With Slurm I finally took the time to look for a way to have Slurm do this automatically, and discovered it was extremely easy. You just need to add two config lines:
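In slurm.conf the relevant parameters are HealthCheckProgram and HealthCheckInterval. A minimal sketch, assuming the script lives at the hypothetical path /usr/local/sbin/node_health.sh and using the 300-second interval implied by the five minutes mentioned here:

```
# slurm.conf (the script path is a placeholder, adjust to your site)
HealthCheckProgram=/usr/local/sbin/node_health.sh
HealthCheckInterval=300
```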
Now Slurm runs the health check program every 5 minutes, and if it gets stuck it is killed within 60 seconds. The script has to perform its checks and, if one fails, either fix the problem or disable the node. This is fairly simple. For example, we check that the /hdfs directory is populated, since that is our access to storage, and if it is not we ask Slurm to drain the node:
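# Test HDFS
# Short hostname; assumed to match the Slurm node name
NODE=$(hostname -s)
# Count the entries under /hdfs; zero means the mount is gone
NHDFS=$(ls /hdfs 2>/dev/null | wc -l)
if [ "$NHDFS" -eq 0 ]; then
    scontrol update NodeName="$NODE" State=drain Reason="HDFS mount lost"
fi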
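If you end up with several checks, it may help to factor the pattern into small functions. The following is only a sketch under assumptions: the function names, the SCONTROL override, and the idea that the short hostname equals the Slurm node name are all mine, not part of our production script.

```shell
#!/bin/bash
# Sketch of reusable health check helpers; all names are illustrative.

# Command used to talk to Slurm; overridable so the logic can be tried off-cluster.
SCONTROL=${SCONTROL:-scontrol}

# Assumption: the short hostname matches the Slurm NodeName.
NODE=${NODE:-$(uname -n)}
NODE=${NODE%%.*}

# Ask Slurm to drain this node, recording the given reason.
drain_node() {
    "$SCONTROL" update NodeName="$NODE" State=drain Reason="$1"
}

# Drain the node if the directory is missing or empty.
# Returns 0 when the directory has content, 1 otherwise.
check_dir_not_empty() {
    local dir=$1 reason=$2
    if [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
        drain_node "$reason"
        return 1
    fi
    return 0
}
```

The HDFS test above would then be a single call at the end of the script: check_dir_not_empty /hdfs "HDFS mount lost". Setting SCONTROL=echo lets you see which command would be issued without touching the cluster.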
You can add pretty much any check you want. The result is that sinfo nicely shows the drained nodes with reasons:
[root@slurm-1 ~]# sinfo -lN
Tue Dec 4 16:39:01 2012
NODELIST            NODES PARTITION STATE     CPUS S:C:T   MEMORY TMP_DISK WEIGHT FEATURES REASON
wn-v-[2036,..,5384] 8     main*     allocated 32   2:16:1  65536  0        1      (null)   none
wn-v-[2072,..,7039] 19    main*     drained   24+  2:12+:1 49152+ 0        1      (null)   HDFS mount lost
wn-v-[2324,..,6428] 6     main*     idle      32   2:16:1  65536  0        1      (null)   none
As you can see, nodes that have lost the mount are automatically drained and stay out of service until you get around to taking care of them.