
I have been spending a lot of time lately with the handy NetApp Data ONTAP 8.1.1 cluster mode simulator, including getting a lab set up at home. It's a fantastic product and stands to be genuinely disruptive to the way enterprises manage and think about storage, but all of that has been covered over and over again by people more articulate on the messaging than I am. Since I already monitor all of my personal IT infrastructure with Nagios, I wanted to monitor my cluster (even though it is a simulator) as well.
For my 7G filer I have been using check_naf.py for a while now, and it works splendidly. It checks the health of my various SnapVault relationships, volumes, and the filer itself with relative ease, but since the fundamental architecture of cluster mode is different, check_naf sadly doesn't work there.
I initially thought about tracking down and reworking check_naf to work with cluster mode, but it became pretty obvious that wasn't really feasible, so I set out to start poking at the NetApp Manageability SDK instead. They provide an external API for all management of our arrays, the same API we utilize in our OnCommand Unified Manager and System Manager products, so it should be pretty straightforward to use it to check and report on the health of the cluster... right?
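For context, the Manageability SDK is essentially a friendly wrapper around ZAPI calls: XML request documents POSTed over HTTP(S) to the array. Here is a rough sketch of what one of those request envelopes looks like, built with nothing but the Python standard library. The API name, version number, and the servlet path mentioned in the comment are from memory and worth double-checking against the SDK docs:

```python
import xml.etree.ElementTree as ET

def build_zapi_request(api_name, version="1.15"):
    """Build the XML envelope for a single ZAPI call.

    The version attribute is illustrative; the SDK negotiates
    the real one for you.
    """
    root = ET.Element("netapp", {
        "xmlns": "http://www.netapp.com/filer/admin",
        "version": version,
    })
    # The API call itself is just a child element named after the call.
    ET.SubElement(root, api_name)
    return ET.tostring(root, encoding="unicode")

payload = build_zapi_request("system-get-version")
# The SDK POSTs something like this to the array's XML request
# servlet and parses the XML response back into native objects,
# which is why a check script comes together in an afternoon.
print(payload)
```

In practice you never build these by hand; the SDK's bindings construct the envelope, handle authentication and transport, and hand you the parsed response.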
Well, it turns out that, flying in the face of convention, it really wasn't that hard. I grabbed the SDK and within a few hours had a working Nagios check script reporting on the general health of the cluster, volumes, disks, and SnapMirror DP relationships. You can take a look at check_nac.py in my git repository here. It should be pretty straightforward to use; I tried to write a self-explanatory usage statement for it and follow the Nagios conventions for command-line options, but for your convenience, Dearest Lazyweb, here are some example stanzas out of my Nagios config files:
define command {
    command_name check_nac_vols
    command_line /usr/local/bin/check_nac.py -H $HOSTADDRESS$ -u admin -p mysekritpassword -t vols -w 80 -c 90
}
define command {
    command_name check_nac_nodes
    command_line /usr/local/bin/check_nac.py -H $HOSTADDRESS$ -u admin -p mysekritpassword -t nodes
}
define command {
    command_name check_nac_disks
    command_line /usr/local/bin/check_nac.py -H $HOSTADDRESS$ -u admin -p mysekritpassword -t disks
}
and
#
# Check cluster piper
#
define service {
    host_name             piper.lab.ub3rgeek.net
    service_description   PIPER_VOL
    check_command         check_nac_vols
    use                   generic-service
    notification_interval 0
}
define service {
    host_name             piper.lab.ub3rgeek.net
    service_description   PIPER_NODES
    check_command         check_nac_nodes
    use                   generic-service
    notification_interval 0
}
define service {
    host_name             piper.lab.ub3rgeek.net
    service_description   PIPER_DISKS
    check_command         check_nac_disks
    use                   generic-service
    notification_interval 0
}
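For anyone rolling their own checks: a Nagios plugin communicates state purely through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus a one-line status message on stdout. A minimal sketch of how -w/-c thresholds like the 80/90 above might map to those states; this is illustrative of the convention, not check_nac.py's actual code:

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def usage_state(pct_used, warn=80, crit=90):
    """Map a volume's used-space percentage to a Nagios state.

    In a real check the percentage would come from the array via
    the Manageability SDK; here it is just a number.
    """
    if pct_used >= crit:
        return CRITICAL, "CRITICAL"
    if pct_used >= warn:
        return WARNING, "WARNING"
    return OK, "OK"

code, label = usage_state(85)
print("VOL %s - volume is %d%% full" % (label, 85))
# A real plugin would finish with: sys.exit(code)
```

Whatever the check logic, as long as the exit code follows that convention Nagios will schedule, escalate, and notify on it just like any stock plugin.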
I'd love to know if anyone out there finds this useful, and I'd be happy to add additional checks as time permits if there are other things you might be interested in monitoring.