The new version of Chef brings more options, more features and an even better knife status command, which leads to the topic at hand: how to alert on stale Chef nodes using Nagios.
The knife status command displays a brief summary of the nodes on a Chef server:
knife status (options)
With the -H switch it hides the nodes that had a successful Chef run within the past hour and shows, for the remaining nodes, how long ago their last successful run was, e.g.:
knife status -H
20 hours ago, dev-vm.nclouds.com, ubuntu 10.04, dev-vm.nclouds.com, 10.66.44.126
3 hours ago, i-225f954f, ubuntu 10.04, ec2-67-202-63-102.compute-1.amazonaws.com, 67.202.63.102
We can build on the same information to alert on stale nodes, using a small Ruby script and a few settings in Nagios. Let's start with the Ruby script:
#!/opt/chef/embedded/bin/ruby
require 'rubygems'
require 'chef/config'
require 'chef/rest'
require 'chef/search/query'

# Alert thresholds in hours, and the client.rb used to reach the Chef server
critical = 12
warning = 1
Chef::Config.from_file(File.expand_path("/etc/chef/client.rb"))

# Nagios plugin exit codes
OK_STATE = 0
WARNING_STATE = 1
CRITICAL_STATE = 2
UNKNOWN_STATE = 3

# Sanity-check the thresholds before querying the server
if warning > critical || warning < 0
  puts "Warning: warning must not exceed critical or be negative"
  exit(WARNING_STATE)
end

query = Chef::Search::Query.new
all_nodes = []
cnodes = []
wnodes = []

# Fetch every node known to the Chef server
query.search('node', "*:*") do |node|
  all_nodes << node
end

# Sort nodes into critical/warning lists by hours since their last check-in
all_nodes.each do |node|
  hours = (Time.now.to_i - node['ohai_time'].to_i) / 3600
  if hours >= critical
    cnodes << node.name
  elsif hours >= warning
    wnodes << node.name
  end
end

# Report the worst state found, using the matching Nagios exit code
if cnodes.length > 0
  puts "CRITICAL: " + cnodes.join(',') + " did not check in for " + critical.to_s + " hours"
  exit(CRITICAL_STATE)
elsif wnodes.length > 0
  puts "WARNING: " + wnodes.join(',') + " did not check in for " + warning.to_s + " hours"
  exit(WARNING_STATE)
elsif cnodes.empty? && wnodes.empty?
  # The original compared wnodes.join(',') with 0, which is never true
  puts "OK: All nodes are ok!"
  exit(OK_STATE)
else
  puts "UNKNOWN"
  exit(UNKNOWN_STATE)
end
In the script above, a node that has not checked in within the defined 12-hour period is put in the CRITICAL state, and a node that has been silent for at least 1 hour is put in the WARNING state. Nagios picks up that state through the settings described next.
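The threshold logic can be tried without a Chef server. The following is only a minimal sketch: the method name classify_stale_nodes and the input hash (node name mapped to its ohai_time in epoch seconds) are assumptions made for illustration, not part of the script above.

```ruby
# Sketch: bucket nodes by whole hours since their last check-in.
# `last_seen` is a hypothetical hash of node name => ohai_time (epoch seconds).
def classify_stale_nodes(last_seen, now: Time.now.to_i, warning: 1, critical: 12)
  result = { critical: [], warning: [], ok: [] }
  last_seen.each do |name, ohai_time|
    # nil.to_i is 0, so a node that never reported lands in the critical bucket
    hours = (now - ohai_time.to_i) / 3600
    bucket = if hours >= critical
               :critical
             elsif hours >= warning
               :warning
             else
               :ok
             end
    result[bucket] << name
  end
  result
end

# Made-up sample data with a fixed "now" so the result is repeatable
now = 1_700_000_000
sample = {
  'fresh-node' => now - 10 * 60,    # 10 minutes ago
  'late-node'  => now - 3 * 3600,   # 3 hours ago
  'stale-node' => now - 20 * 3600,  # 20 hours ago
  'never-ran'  => nil               # no ohai_time recorded
}
report = classify_stale_nodes(sample, now: now)
report.each { |state, names| puts "#{state}: #{names.join(', ')}" }
```

Passing the current time in as a parameter keeps the comparison easy to test, and the same integer division as in the script means a node only counts as one hour late after a full 3600 seconds.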
Please note that the machine running this check must be able to connect to the Chef server with the configuration in /etc/chef/client.rb, since that is the file the script loads.
Install the script in your Nagios plugins directory, for example:
cp check_chef_nodes.rb /usr/lib64/nagios/plugins/check_chef_nodes.rb
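Before wiring the plugin into Nagios, it may help to run it by hand and confirm which state its exit code maps to. The helper below is a sketch only: run_plugin and NAGIOS_STATES are hypothetical names, and the demo runs a stand-in one-line Ruby command rather than the real plugin so that it works on any machine.

```ruby
require 'open3'
require 'rbconfig'

# Standard Nagios plugin exit codes and their state names
NAGIOS_STATES = { 0 => 'OK', 1 => 'WARNING', 2 => 'CRITICAL', 3 => 'UNKNOWN' }.freeze

# Run a plugin command, capture its combined output, and translate the exit code.
# Any exit code outside 0..3 (or a missing one) is reported as UNKNOWN.
def run_plugin(*command)
  output, status = Open3.capture2e(*command)
  code = status.exitstatus
  { code: code, state: NAGIOS_STATES.fetch(code, 'UNKNOWN'), output: output.strip }
end

# Stand-in for the real plugin: prints a message and exits with code 2
demo = run_plugin(RbConfig.ruby, '-e',
                  'puts "CRITICAL: node1 did not check in for 12 hours"; exit 2')
puts "#{demo[:state]} (exit #{demo[:code]}): #{demo[:output]}"
```

To check the real plugin, the stand-in command would be replaced with the installed path, for example run_plugin('/usr/lib64/nagios/plugins/check_chef_nodes.rb'), run as the same user that Nagios uses.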
Then define the command, host and service in the Nagios configuration like this:
define command {
    command_name    check_chef_node_status
    command_line    $USER1$/check_chef_nodes.rb
}
define host {
    use             linux-server
    contact_groups  admins
    address         127.0.0.1
    host_name       localhost
}
define service {
    use                     local-service ; Name of service template to use
    host_name               localhost
    service_description     Chef Node Health Check
    check_command           check_chef_node_status
    notifications_enabled   0 ; 0 disables notifications; set to 1 to be notified
}
Once everything is configured, restart Nagios and you should see a service monitor named Chef Node Health Check.
That's all for today, folks. You now have an alert for stale nodes on your Chef server and can take steps to keep all your nodes up to date.

