Monday, May 7, 2012

debugging nagios remote nrpe commands

Nagios NRPE debugging steps:
  1. run the command manually on the target host
  2. enable debugging in nrpe.cfg and watch syslog
  3. dig in deeper with debug jobs.
Debugging nagios remote nrpe commands can feel very opaque. Normally I find my issue in step 1 of the debug escalation. Today I had to hit all three steps while debugging a test that wrapped around s3cmd. I eventually found that HOME is not set in the environment used to run nrpe commands.
check_nrpe -H 10.7.202.92 -c check_ui_s3_backup                                 
NRPE: Unable to read output
I run my check remotely and receive the dreaded general error "unable to read output." This means the script failed to run and didn't produce any output to STDOUT. STDERR seems to be ignored, even with logging enabled.

Step 1a: go to the server and verify the command being run by nrpe.

[andrew@ip-10-7-202-92]% grep check_ui_s3_backup /etc/nagios/nrpe.d/herbie.cfg
command[check_ui_s3_backup]=HOME=~postgres /usr/lib/nagios/plugins/herbie/check_ui_s3_backup
Step 1b: run the command manually. Here I find that the script fails if I don't have the config file:
[andrew@ip-10-7-202-92]% /usr/lib/nagios/plugins/herbie/check_ui_s3_backup
ERROR: /home/andrew/.s3cfg: No such file or directory
ERROR: Configuration file not available.
ERROR: Consider using --configure parameter to create one.
That should be a simple fix. Find the user running the nrpe command and give them a .s3cfg. Easy-Peasy.
cp .s3cfg ~nagios/
sudo -u nagios -H /usr/lib/nagios/plugins/herbie/check_ui_s3_backup
OK - Last backup 0 days ago.
Ok, it works locally. Recheck it remotely. It fails?!!?! This is where we start the gnashing of teeth and pulling of hair
[andrew@ip-10-7-203-10]% check_nrpe -H 10.7.202.92 -c check_ui_s3_backup
NRPE: Unable to read output
Step 2: Enable logging in nrpe.cfg, run the remote check and inspect the logs. Surprise, nothing useful.
May  7 19:08:29 ip-10-7-202-92 nrpe[17159]: Connection from 10.7.203.10 port 15286
May  7 19:08:29 ip-10-7-202-92 nrpe[17159]: Host address is in allowed_hosts
May  7 19:08:29 ip-10-7-202-92 nrpe[17159]: Handling the connection...
May  7 19:08:29 ip-10-7-202-92 nrpe[17159]: Host is asking for command 'check_ui_s3_backup' to be run...
May  7 19:08:29 ip-10-7-202-92 nrpe[17159]: Running command: /usr/lib/nagios/plugins/herbie/check_ui_s3_backup
May  7 19:08:29 ip-10-7-202-92 nrpe[17159]: Command completed with return code 3 and output: 
May  7 19:08:29 ip-10-7-202-92 nrpe[17159]: Return Code: 3, Output: NRPE: Unable to read output
May  7 19:08:29 ip-10-7-202-92 nrpe[17159]: Connection from 10.7.203.10 closed.
Step 3: debug jobs (aka printf aka "hail marry" debugging). I create two new nrpe entries and restart nagios-nrpe-server. The first will show me the user running the command and the second will show the environment, using whoami and env respectively.
command[check_ui_test]=whoami
command[check_ui_test2]=env
[andrew@ip-10-7-203-10]% check_nrpe -H 10.7.202.92 -c check_ui_test
nagios
[andrew@ip-10-7-203-10]% check_nrpe -H 10.7.202.92 -c check_ui_test2
NRPE_PROGRAMVERSION=2.12
TERM=screen-256color-bce
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
LANG=en_US.UTF-8
NRPE_MULTILINESUPPORT=1
PWD=/
Yep, the tests are running as the expected user, nagios.

Holy Schmoly! Look at that environment! HOME is not set. Simple enough to fix for my check, but wow was I not expecting that. Also useful to note the minimal PATH and that the working directory is /.

Update check to explicitly set HOME in the environment and restart nrpe:

command[check_ui_s3_backup]=HOME=~nagios /usr/lib/nagios/plugins/herbie/check_ui_s3_backup
Restart nrpe:
[andrew@ip-10-7-202-92]% sudo service nagios-nrpe-server restart
 * Stopping nagios-nrpe nagios-nrpe
   ...done.
 * Starting nagios-nrpe nagios-nrpe
   ...done.
Check:
[andrew@ip-10-7-203-10]% check_nrpe -H 10.7.202.92 -c check_ui_s3_backup
OK - Last backup 0 days ago.

11 comments:

Anonymous said...

Man, you just save me, i was going man, i just did not check this f.....g $ENV{"HOME"}, was so sure it was correct.... thx again

Anonymous said...

Man, you just save me, i was going man, i just did not check this f.....g $ENV{"HOME"}, was so sure it was correct.... thx again

Priyesh Gaikwad said...

Hi Im getting same error "NRPE: Unable to read output"

output after adding "command[check_ui_test2]=env" is as follows.

HOSTNAME=db-1
TERM=vt100
SHELL=/bin/bash
HISTSIZE=1000
NRPE_PROGRAMVERSION=2.13
SSH_CLIENT=192.168.133.32 59785 22
SSH_TTY=/dev/pts/0
USER=root
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01

Andrew Grangaard said...

Priyesh,

what is the full nagios command are you are hoping to run ( the one returning "NRPE: Unable to read output" ) ? What happens if you run that command directly? Does it require HOME to be set?

Michael said...

I've a similar problem on RHEL. The environment returned by

command[check_env]=printenv

is

# /usr/lib64/nagios/plugins/check_nrpe -H localhost -c check_env
TERM=xterm
NRPE_PROGRAMVERSION=2.13
PATH=/sbin:/usr/sbin:/bin:/usr/bin
_=/usr/bin/printenv
PWD=/
LANG=de_DE.UTF-8
SHLVL=4
NRPE_MULTILINESUPPORT=1

No complex checks (e.g. informix 'onstat', 'dbaccess', ...) are running within the environment.

Any suggestions ?

Michael said...

After hours of recherche, I found out, that SELinux was the reason of that strange behavior. Switching off SELinux solved the problem.

Ed Greenberg said...

The absence of stderr is the most difficult part of this. I wasn't sure it would work, but I put 2>&1 on the end of my command (in nrpe.d/command.cfg) and started getting output when I ran check_nrpe from the nagios server.

After that, the resolution of the problem was simple.

Stefan Freitag said...

Got stuck at a similar problem and spent hours in debugging this NRPE thing.
Thank you for this post - it helped me to solve the problem :)

Danny Howard said...

Holy crud thank you for helping me debug my bacon.

Anonymous said...

I turned on debug and found "timeout" wasn't installed on this one server. Thanks!

spiffikins said...

" I wasn't sure it would work, but I put 2>&1 on the end of my command (in nrpe.d/command.cfg) and started getting output when I ran check_nrpe from the nagios server."

GENIUS! Thanks so much - that helped me immensely in tracking down my issue. My nrpe command ran perfectly fine when I ran it manually, but not when it ran through the nagios scheduler.

Turned out that when I captured stderr, I was getting
"sudo: sorry, you must have a tty to run sudo"

Once I resolved that issue, my command worked fine from nagios.