Updated about 3 years ago by Jack Aponte

Icinga response procedure

When an Icinga notification comes in, the person on call should:

  • forward the notification email to Redmine to create a ticket, and assign it accordingly
  • acknowledge the notification with a link to that ticket
  • deal with the ticket

Icinga response cheat sheet

Backupninja errors: Handled by the tech support person on call. Backupninja reporting currently lags by several hours between a backup completing successfully and the error being cleared. SSH into the server in question and review the latest entries in /var/log/backupninja.log; you may find that the error is no longer valid. If it is still valid, the error messages in the log will point toward a resolution.
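
As a rough aid for the log review described above, the following Python sketch pulls only the problem-level lines out of the tail of the log. The severity labels (Warning, Error, Fatal), the function name recent_problems, and the default tail length are assumptions for illustration; check them against what the log on the server actually contains.

```python
import re

# Assumed severity labels; adjust to match the entries in your backupninja.log.
PROBLEM_LEVELS = ("Warning", "Error", "Fatal")


def recent_problems(log_text, tail=200):
    """Return the lines among the last `tail` lines that carry a problem label."""
    pattern = re.compile(r"\b(%s):" % "|".join(PROBLEM_LEVELS))
    lines = log_text.splitlines()[-tail:]
    return [line for line in lines if pattern.search(line)]
```

On the server, this could be fed the contents of /var/log/backupninja.log (for example via open(...).read()). If the most recent run shows no problem lines, the Icinga alert is probably just the delayed report described above.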

EvtQuery failed: 1717 unknown: Attempt to wake the computer up. This usually resolves itself (see #26702). Acknowledge the alert with an expiration date 2 days out.

CRITICAL: CRITICAL: Service results are stale: This means the computer hasn't been turned on in a couple of weeks. Attempt to wake the computer up, or just silently acknowledge the alert.

MSE Scan CRITICAL: No entries found: Run a scan using ptr -e mse_scan, ptr -e mse_scan2, or ptr -e wd_scan. The three commands work for machines of different ages and are listed here from oldest to newest.

MSE Update CRITICAL: No entries found: Run an update using ptr -s, then go to the MSE or Windows Defender directory under Program Files and run MpCmdRun.exe -SignatureUpdate.

MSE Virus Check CRITICAL: Make a ticket, run a full scan using ptr -e mse_scan_full (or mse_scan_full2 or wd_scan_full), then contact the user to see if they noticed anything.

Drupal Service Check Time Outs: For commonbound, nec, new, pt and wrl, these can be ignored. This is a known MFPL shared server issue, and the check should recover on its own within a few minutes.

Local Backups: This check tests the restorability of local backups, so it is always looking at a backup from the past. It will always report critical on the day a client switches a rotating USB drive, but it should recover in less than 3 days, and it only sends an email alert if it has been critical for more than 3 days. If it comes back as critical once, acknowledge the alert with an expiration date 3 days out.
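
The acknowledgement windows on this cheat sheet (2 days for EvtQuery 1717, 3 days for Local Backups) can be turned into concrete expiry dates with a small helper. This is only a sketch: the dictionary keys and the function name ack_expiry are hypothetical labels, and the result is meant to be typed into whatever acknowledgement form you normally use.

```python
from datetime import datetime, timedelta

# Expiry windows taken from the cheat sheet; the keys are hypothetical labels.
ACK_EXPIRY_DAYS = {
    "evtquery_1717": 2,
    "local_backups": 3,
}


def ack_expiry(alert_kind, now):
    """Return the datetime at which an acknowledgement for this alert should expire."""
    return now + timedelta(days=ACK_EXPIRY_DAYS[alert_kind])
```

For example, ack_expiry("local_backups", datetime.now()) gives the moment 3 days from now.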

Remote Backups: Similar to Local Backups, but verifies backups to MFPL. This check often alerts several days after an iz outage or backupninja errors, and it is almost always related to one of those. Acknowledge the alert with a link to the previous related backupninja ticket, or make a new ticket if one doesn't exist.

Warning: Access Control List file not found: This is a red herring. Add --no-acls to the script or backup job that is doing the restore, and you will get an accurate answer on the next run (see #30573).

Blacklist: Verify the results using mxtoolbox, and then make a ticket.
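
For background on what a blacklist lookup involves, DNS-based blacklists are generally queried by reversing the IPv4 octets and appending the list's zone. The Python sketch below shows that convention only; it is not a replacement for the mxtoolbox check, and the function names and any zone you pass in are assumptions rather than something this page specifies.

```python
import ipaddress
import socket


def dnsbl_query_name(ipv4, zone):
    """Build the DNS name to query: reversed IPv4 octets followed by the list's zone."""
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ipv4, zone):
    """Return True if the name resolves; any lookup failure is treated as not listed.

    Some lists answer with special error codes instead of listings, so treat a
    True result as a prompt to verify by hand, not as proof.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ipv4, zone))
    except socket.gaierror:
        return False
    return True
```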

To wake computers: run ptr -e wakeonlan, or for computers at MADRE run ptr -e wakememadre.

If computer doesn't wake: Check OCS to see whether the computer has a specific reason why it can't be woken up. If it doesn't, make a ticket to investigate and update OCS.
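
When investigating a machine that won't wake, it can help to know what a generic Wake-on-LAN request looks like on the wire. The Python sketch below builds and sends a standard magic packet. It is not the implementation of ptr -e wakeonlan or ptr -e wakememadre, whose internals are not described here; the function names, broadcast address, and port are assumptions.

```python
import socket


def magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError("expected a 48-bit MAC address, got %r" % mac)
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the packet over UDP; port 9 is a common convention, not a requirement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

The packet only reaches machines on the same broadcast domain, and the target's firmware and network adapter must have Wake-on-LAN enabled. Those are typical examples of the machine-specific reasons worth recording in OCS.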

Server Puppet status: Flag Jessie and/or make ticket.

Puppet status: This is the status of puppet on Windows Desktops
  • if it's a laptop, we ignore it
  • if it hasn't reported in to puppet, but also hasn't reported in to Icinga or OCS, then we assume it's off, and ignore it
  • if it has failures, it needs attention
  • if it hasn't reported in to Icinga for a critical amount of time, check and see if it still exists and is in use
  • if it has reported in to things other than puppet, but not puppet, it needs attention
    • If the computer hasn't reported in, and you are able to remote in via command line and run puppet, check that the puppet service is working properly.
      • See if service is started: sc query puppet
      • See if service is set to autostart: sc qc puppet
      • Start service if it isn't started: sc start puppet
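
The Puppet status rules above can be read as a small decision function. The Python sketch below assumes the rules are applied top to bottom in the order listed, and that "hasn't reported in to Icinga or OCS" means neither one has heard from the machine. The function name, parameter names, and return strings are all hypothetical.

```python
def puppet_triage(*, is_laptop, reported_puppet, reported_icinga, reported_ocs,
                  has_failures, icinga_silence_critical):
    """Return a suggested action for one Windows desktop's Puppet status alert."""
    if is_laptop:
        return "ignore"
    # Nothing has heard from the machine at all: treat it as powered off.
    if not (reported_puppet or reported_icinga or reported_ocs):
        return "assume off, ignore"
    if has_failures:
        return "needs attention"
    if icinga_silence_critical:
        return "check whether it still exists and is in use"
    # Other systems see the machine, but puppet does not.
    if not reported_puppet:
        return "needs attention: check the puppet service"
    return "no action"
```

The "check the puppet service" outcome corresponds to the sc query, sc qc, and sc start steps listed above.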

Web/CiviCRM Icinga procedure

  1. When a Drupal Icinga alert arrives via email, the person on call for Drupal should claim it by acknowledging the notification in Icinga and adding a comment that they're working on it. If the on-call MSP support person sees an Icinga notification come in before a Drupal team member does, they should help make sure the notification is attended to by someone on the Drupal team ASAP. They should claim it themselves if resolving the notification seems to be within their skill set.
  2. Review the Icinga notification, then proceed with troubleshooting the problem or contacting the client.
    • Clients with Drupal or CiviCRM maintenance or support plans receive an initial 15 minutes of troubleshooting on-contract. If the problem cannot be resolved within those 15 minutes, we should create a new ticket for the issue, explain the situation, and ask them whether they'd like us to proceed with further troubleshooting off-contract.
      • Troubleshooting completed within the initial on-contract 15 minutes should be noted on the ongoing maintenance tracking ticket for the client.
    • Clients without Drupal or CiviCRM maintenance or support plans should be contacted via a new ticket with a brief explanation of the problem and asked whether they'd like us to proceed with further troubleshooting at a rush or standard rate.
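
The billing rules in step 2 can be summarized as a short decision function. This Python sketch is only an illustration of the procedure above: the function name, parameters, and return strings are hypothetical, and the case of a fix that lands after the 15 minute window is not covered by the procedure, so it falls through to the off-contract branch here.

```python
ON_CONTRACT_MINUTES = 15  # initial on-contract troubleshooting window from step 2

ASK_RATE = "new ticket: explain the problem, ask about rush or standard rate"
NOTE_ON_TRACKING = "note the work on the client's ongoing maintenance tracking ticket"
KEEP_GOING = "continue troubleshooting on-contract"
ASK_OFF_CONTRACT = "new ticket: explain the situation, ask about continuing off-contract"


def next_step(has_plan, minutes_spent, resolved):
    """Suggest the next action for a Drupal or CiviCRM alert."""
    if not has_plan:
        return ASK_RATE
    if resolved and minutes_spent <= ON_CONTRACT_MINUTES:
        return NOTE_ON_TRACKING
    if not resolved and minutes_spent < ON_CONTRACT_MINUTES:
        return KEEP_GOING
    return ASK_OFF_CONTRACT
```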

