Monitoring Hydra - Hydra Connect 2016

Hydra Connect 2016, Boston

October 6, 2016


Note: Some monitoring topics were also touched on in the earlier DevOps Group Therapy session (see https://wiki.duraspace.org/display/hydra/DevOps+Group+Therapy)


Stanford:

  • Erin Fahey (the Queen of Puppet Mountain) reported that Stanford is starting to use the OK Computer gem (https://github.com/sportngin/okcomputer) for monitoring; a minimal setup sketch follows this list.

  • Getting better at documenting necessary dependencies vs. optional dependencies.

  • Learning how to best generate alerts.

  • Started using OpsGenie for their on-call rotation.  
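
For context, a minimal OK Computer setup for a Hydra app might look roughly like the sketch below (a hypothetical config/initializers/okcomputer.rb; the check names and URLs are placeholders, not Stanford’s actual configuration).  The make_optional call is one way to express the required-versus-optional dependency distinction mentioned above.

    # config/initializers/okcomputer.rb -- hypothetical checks and placeholder URLs
    # Built-in check of the ActiveRecord database connection (e.g. PostgreSQL).
    OkComputer::Registry.register "database", OkComputer::ActiveRecordCheck.new

    # HTTP checks against backing services; the URLs are placeholders.
    OkComputer::Registry.register "solr",
      OkComputer::HttpCheck.new("http://localhost:8983/solr/admin/ping")
    OkComputer::Registry.register "fedora",
      OkComputer::HttpCheck.new("http://localhost:8080/fedora/rest")

    # Optional dependencies report their status without failing the overall check.
    OkComputer.make_optional %w(solr)

The registered checks are then served over HTTP (by default under /okcomputer), which Nagios or another external monitor can poll.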


Fedora discussion:

  • Stanford uses a certificate to allow Nagios to interact with Fedora.

  • Some just look for a recent update to the Fedora log to make sure it’s up and logging.

  • Nagios is being used to monitor Fedora ports.

  • The JVM itself can be monitored, but that provides limited benefit.

  • The group agreed that there should be a better way to monitor Fedora’s health.  What would we like to see on Fedora’s /status page?  (One possible application-level check is sketched after this list.)

  • Monitoring PostgreSQL as a Hydra backend would be a lot simpler than monitoring Fedora.  Many in the room know how to monitor PostgreSQL, but far fewer are familiar with monitoring Fedora’s health.

  • Several institutions reported that they have Tomcat set to restart every night to deal with memory leaks (not necessarily in Fedora itself).
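
No one proposed a concrete answer to the /status question, but as a sketch, an application-level check that exercises Fedora’s REST API from inside the app could look roughly like the custom OK Computer check below.  The class name, base URL, and response handling are assumptions for illustration, not a tested recipe.

    require "net/http"

    # Hypothetical OK Computer check that exercises Fedora 4's REST endpoint.
    class FedoraCheck < OkComputer::Check
      def initialize(base_url = "http://localhost:8080/fedora/rest")
        @base_url = base_url
      end

      def check
        response = Net::HTTP.get_response(URI(@base_url))
        if response.is_a?(Net::HTTPSuccess)
          mark_message "Fedora responded with HTTP #{response.code}"
        else
          mark_failure
          mark_message "Fedora returned HTTP #{response.code}"
        end
      rescue StandardError => e
        mark_failure
        mark_message "Fedora unreachable: #{e.message}"
      end
    end

    OkComputer::Registry.register "fedora_rest", FedoraCheck.new

The PostgreSQL side is simpler by comparison: the built-in ActiveRecordCheck in the earlier sketch already covers a Postgres-backed app.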


Monitoring is getting complicated enough that even our monitors need monitoring.


Documentation:

  • Carolyn Ann Cole shared that Penn State is using Nagios for monitoring.  They are building documentation on how to respond to specific alerts.  The best time to create documentation is while you are responding to an alert.

  • It’s a good idea to have a link to your documentation embedded in the alerts themselves.

  • Some are using GitHub or Bitbucket repos just to host documentation.  Some are keeping credentials and other sensitive info in private repositories.  Cincinnati uses private GitHub Enterprise repositories that can only be reached on campus or via VPN.

  • Stanford also keeps operations concerns in their documentation.  “Weird things about this app…” etc.


Alerts:

  • Stanford wants all their alerts to go through OpsGenie.  Some alerts go to the dev team and some to the ops team.  They are moving away from mass emails, because no one feels responsible for an alert that goes to everyone.  OpsGenie keeps track of who is on call.

  • A Stanford workflow is starting to take shape: each alert has a response-time window.  High-level alerts require action 24/7; low-level alerts only require action during work hours.  This also helped them identify which of their services were more important (the ones with “high” status).  A sketch of an alert payload with a priority and a runbook link follows this list.

  • Stanford service owners are also alerted because it’s their responsibility to send out service outage messages to users, etc.

  • Critical alerts may also be an indication that load balancing or redundancy is needed for those services so they are less likely to be down.
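
As a rough sketch of what pushing such an alert could look like from a script or check handler, the example below sends an alert to OpsGenie with a priority and a runbook link in the description (per the documentation discussion above).  The endpoint, header, and field names are assumptions based on OpsGenie’s REST Alert API, and the API key, message, and URLs are placeholders.

    require "net/http"
    require "json"
    require "uri"

    # Assumed OpsGenie Alert API endpoint; API key and URLs are placeholders.
    uri = URI("https://api.opsgenie.com/v2/alerts")

    request = Net::HTTP::Post.new(uri)
    request["Content-Type"]  = "application/json"
    request["Authorization"] = "GenieKey YOUR_API_KEY"
    request.body = {
      message:     "Fedora health check failing on production",
      description: "Runbook: https://wiki.example.edu/runbooks/fedora-down",
      priority:    "P1"  # high level: requires action 24/7
    }.to_json

    response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
      http.request(request)
    end
    puts "OpsGenie responded #{response.code}"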