DevOps Group Therapy
Attendees:
John H. Robinson, IV, UC San Diego
Ryan Rotter, UMich
Alicia Cozine, DCE
Glen Horton
Rob Kaufman
Chad Nelson, Temple
Colin Gross, UMich
Daniel Sanford, CHF
Daniel Peirce
Randall Floyd
Randall Embry, Indiana U
Matthew Barnett, U of Alberta
Alex Dunn, UCSB
Chris Beer, Stanford
Kate Lynch, U Penn
Erin Fahy, Stanford
DISCUSSION:
Do you restrict devs from touching prod machines?
Some folks offer read-only access to look at log files, state, etc., or use other tools like Splunk to meet these needs. Automated and continuous deployment may also help; the end goal is for NOBODY to log into prod machines. Containerization helps with this too. Make production predictable and automated. Do all debugging on pre-production environments. Data is an issue: if you're only using staging for debugging, how do you get access to all the objects? Mirroring is problematic with multiple heads on one Fedora. Restore backups to staging (which also serves as a test of disaster recovery). Pair up for all ssh-access activities, to make sure nobody fat-fingers an `rm -rf` or something.

A way to secure production servers without dividing into teams/silos: use tools that serve everyone's needs (Splunk, Capistrano, etc.) to give devs access to the information they need. The ideal goal is that nobody sshes onto production servers; all maintenance is done with tools.

Centralized logging has challenges: in Logstash, stack traces sometimes come in line by line and are hard to see together. A killer feature for centralized logging is "show me the logs for a single box."
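
One common workaround, sketched below with only the Python standard library (the logger name and field names are illustrative, not anything agreed on the call): have the application emit each record, traceback included, as a single JSON line, so the shipper ingests the whole stack trace as one event instead of one event per line.

```python
# Hypothetical sketch: emit each log record as one JSON line so Logstash/ELK
# ingests a multi-line traceback as a single event.
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each log record, including any traceback, as one JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # formatException returns the multi-line traceback; embedding it in
            # a JSON string keeps it inside a single log line/event.
            payload["traceback"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonLineFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
log = logging.getLogger("myapp")

try:
    1 / 0
except ZeroDivisionError:
    log.exception("background job failed")  # traceback travels as one JSON event
```
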
Kafka messaging queue: Erin is looking into this. It could give you a "data backbone" that includes logs and also performance stats, which would make tracing the causes of OOM errors much simpler. It's in early beta at Stanford, but ultimately it will replace the way the ELK stack gets populated and give a broader context. Setup seems complicated, but it looks like a transformational tool.
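
For illustration, a minimal sketch of what publishing to that backbone might look like from an application, assuming the kafka-python client; the broker address and topic names are made up, and Stanford's actual setup was not described on the call.

```python
# Hypothetical "data backbone" sketch: applications publish log events and
# performance stats to Kafka topics, and the ELK stack (or anything else)
# consumes them downstream.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.example.edu:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A log line and a performance sample share the same backbone, just different topics.
producer.send("app-logs", {"ts": time.time(), "level": "ERROR", "msg": "OutOfMemoryError"})
producer.send("app-metrics", {"ts": time.time(), "host": "prod-web-01", "heap_used_mb": 3900})

producer.flush()  # block until both events are actually delivered to the broker
```
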
What are folks using for monitoring?
Some use Dynatrace, others New Relic, for performance profiling/monitoring; there is also an AWS tool. These are paid options. With the AWS tool you can't configure the persistence of the logging. Site24x7 is another integrated solution: solid and reliable, and the default settings are useful. About half the group is using Nagios. Others use email, but that generates too many false positives. OSSEC for logging, security, and monitoring. WhatsUp, handled by a 24/7 operations team. Puppet with Nagios should be easy but isn't. Application monitoring is especially tough; we're looking for devs to define what their systems need to stay happy and healthy.

Options for doing this: use the OKComputer gem (which replaced IsItWorking), an endpoint that defines checks; the applications must have this to run. You can also build a status controller mounted in Rails at mysite.com/status that reports what is wrong. OKComputer is pretty cheap (in resources) to run and very simple; no configuration is really necessary.
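
OKComputer itself is a Rails gem, so as a language-neutral illustration, here is an analogous /status endpoint sketched with the Python standard library (the Solr ping URL and port are placeholder assumptions, not anything from the call); the point is that every dependency check hangs off one URL that monitoring can poll.

```python
# Analogous /status endpoint (not the OKComputer gem itself): each check answers
# "is this dependency happy?", and monitoring only has to poll one URL.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_solr() -> bool:
    """Example dependency check; replace with whatever the app needs to be healthy."""
    try:
        with urllib.request.urlopen("http://localhost:8983/solr/admin/ping", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False


CHECKS = {"solr": check_solr}


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/status":
            self.send_error(404)
            return
        results = {name: check() for name, check in CHECKS.items()}
        body = json.dumps(results).encode("utf-8")
        # 200 only when every check passes, so Nagios/New Relic can alert on the status code.
        self.send_response(200 if all(results.values()) else 500)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), StatusHandler).serve_forever()
```
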
Hydranauts are looking for ways to share systems/devops information the way we already share code. One option is to share Ansible roles and playbooks. Create a reference implementation of the Ansible scripts that can be used to review new apps: if it runs on our reference box, then it's a viable Hydra Head.

Let's set up a "do it right" pattern on HyBox: use defaults as much as possible, and if you diverge, document it and save it. Standardize practices. Testing in Puppet and Ansible would help with this.
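
One possible shape for that testing (an assumption on our part; no specific tool was named on the call) is pytest-testinfra checks run against a box converged by the reference playbooks; the service, package, and port below are placeholders.

```python
# Illustrative infrastructure tests using pytest + testinfra
# (pip install pytest-testinfra), run against a converged reference box, e.g.:
#   pytest --hosts=ssh://reference-box test_reference_box.py
# The service, package, and port names are placeholders, not agreed standards.

def test_tomcat_is_installed_and_running(host):
    tomcat = host.service("tomcat")
    assert tomcat.is_running
    assert tomcat.is_enabled


def test_fedora_port_is_listening(host):
    # Assumes Fedora (the repository) sits behind Tomcat on 8080, as in many dev setups.
    assert host.socket("tcp://0.0.0.0:8080").is_listening


def test_ruby_is_present(host):
    assert host.package("ruby").is_installed
```
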
Announcement: Erin Fahy is now Queen of Puppet Mountain

Nobody using Chef, one using Puppet, many using Ansible.

iLO (Integrated Lights-Out, HP), similar to iDRAC (Dell) or IPMI (IBM): remote management of the console (bare metal). This forced the use of Ansible, since the Puppet client can't work in that limited environment.

Docker, with or without CoreOS, could be what we converge on, though that's solving a different problem. Many Docker images don't do anything with logs (not even rotating them), and they can present security issues.

Look at the Islandora Docker images for a template? They've done Islandora CLAW, but it seemed like they were doing a lot of heavy lifting. HyBox has done Docker images for Solr and Fedora.

Future topics:

Fedora clustering: nobody is in production with this yet; some folks are testing it, and it seems to be working
Data Centers
Web hosting
Getting rid of single points of failure: fine with 32 users, but a problem at larger scale

Reinvigorate the devops interest group calls?
Use the Project Hydra Slack devops channel to propose ideas and confirm times for future calls.