DevOps Group Therapy

Attendees:

John H. Robinson, IV, UC San Diego

Ryan Rotter, UMich

Alicia Cozine, DCE

Glen Horton

Rob Kaufman

Chad Nelson, Temple

Colin Gross, UMich

Collin Brittle

Daniel Sanford, CHF

Daniel Peirce

Randall Floyd

Randall Embry, Indiana U

Matthew Barnett, U of Alberta

Alex Dunn, UCSB

Chris Beer, Stanford

Kate Lynch, U Penn

Erin Fahy, Stanford

DISCUSSION:

Do you restrict devs from touching prod machines?

Some folks offer read-only access to look at log files, state, etc., or use other tools like Splunk to meet these needs. Automated and continuous deployment may also help: the end goal is for NOBODY to log into prod machines. Containerization helps with this too. Make production predictable and automated. Do all debugging on pre-production environments. This raises issues with data: if you’re only using staging for debugging, how do you get access to all the objects? Mirroring is problematic with multiple heads on one Fedora. One option is to restore backups to staging (which also serves as a test of disaster recovery). Pair up for all ssh-access activities, to make sure nobody fat-fingers an `rm -rf` or something.

A way to secure production servers without dividing into teams/silos: use tools that serve everyone’s needs (Splunk, Capistrano, etc.) to give devs access to the information they need. The ideal is that nobody ssh's onto production servers; all maintenance is done with tools.

Centralized logging has challenges: in Logstash, stack traces sometimes come in line by line and are hard to see together. A killer feature for centralized logging is being able to show the logs for a single box.
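The line-by-line stack trace problem is usually handled by folding continuation lines back into the event they belong to before indexing. The sketch below illustrates the idea only; it assumes each new event starts with an ISO-style timestamp, and the names `EVENT_START` and `group_events` are hypothetical, not part of any logging tool.

```python
import re

# Assumption: every new log event begins with a timestamp such as
# "2000-01-01 12:00:00". Any line that does not (for example a
# stack-trace frame) is treated as a continuation of the previous event.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def group_events(lines):
    """Yield complete log events, one string per event.

    Continuation lines are joined to the event that precedes them, so a
    multi-line stack trace stays together as a single event.
    """
    current = []
    for line in lines:
        line = line.rstrip("\n")
        # A timestamped line closes the event collected so far.
        if EVENT_START.match(line) and current:
            yield "\n".join(current)
            current = []
        current.append(line)
    if current:
        yield "\n".join(current)
```

Doing this grouping where the logs are shipped, rather than at search time, keeps each trace as one searchable record in the central store.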

Kafka message queue: Erin is looking into this. It could give you a “data backbone” that includes logs and also performance stats, making it much simpler to trace the causes of OOM errors. It is in early beta at Stanford, but will ultimately replace the way the ELK stack gets populated. Gives a broader context. Setup seems complicated, but it seems like a transformational tool.

What are folks using for monitoring?

Some use DynaTrace, others New Relic for performance profiling/monitoring, and some the AWS tool; these are paid options. With the AWS tool you can’t configure the persistence of the logging. Site24x7 is another integrated solution: solid and reliable, with useful default settings. About half the group is using Nagios. Others use email, but that generates too many false positives. OSSEC for logging, security, monitoring. WhatsUp, handled by a 24/7 operations team. Puppet with Nagios should be easy but isn’t. Application monitoring is especially tough; we are looking for devs to define what their systems need to stay happy and healthy.

Options for doing this: use the OKComputer gem (which replaced IsItWorking). It provides an endpoint that defines checks, and the applications must have this to run. You can also build a status controller mounted in Rails at mysite.com/status and report what is wrong there. OKComputer is pretty cheap (in resources) to run and very simple; no configuration is really necessary.
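The status-endpoint pattern itself is small enough to sketch. The example below is a Python stand-in for illustration only; it is not OKComputer's API, and the names `CHECKS`, `run_checks`, and `StatusHandler` are hypothetical. It assumes each check is a callable returning an `(ok, message)` pair, and that the endpoint answers 200 when every check passes and 500 otherwise, so a monitor such as Nagios only needs to look at the HTTP status.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical registry of named checks. Each check is a callable that
# returns (ok, message). Real checks would test the database, Solr,
# Fedora, and so on; only a trivial placeholder is registered here.
CHECKS = {
    "default": lambda: (True, "application is running"),
}


def run_checks(checks):
    """Run every check and return (http_status, results_by_name)."""
    results = {}
    for name, check in checks.items():
        try:
            ok, message = check()
        except Exception as exc:
            # A check that raises counts as a failure, not a crash.
            ok, message = False, "check raised %s: %s" % (type(exc).__name__, exc)
        results[name] = {"ok": ok, "message": message}
    status = 200 if all(r["ok"] for r in results.values()) else 500
    return status, results


class StatusHandler(BaseHTTPRequestHandler):
    """Serve the check results as JSON at /status."""

    def do_GET(self):
        if self.path != "/status":
            self.send_error(404)
            return
        status, results = run_checks(CHECKS)
        body = json.dumps(results).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To try it locally (assumed address and port):
#   from http.server import HTTPServer
#   HTTPServer(("127.0.0.1", 8000), StatusHandler).serve_forever()
```

Keeping the checks in the application's own code is what lets devs define what their systems need to stay healthy, while ops monitor a single URL.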

Hydranauts are looking for ways to share systems/devops information the way we already share code. One option is to share Ansible roles and playbooks. Create a reference implementation of the Ansible scripts that can be used to review new apps – if it runs on our reference box, then it’s a viable Hydra Head.

Let’s set up a “do it right” pattern on HyBox: use defaults as much as possible, and if you diverge, document it and save it. Standardize practices. Testing in Puppet and Ansible would help with this.

Announcement: Erin Fahy is now Queen of Puppet Mountain

Nobody using Chef, one using Puppet, many using Ansible.

iLO (Integrated Lights-Out, HP) is similar to iDRAC (Dell) or IPMI (the vendor-neutral standard). Remote management of the console on bare metal forced the use of Ansible, since the Puppet client can't work in that limited environment.

Docker with or without CoreOS could be what we converge on. However, that’s solving a different problem. Many Docker images don’t do anything with logs (including rotating them), and can present security issues.

Look at Islandora Docker images for a template? They’ve done Islandora CLAW, but they seemed like they were doing heavy lifting. HyBox has done Docker images for Solr and Fedora.

Future topics:

Fedora clustering – nobody in production with this yet – some folks testing, seems to be working

Data Centers

Web hosting

Getting rid of single points of failure – fine with 32 users, but a problem

Reinvigorate the devops interest group calls?

Use the devops channel on the Project Hydra Slack to propose ideas and confirm times for future calls.
