Operations Ideals

This is an overview of operations practices that I consider ideal – things that I’d want to have in my ops environment by the time I’d run out of things to do (however unlikely), along the lines of 12-factor 2.0.

  1. Every environment is different – You’ll notice this is the only entry with a number. That’s because I think this one’s the only real certainty. Some of us may not need to concern ourselves with secrets in version control because our repo is self-hosted and its deployment loop is entirely internal. Only you will know which applies to you, so keep that in mind as you read.
  • Redundancy – Avoiding single points of failure is a high priority for a systems engineer. Ideally, redundancy should be present from the lowest levels (PSUs & disk arrays), through multiple servers serving the same role & DB replicas, all the way up to entirely redundant environments (multiple AWS regions, blue-green deployments etc).
  • Staging the app – If we concern ourselves with the user’s experience (we should), staging the applications is a must. If we concern ourselves with the dev’s experience (we should), a dev environment is also a must. Devs need a place to check their work & collaborate, and QA needs a place to test that work in a production-like environment.
    The package deployed to staging should also be the same package deployed to production.
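    One low-tech way to enforce the “same package” rule is to compare checksums before promoting anything. A minimal sketch in Python, assuming the artifacts live at hypothetical paths on the deploy host:

    ```python
    # Refuse a production deploy if the artifact differs from the one that
    # was verified on staging. Paths below are hypothetical.
    import hashlib
    import sys

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    staging_artifact = "/srv/staging/releases/app.tar.gz"     # hypothetical
    production_candidate = "/srv/deploy/incoming/app.tar.gz"  # hypothetical

    if sha256(staging_artifact) != sha256(production_candidate):
        sys.exit("Refusing to deploy: artifact does not match what was staged")
    print("Artifact matches staging; safe to promote")
    ```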
  • Backups – This one can’t really be stressed enough. Backups are the most important thing to have on hand when things go south. Ideally we’ll have at least two entirely separate backup procedures e.g. database dump plus snapshots of the DB on disk, or tarring of dirs we know are important plus VM snapshots. In addition:
    • They need to be checked/tested regularly (no point in having backups we can’t use – the GitLab DB removal incident was a good wake-up call for us all; a basic check is sketched after this list);
    • They should be mirrored offsite in case the disaster is a really bad one; and
    • We need spare capacity on the infrastructure to restore them (in case we’re doing so because we lost some hardware).
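    Even a dumb automated freshness check beats finding out during a restore that the dumps stopped a month ago. A minimal sketch, assuming dumps land in a hypothetical /backups/db directory; a real test would go further and actually restore the dump into a scratch database and query it:

    ```python
    # Alert if the newest database dump is missing, stale, or suspiciously small.
    # The path, age, and size thresholds are hypothetical.
    import glob
    import os
    import sys
    import time

    BACKUP_GLOB = "/backups/db/*.sql.gz"   # hypothetical layout
    MAX_AGE_HOURS = 26                     # daily dumps plus some slack
    MIN_SIZE_BYTES = 10 * 1024 * 1024      # a tiny dump is a red flag

    dumps = sorted(glob.glob(BACKUP_GLOB), key=os.path.getmtime)
    if not dumps:
        sys.exit("ALERT: no database dumps found at all")

    latest = dumps[-1]
    age_hours = (time.time() - os.path.getmtime(latest)) / 3600
    if age_hours > MAX_AGE_HOURS:
        sys.exit(f"ALERT: newest dump {latest} is {age_hours:.1f}h old")
    if os.path.getsize(latest) < MIN_SIZE_BYTES:
        sys.exit(f"ALERT: newest dump {latest} looks truncated")
    print(f"OK: {latest} is {age_hours:.1f}h old")
    ```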
  • Infrastructure as code – Nothing’s more satisfying to me than to behold an environment built in a completely reproducible way. It also vastly increases my comfort making changes and my ability to quickly revert a bad change or re-familiarize myself with components I’ve not looked at in a while. The code will also serve as a form of documentation, and will supplement backups and disaster recovery plans. As with all code and documentation, this should be tracked in version control.
    However, there will always be things that can’t be orchestrated, and steps that need to be taken on fresh systems before orchestration can be used. We need to document the process of bootstrapping new equipment and any other steps that need to be run manually. Some things that might be included here are hardening SSH, checking running services & open ports (disabling/uninstalling anything unnecessary), updates, installing basic utilities, bootstrapping orchestration (dependencies, configuration etc), firewalling, and changing the timezone (UTC, please).
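    As a flavour of what that bootstrap documentation can turn into, here’s a minimal sketch assuming a fresh Debian/Ubuntu host and root access – the package list and the sed edit are illustrative, not a complete hardening checklist:

    ```python
    # Minimal bootstrap for a fresh Debian/Ubuntu host (assumption), run as
    # root before handing the box over to the orchestration tool.
    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["timedatectl", "set-timezone", "UTC"])
    run(["apt-get", "update"])
    run(["apt-get", "upgrade", "-y"])
    run(["apt-get", "install", "-y", "vim", "htop", "curl", "python3"])

    # Disable SSH password auth; key-only access from here on.
    run(["sed", "-i",
         "s/^#\\?PasswordAuthentication.*/PasswordAuthentication no/",
         "/etc/ssh/sshd_config"])
    run(["systemctl", "reload", "ssh"])
    ```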
  • Keep the secrets a secret – We should have a solution in place to allow us complete control over secrets used in the infrastructure and apps. Ensure they’re never checked into version control, but that they’re readily on hand when needed, and can be securely accessed by the systems that use them.
    This applies to secrets used by people, too. A credential management system is incredibly valuable when assigning and/or sharing credentials to or among staff.
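    On the application side, the pattern is simple: look secrets up at runtime from wherever the secret store injects them, and refuse to start without them. A minimal sketch with hypothetical names, assuming secrets arrive as environment variables or files mounted by the orchestrator:

    ```python
    # Secrets come from the environment or a mounted secrets directory,
    # never from the repo. Names and paths here are hypothetical.
    import os
    import sys

    def get_secret(name, file_dir="/run/secrets"):
        # Prefer an injected environment variable...
        value = os.environ.get(name)
        if value:
            return value
        # ...otherwise fall back to a file mounted by the secret store.
        path = os.path.join(file_dir, name.lower())
        if os.path.exists(path):
            with open(path) as f:
                return f.read().strip()
        sys.exit(f"Missing required secret {name}; refusing to start")

    db_password = get_secret("DB_PASSWORD")
    ```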
  • TLS all the things – Unless we’re at pretty high scale and have significant CPU constraints, we’re unlikely to notice the CPU overhead of TLS termination (and if we are, we’ve got bigger problems). If we can, we should have TLS available (preferably enforced) on all services available from the infrastructure to safeguard the data traversing the network. If the private network isn’t entirely under our control (cloud or managed services), TLS should be applied there also. It’s important to check the config with something like SSLTest or testssl.sh, as the out-of-the-box TLS configuration on most server platforms ships with poor defaults in the name of backward compatibility.
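    A scheduled sanity check is also cheap to run between proper scans. A minimal sketch using Python’s standard library, with a hypothetical hostname – it only reports the negotiated protocol and certificate expiry, so it complements rather than replaces testssl.sh:

    ```python
    # Connect to a host, report the negotiated TLS version, and warn if the
    # certificate is close to expiry. Hostname and threshold are hypothetical.
    import socket
    import ssl
    import time

    HOST = "example.com"
    ctx = ssl.create_default_context()

    with socket.create_connection((HOST, 443), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            print("Negotiated:", tls.version())   # e.g. TLSv1.3
            cert = tls.getpeercert()
            expiry = ssl.cert_time_to_seconds(cert["notAfter"])
            days_left = int((expiry - time.time()) / 86400)
            print(f"Certificate expires in {days_left} days")
            if days_left < 21:
                print("WARNING: renew soon")
    ```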
  • Keep the systems up to date – We don’t have to look far for examples of compromise caused by vulnerable, outdated software. System, runtime, and dependency updates should be run as regularly as is reasonable. Where possible keep an eye out for vulnerabilities that might warrant an immediate update (e.g. Ubuntu Security Notices). We should try to arrange for the infrastructure to be capable of receiving routine maintenance without downtime to reduce friction when updates are needed, though this can be costly depending on the architecture of the app.
    If we can, consider having an external vulnerability scan run regularly to catch any vulnerabilities in the applications as well.
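    A small cron-able check can provide the nudge between scheduled maintenance windows. A minimal sketch assuming an Ubuntu/Debian host – matching on “-security” in the apt output is a heuristic, and tools like unattended-upgrades do this more robustly:

    ```python
    # Exit non-zero if security updates are pending so cron/monitoring can alert.
    import subprocess
    import sys

    out = subprocess.run(["apt", "list", "--upgradable"],
                         capture_output=True, text=True, check=True).stdout

    security = [line for line in out.splitlines() if "-security" in line]
    if security:
        print(f"{len(security)} security update(s) pending:")
        print("\n".join(security))
        sys.exit(1)
    print("No pending security updates")
    ```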
  • Keep ourselves up to date – Like any other tech field, things in ops are constantly changing. We should try to keep abreast of new technologies and updated best practices, and keep our cogs turning by regularly reading industry news (HN, Slashdot, Ars, Schneier, LWN), blogs or mailing lists from our vendors/apps/runtimes & other interesting material (e.g. engineering blogs).
  • Monitoring – Monitoring is a big one, because there are many facets to having good coverage. We want a few different types:
    • Broad – We should have something that collects a lot of metrics by default because we can’t always know in advance what we might want to see when issues crop up. Also, the broader the data collection, the more potential issues we can catch before they become problems. Munin is a good example, and it also ships with a lot of sensible alerting thresholds by default. It’s also worth a reminder not to overlook the hardware (PSUs, disk arrays etc) if physical equipment isn’t regularly monitored in person.
    • Narrow – Narrow monitoring is for the things we do know in advance that we want to track and alert on. Something along the lines of Nagios is good for this.
    • Applications – Not all interesting metrics are found at the server level. Indeed, depending on the application, some of the most useful metrics are found in the app itself. We should have a service available for the applications to report metrics to, something along the lines of StatsD (see the sketch after this list). If the runtime offers it, we should pull any interesting metrics out of that too (memory status, time spent in various calls etc).
    • External – Internal monitoring isn’t going to be able to alert us if we lose a router/uplink, or there was a power outage. We should have something outside the environment monitoring the monitors. There are a staggering number of choices here, at wildly varying price points.
    • Alerting and escalation – We need more than one person on alerts so that if one isn’t immediately available the other can get in quick. If we’re a small operation we can just have two on the alerts with the understanding that each will respond if available, but if we’re larger we can have a rotation and escalate if the first line isn’t available (PagerDuty is a popular option for this).
    • System mail – Linux itself likes to alert us if things go wrong. By default though, these alerts just sit on the system and we don’t see them until we next log in. We can forward that mail from the system to our actual email accounts, and ditto for cron mail.
    • Status page – If we have a lot of eager users and they tend to notice when things aren’t running smoothly, we can consider setting up a status page to let them know when issues crop up (StatusPage is a popular, albeit quite pricey, option for this).
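    As a flavour of the application-level reporting mentioned above, here’s a minimal sketch speaking StatsD’s plain-text UDP protocol (name:value|type) directly – the metric names and host are hypothetical, and in practice a client library would normally handle this:

    ```python
    # Report a counter and a timer to a StatsD daemon over UDP.
    import socket
    import time

    STATSD_ADDR = ("statsd.internal", 8125)   # hypothetical host
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def counter(name, value=1):
        sock.sendto(f"{name}:{value}|c".encode(), STATSD_ADDR)

    def timer(name, ms):
        sock.sendto(f"{name}:{int(ms)}|ms".encode(), STATSD_ADDR)

    start = time.monotonic()
    # ... handle a request ...
    counter("app.signup.completed")
    timer("app.signup.duration", (time.monotonic() - start) * 1000)
    ```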
  • Consider outsourcing the real hard/time-consuming stuff – Unless we’re Facebook, we can’t do everything ourselves. Much like we don’t build our own servers, there will be other things we’re not good at too. We can consider outsourcing anything that would occupy too much of our time or needs 100% availability, provided it can be purchased for a reasonable price from a reputable vendor. Production email and DNS are good examples.
  • Be lazy – The ultimate goal is to automate ourselves out of a job (however unlikely we are to eventually succeed). When we find ourselves doing something mindless more than a couple of times, we should consider automating it. Where possible, I find it best to use something central to manage the regular tasks instead of relying on cronjobs etc (Jenkins is a good choice).
  • Secure the public-facing network (and maybe the private one, too) – It’s good security posture to have some way of routinely ensuring the public network isn’t exposing things it ought not to, and alerting us if it is. Routine port scans (nmap is easily automated – see the sketch below), vulnerability scans (there are a few reasonably priced PCI compliance vendors who’ll do the vuln scanning alone), and a Google alert for some known sensitive info are a good start. If the private network isn’t guaranteed private (read: “cloud”) we should apply the same principles there also. Centralized firewalling is great if we can afford it, and network segmentation can be considered where it would be helpful. If we can manage it, the applications can do with some protection also – we can consider a WAF (e.g. mod_security).
    We should also ensure any remote-access endpoints are well protected (solid config on VPNs and SSH etc), and consider how the machines themselves are protected (SELinux, audit logging etc). It’s also worth checking for leaked data, such as verbose logging, revealing headers or error pages on production, and publicly exposed storage.
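    For the routine port scans, a minimal sketch that diffs what nmap actually finds against what we expect to be exposed – the target address and expected ports are hypothetical, and it should be run from outside the network being checked:

    ```python
    # Scan all TCP ports on a target and alert on anything unexpectedly open.
    import subprocess

    TARGET = "203.0.113.10"          # hypothetical public address
    EXPECTED = {22, 80, 443}

    out = subprocess.run(
        ["nmap", "-Pn", "--open", "-p-", "-oG", "-", TARGET],
        capture_output=True, text=True, check=True).stdout

    open_ports = set()
    for line in out.splitlines():
        if "Ports:" in line:
            for entry in line.split("Ports:")[1].split(","):
                port, state = entry.strip().split("/")[:2]
                if state == "open":
                    open_ports.add(int(port))

    unexpected = open_ports - EXPECTED
    if unexpected:
        print(f"ALERT: unexpected open ports on {TARGET}: {sorted(unexpected)}")
    else:
        print(f"OK: only expected ports open: {sorted(open_ports)}")
    ```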
  • Always apply the principle of least privilege – All the crons probably don’t need to run as root, all the users probably don’t need sudo access on all the machines, and we probably don’t need that default superuser with default credentials.
  • Centralize logging – Centralized logging provides multiple benefits – we have all the logs in one place which allows for efficient searching & broad alerting, and also keeps copies off the box in case of a breach. If we find ourselves lacking some detail, we can enable auditd or SELinux in permissive mode (both reasonable steps regardless). The ELK stack is popular here.
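    On the application side, shipping logs off the box can be as simple as pointing the logger at the central collector. A minimal sketch using Python’s stdlib syslog handler and a hypothetical collector host – in practice an agent like Filebeat or rsyslog forwarding usually handles this instead:

    ```python
    # Send application logs to a central syslog collector over TCP.
    import logging
    import logging.handlers
    import socket

    handler = logging.handlers.SysLogHandler(
        address=("logs.internal", 514),    # hypothetical collector
        socktype=socket.SOCK_STREAM)       # TCP, so messages aren't silently dropped
    handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

    log = logging.getLogger("myapp")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    log.info("application started")
    ```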
  • Continuous integration and deployment – If applicable to the environment, we should consider setting up a CI/CD pipeline. Have each commit built and deployed to the dev environment. When it comes time for a production deployment, we should try to keep routine deployments zero-downtime to reduce friction and the negative effects of a bad deploy.
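    The zero-downtime part usually hinges on a gate like the one below: deploy to the new instance (or colour), poll its health endpoint, and only switch traffic over once it answers cleanly. A minimal sketch with a hypothetical health URL and thresholds:

    ```python
    # Wait for the newly deployed instance to become healthy before cutting over.
    import sys
    import time
    import urllib.request

    NEW_INSTANCE_HEALTH = "http://10.0.1.25:8080/healthz"   # hypothetical
    ATTEMPTS, DELAY_SECONDS = 30, 5

    for attempt in range(ATTEMPTS):
        try:
            with urllib.request.urlopen(NEW_INSTANCE_HEALTH, timeout=5) as resp:
                if resp.status == 200:
                    print("New instance healthy; switch traffic over")
                    break
        except OSError:
            pass
        time.sleep(DELAY_SECONDS)
    else:
        sys.exit("New instance never became healthy; keep traffic on the old one")
    ```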
  • Be familiar with the codebase – Throwing code or a binary over the wall from dev to ops is something best left to the Fortune 500s of the world. In a smaller org, I like to think of the ops guy as a part of the dev team. As such, we should be familiar enough with the codebase to at least fix small bugs and diagnose issues at runtime. If we have the chance, we can find a few small tickets and resolve them ourselves (with the dev team’s blessing, of course). We also ought to generally keep abreast of dev work that’s underway (tickets, commits etc), in addition to of course being involved in the larger roadmap.
  • 2fa – When we become a target, the bad guys getting access to our vendor accounts can be scarily trivial. Enable 2fa everywhere it’s available and request additional lockdown from any vendor who offers it (e.g. “please don’t reset credentials for just anyone who happens to know my mother’s maiden name”).
  • Use a shared mailbox for vendor accounts – In the same way documentation is made available to everyone to whom it’s relevant, so should access to vendor accounts. Not every vendor provides the ability to have additional users (if they do, using it is preferable), so we should have a shared mailbox for shared accounts. This means that no one will have to go trawling through our email in an emergency, and multiple people can keep an eye out for urgent messages.
  • Keep an eye on cloud costs – It can be easy to overlook the cost of the cloud because everything seems so affordable at lower scale. However, costs can get radically out of control if we’re slashdotted, reddited, HN’d, or something goes wrong. Unlike with traditional vendors we’re unlikely to get a call if things go nuts. Even worse, if we’re compromised we could be on the hook for resources we didn’t even use. Cloud vendors typically provide billing alert functionality to keep us from having to worry about it (e.g. CloudWatch).
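    As one concrete example, a billing alarm on AWS can be created with a few lines of boto3 – this assumes billing alerts are enabled on the account (billing metrics only live in us-east-1), and the threshold and SNS topic ARN are hypothetical:

    ```python
    # Create a CloudWatch alarm on the estimated monthly charges.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="monthly-bill-over-500-usd",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,                 # billing metrics update slowly; 6-hour period
        EvaluationPeriods=1,
        Threshold=500.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # hypothetical
    )
    ```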