Site reliability engineering is the application of software engineering skills and principles to monitoring and maintaining system reliability. The practice involves incident response, as well as working proactively on tools to support system maintenance and improvements.
You can think of a site reliability engineer as part medic and part streamliner. They take shifts in the company’s on-call rotation, during which they act as designated responders who manually intervene if the infrastructure system starts to show symptoms. Off rotation, site reliability engineers spend a lot of time writing code, including a significant amount of automation tooling.
What Does a Site Reliability Engineer Do?
Site reliability engineers tend to work across departments more than traditional software engineers do, performing proactive maintenance as well as incident response to keep systems up and running correctly. As Squarespace site reliability engineer John Turner put it, “it turns out that reliability is often a fairly complex problem,” and being able to drive the conversation can be key.
“You have to be able to zoom in on any particular developer’s portion of the product, and then see how that might affect the reliability of other parts of the products,” Turner added.
He spends roughly 70 percent of his time writing code — much of it automation code that cuts down on monotonous, routine tasks that are otherwise time-consuming.
When Turner’s not being paged, he said he can usually be found doing smaller tasks, like writing documentation, helping junior engineers with technical guidance or finding ways to make the next person’s on-call shift a little easier — whether that’s writing automation, changing alerting windows or making visualizations of system properties look nicer.
Shifts are 24/7 and run for one week, always with a secondary on-call person, in case the designated PagerDuty holder has an emergency. Squarespace also has rules in place to avoid burnout. If you have to respond to alerts during the night, you can sleep in in the morning; if you’re awakened by alerts on three consecutive nights, you’re off rotation.
When Turner is on PagerDuty call, he doesn’t want to be occupied with anything too demanding because the prospect of interruption is forever looming.
How to Become a Site Reliability Engineer
Because site reliability engineering is so cross-functional, one of the most important skills is a soft one: communication. SRE work is a bit odd in that you don’t exactly own what you oversee, which means being thoughtful and diplomatic when working with others is paramount.
“You own sort of the meta level — the reliability, or velocity, or whatever the SRE contract is — but you don’t actually own the code … and that can be difficult. It means you need to communicate requirements and best practices in a way that doesn’t seem like a burden to service owners,” Anika Mukherji, a site reliability engineer at Pinterest, told Built In in 2020.
Needless to say, plenty of the necessary skills are hard skills too. For site reliability engineers, commonly used technologies and programs include:
- Cloud servers (AWS)
- Programming languages (Python, Java, C)
- Containers (Kubernetes, Docker)
- CI/CD (Jenkins, Travis)
- Automation (Terraform, Chef, Ansible)
- Monitoring and observability (PagerDuty, New Relic, Datadog)
Site reliability engineering is still relatively new, which means there’s no single, prescribed path to the role. Both Turner and Mukherji came to it through software engineering. Mukherji worked within the engineering team, but focused on backend performance, which she described as “an SRE-similar domain.” She noted that several colleagues started from different backgrounds.
Whatever the route, grace under uptime pressure, a knack for automating away mundane, repetitive tasks and a diplomatic streak are all site reliability engineering prerequisites.
What Is Site Reliability Engineering?
Site reliability engineering originated internally at Google in 2003 before finding a foothold in the broader development world.
Site reliability engineering “outside of Google is still very much in its infancy, which is very interesting,” Turner said. “Some of what works for Google doesn’t or can’t work for other companies.” A designated site reliability engineer may not make sense, for instance, at a 20-person startup, he explained.
Squarespace’s interpretation of site reliability engineering sees Turner spending the majority of his time coding, including writing tools that make it easier for engineers to interact with the infrastructure.
How else does site reliability engineering look outside of Google? “It turns out it’s pretty wide and varied,” Turner said. Some operations teams have simply rebranded with no real functional change. (SRE is, in some part, a marketing term, Turner noted.)
Reliability is a quality, but it’s also a metric. Say your service runs on a single server. Does reliability simply mean the service is up?
“What if it’s up, but it’s only serving errors?” Turner said. “Or what if it’s up and not serving errors, but not serving what the user expected? You can keep asking this series of questions until you arrive at a synthetic measurement that approximates what our user wants our service to do.”
Site reliability engineers use three metrics to monitor the reliability of a service:
- A service level indicator (SLI) is the percentage at which a service runs properly.
- A service level objective (SLO) is the percentage at which a company expects a service to run properly.
- A service level agreement (SLA) is the percentage at which a service is required to run properly, as stipulated by contractual agreements.
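The three thresholds are typically ordered: the contractual SLA is the loosest, the internal SLO is stricter (leaving headroom), and the measured SLI should stay above both. A minimal Python sketch with illustrative, hypothetical numbers:

```python
# Illustrative availability figures, in percent (hypothetical values).
SLA = 99.9    # contractual floor: falling below this can trigger penalties
SLO = 99.95   # internal objective, stricter than the SLA to leave headroom
sli = 99.97   # what the service actually measured over the window

# The internal objective should always be stricter than the contract,
# and a healthy service keeps its measured SLI above the objective.
assert SLA < SLO, "the internal objective should be stricter than the contract"
healthy = sli >= SLO
print(f"SLI {sli}% vs. SLO {SLO}% vs. SLA {SLA}% -> healthy={healthy}")
```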
Service Level Indicators
A service level indicator can generally be calculated using a simple formula:
SLI = (number of successful requests / total number of requests) × 100
There are different SLIs, and which ones a site reliability engineer measures will vary by company. The most commonly tracked SLI is availability, or uptime.
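Applied to availability, the formula above is simply the ratio of good requests to total requests. A minimal sketch in Python, using hypothetical request counts:

```python
def availability_sli(successful: int, total: int) -> float:
    """Return the availability SLI as a percentage: successful / total * 100."""
    if total == 0:
        return 100.0  # no traffic: conventionally treated as fully available
    return successful / total * 100

# Hypothetical window: 999,700 good responses out of 1,000,000 requests.
print(round(availability_sli(999_700, 1_000_000), 2))  # 99.97
```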
Service Level Objectives
Once you have performance numbers, you can then establish performance targets. SLI measurements let you institute service level objectives, or numerical goals for availability, latency and whatever else is being tracked. It’s the percentage of successful requests you need to achieve, rather than the percentage you are achieving — the internal objective that makes sure expectations are met.
How do site reliability engineers establish what constitutes acceptable performance? They prioritize. For instance, Pinterest uses an internal tiering system, which ranks services by stringency requirements.
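One common way to operationalize an SLO is an error budget: the fraction of failed requests the objective permits. Once the budget is spent, reliability work takes priority over feature work. A sketch under a hypothetical 99.95 percent availability SLO:

```python
def error_budget_remaining(slo_pct: float, failed: int, total: int) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = blown)."""
    allowed_failures = (1 - slo_pct / 100) * total  # failures the SLO permits
    if allowed_failures == 0:
        return 0.0
    return 1 - failed / allowed_failures

# Hypothetical month: 1,000,000 requests under a 99.95% SLO
# means roughly 500 failures are allowed; 200 have occurred.
remaining = error_budget_remaining(99.95, 200, 1_000_000)
print(f"{remaining:.0%} of the error budget left")  # 60% of the error budget left
```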
Service Level Agreements
The third foundational pillar of site reliability engineering is service level agreements. Simply put, SLAs share the SLO concept but are legally binding. These are the contractual agreements that stipulate to what degree a service must be available and performant, and what penalties are levied if those expectations are not met.
Site reliability engineers typically aren’t involved in drafting service level agreements. But needless to say, SLAs impact site reliability engineers, who are responsible for monitoring reliability. SLO trigger points can flag site reliability engineers, letting them know that an SLA breach may be on the horizon.
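That early-warning idea can be sketched as a simple threshold check: alert when the measured SLI dips below the internal SLO, well before it falls to the contractual SLA. A minimal illustration with hypothetical thresholds (real setups wire this into tools like Datadog or PagerDuty):

```python
def reliability_status(sli: float, slo: float, sla: float) -> str:
    """Classify a measured SLI against internal (SLO) and contractual (SLA) thresholds."""
    if sli < sla:
        return "SLA BREACH"   # contractual penalties may apply
    if sli < slo:
        return "SLO ALERT"    # page the on-call: an SLA breach may be on the horizon
    return "OK"

print(reliability_status(99.97, slo=99.95, sla=99.9))  # OK
print(reliability_status(99.93, slo=99.95, sla=99.9))  # SLO ALERT
print(reliability_status(99.80, slo=99.95, sla=99.9))  # SLA BREACH
```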
SRE Vs. DevOps
From a bird’s-eye view, much of SRE is about automation and streamlining interactions between infrastructure and operations. That probably sounds similar to another dev-world practice that graduated from buzz phrase to broad adoption: DevOps.
To be sure, in Seeking SRE, a Microsoft cloud advocate plainly states that “reliability engineering and DevOps aim to solve the same problem set” (keeping digital services up while adding improvements), while a Deswik Mining release engineer counters that “SRE and DevOps have a wide scope of overlap, but they are distinct ideas.”
One of the most prominent thought leaders in the field, former Google site reliability engineer Liz Fong-Jones once likened SRE to a concrete class that implements the interface in a programming language — “a prescriptive way of accomplishing that [DevOps] philosophy.” In other words, where you draw the line may differ.
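Fong-Jones’s analogy can be made literal in code: DevOps is the interface, naming the goals without prescribing how to reach them, and SRE is one concrete class that implements it. A purely illustrative toy sketch in Python:

```python
from abc import ABC, abstractmethod

class DevOps(ABC):
    """The 'interface': the philosophy names goals but not how to achieve them."""
    @abstractmethod
    def reduce_silos(self) -> str: ...
    @abstractmethod
    def measure_everything(self) -> str: ...

class SRE(DevOps):
    """One 'concrete class': a prescriptive implementation of the philosophy."""
    def reduce_silos(self) -> str:
        return "shared on-call and cross-functional ownership"
    def measure_everything(self) -> str:
        return "SLIs tracked against SLOs with error budgets"

print(SRE().measure_everything())
```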
For Turner, it boils down to an aversion to silos. “It all starts from the notion that there’s a foundation of trust between teams and the best way to work is cross-functional,” he said. “There aren’t necessarily formalized handoffs between product and operation and security. Instead, you have all these teams working together, sort of all at once.”