What Is a Site Reliability Engineer? What Does an SRE Do?

Squarespace and Pinterest SREs talk automation, error budgets and how site reliability is — and isn’t — like DevOps.
Stephen Gossett
Staff Reporter
September 21, 2021
Updated: March 7, 2022

John Turner wanted to be a rock star. Not a rock-star developer, not a rock-star software engineer, not anything having to do with the weird corporate-speak contortion of “rock star.” A legit rock star.

A metal lover, he fine-tuned his chops studying jazz guitar in Atlanta, then went to work as a professional workaday musician after graduation. But prestige industries sometimes feel less prestigious from the inside, and he soon found himself feeling “very dissatisfied with the day-to-day life of a working musician.”

“It’s not as glamorous as people might think,” he said.

What Is a Site Reliability Engineer?

A site reliability engineer (SRE) monitors and helps stabilize services in production, sets and maintains acceptable performance and availability thresholds (service level objectives), writes code that automates repetitive tasks (toil), and works on-call shifts responding to alerts. The role originated at Google and began to spread more broadly around 2016 after it published a collection of essays about how the role manifests within the company.

That disillusionment led him to New York University’s music technology graduate program, where he could bridge music with lifelong flirtations with programming — including childhood dabblings in Hello, World! and Delphi. But it was an assignment that had nothing to do with music — building a rudimentary spellchecker with C — that may have been most formative.

“It was that taste of pure creation that I eventually fell in love with,” Turner said. “And it’s the thing that keeps me writing software day to day.”

The software he writes now, as a site reliability engineer (SRE) at website-hosting service Squarespace, is more complicated by orders of magnitude. But before we get into the workaday details, it’s worth clarifying:

 

Just What Is a Site Reliability Engineer?

In a manner of speaking, SREs are part medics, part streamliners. They take shifts in the company’s on-call rotation, during which they act as designated responders, available to manually intervene if the infrastructure system starts to show symptoms. Off rotation, they’re spending a lot of time writing code, including a significant amount of automation tooling. Automation helps cut out toil — or the monotonous, routine tasks that would have otherwise become time-sucks for developers — which in turn helps stabilize the system.

“Computers are really good at doing automated tasks over and over — much better than most humans,” said Turner. The critical and creative freedom such automation affords becomes a boon for both reliability and incident response. “Having humans not be stressed out because they’ve just had to engage in monotonous toil turns out to be a really helpful thing — that, to me, is the big industry-wide takeaway from SRE,” he said.

 

The Challenge of Measuring Toil

That said, toil is a notoriously difficult thing to calculate. In the 2021 Catchpoint SRE Report, which surveyed some 300 SREs, only 22 percent of respondents said toil was measured scientifically at their workplace. The other respondents said they either measured toil anecdotally, were trying to measure it but found it difficult to do so, or did not track it at all.

Anika Mukherji, an SRE at Pinterest, told Built In that she’s relied on both anecdotal and more systematic methods. She helped her previous team, the platform team, track false positive rates and incident totals — “time sucks for everyone involved,” she said. Now, her current team completes weekly surveys where everyone estimates how much time they spent in meetings and on basic, keeping-the-lights-on infrastructure maintenance.

Further proof of toil’s slippery nature: The Catchpoint survey reported a whopping 15 percent year-over-year drop in toil, the most precipitous drop registered in the four years since the annual survey launched. And yet, the authors couldn’t pinpoint a reason for the decline. They hypothesized that the shift to work from home may have been the driving force, but couldn’t say for certain.

That said, some pretty clear — and logical — takeaways emerged with regard to toil. For example, more than 40 percent of respondents cited too much technical debt as the major cause of toil.

 

Translating SRE from Its Google Roots

Just as we’re still figuring out the best way to track toil, the role of SRE itself is still germinating in some respects. Site reliability engineering, as both concept and practice, comes from a long tradition that also includes tools like Gerrit and Kubernetes — it originated internally at Google before finding a foothold in the broader development world.

But most companies are smaller than Google. Indeed, when Google in 2016 published the field’s ur-text, a collection of essays by then-current and former SREs, editors caveated their findings somewhat by underscoring the company’s unique status:

“It is no surprise that [SRE] arose in the fast-moving world of web services, and perhaps in origin owes something to the peculiarities of our infrastructure,” they wrote.

Translation, as it were, is still in progress. Site reliability engineering “outside of Google is still very much in its infancy, which is very interesting,” Turner said. “Some of what works for Google doesn’t or can’t work for other companies.” A designated SRE may not make sense, for instance, at a 20-person startup, he said.

Squarespace has experienced both, having grown from scrappy startup to a 1,000-plus-employee, if not quite Alphabet-sized, enterprise. It first hit the gas on expansion roughly between 2010 and 2014, a stretch that included the company’s legendary, diesel-lugging efforts to keep thousands of customers’ sites running after Hurricane Sandy.

Squarespace’s interpretation of site reliability engineering sees Turner spending the majority of his time coding, including writing tools that make it easier for engineers to interact with the infrastructure, and the all-important, aforementioned automation tools.

How else does site reliability engineering look outside of Google? “It turns out it’s pretty wide and varied,” Turner said. Some operations teams have simply rebranded with no real functional change. (SRE is, in some part, a marketing term, Turner noted.) Others approach Google’s example like a cafeteria, picking what works and adapting what doesn’t.

That definitional question helped spawn another key book, Seeking SRE. In it, one contributor, Coburn Watson, who’s now the head of infrastructure and SRE at Pinterest, argues for a context-, rather than control-, driven approach to improving availability of microservices.

In effect, skip micro-managerial direction and guide engineering teams by providing them with plenty of information, from trended availability data to high-level business-review documents. “That’s an interesting way to think about it — providing context to the rest of the engineers to help them make good decisions around what makes the service more reliable,” Turner said.

 

Pinterest headquarters in San Francisco. The company uses an internal tiering system when setting service level objectives. | Photo: Shutterstock

How Do You Measure Site Reliability?

Reliability is a quality, but it’s also a metric. Turner describes a scenario not far removed from a toddler repeatedly asking “why” — except, you know, helpfully. Say your service runs on a single server. Does reliability simply mean the service is up?

“What if it’s up, but it’s only serving errors?” Turner said. “Or what if it’s up and not serving errors, but not serving what the user expected? You can keep asking this series of questions until you arrive at a synthetic measurement that approximates what our user wants our service to do.”

How Is Service Quantified?

SREs use three metrics to monitor the reliability of a service: A service level indicator (SLI) is the percentage at which a service runs properly. A service level objective (SLO) is the percentage at which a company expects a service to run properly. A service level agreement (SLA) is the percentage at which a service is required to run properly, as stipulated by contractual agreements.

 

Service Level Indicators

Those measurements are generally expressed in a simple formula, the solution of which is called a service level indicator (SLI):

SLI = (number of successful requests / number of total requests) x 100
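As a rough illustration, the formula can be sketched in a few lines of Python. The `Request` type, the 500 ms latency cutoff and the rule for what counts as "successful" are assumptions made for this example, not definitions from Squarespace, Pinterest or Catchpoint; each team decides for itself what a good request looks like.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def is_successful(req: Request, max_latency_ms: float = 500.0) -> bool:
    """Hypothetical success rule: no server error and served fast enough."""
    return req.status < 500 and req.latency_ms <= max_latency_ms


def sli(requests: list[Request]) -> float:
    """Successful requests / total requests x 100."""
    if not requests:
        raise ValueError("no requests observed; SLI is undefined")
    good = sum(is_successful(r) for r in requests)
    return good / len(requests) * 100


sample = [
    Request(200, 120.0),
    Request(200, 90.0),
    Request(503, 30.0),   # server error, counts against the SLI
    Request(200, 80.0),
]
print(sli(sample))  # 75.0
```

Tightening `is_successful` is one way to encode the "up, but is it serving what the user expected?" line of questioning Turner describes.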

But which SLIs are measured by SREs varies by company. The most commonly tracked SLI is availability, or uptime. In the Catchpoint survey, 85 percent of respondents said they track availability/uptime. The second most commonly tracked SLI was performance/response times (77 percent), followed by latency (66 percent), error count (62 percent) and throughput (38 percent).

In those numbers, you can see the maturation of site reliability engineering, said Leo Vasilou, of Catchpoint. The first pillar of site reliability engineering was, and remains, availability. But the landscape has evolved to often include facets like performance — “where we hear expressions like ‘slow is the new down’” — and reachability — “what good is being highly available or highly performing at the source if your service cannot reach users where they are?” Vasilou said.

Mukherji, the Pinterest SRE, said that the social media site now applies the SLI framework to internal speed and performance concerns as well, such as commit-to-production time and site request speeds. Service metrics that impact the business and that you can conceivably control are potential candidates. 

“You want [SLIs] to be both meaningful and actionable,” she said.

 

Service Level Objectives

Once you have performance numbers, you can then establish performance targets. SLI measurements let you institute service level objectives (SLOs), or numerical goals for availability, latency, and whatever else is being tracked. It’s the percentage of successful requests you need to achieve, rather than the percentage you are achieving — the internal objective that makes sure expectations are met.

How do SREs establish what constitutes acceptable performance? They prioritize. For instance, Pinterest uses an internal tiering system, which ranks services by stringency requirements. Services that don’t impact advertisers or user experience fall into lower tiers, so they have minimum SLOs. But a tier-one service — which includes anything that could bring down the product — obviously requires more resilience. At the top tier, Mukherji said, “you’re going to need three nines” — or 99.9 percent.

Sometimes, setting SLOs requires time and leeway for reassessment. A service that has more wiggle room than, say, availability, might initially be set to whatever its current SLI is, to make sure the status quo doesn’t fall, Mukherji said. From there, the team can evaluate whether boosting that SLO percentage is desirable and feasible.

Every SLO also leads to what’s called an error budget. That’s simply the difference between 100 percent and the SLO percentage. So a 99.9 percent SLO would have an error budget of 0.1 percent. In other words, it’s the amount of allowable time things can go wrong.
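To make the arithmetic concrete, here is a minimal sketch that turns an SLO into an error budget and then into minutes of allowable failure. The 30-day window and the function names are assumptions for illustration; the article does not say which window Pinterest or Squarespace uses.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget is 100 percent minus the SLO percentage."""
    return 100.0 - slo_percent


def allowed_bad_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Convert the budget into minutes of allowable failure in a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_percent(slo_percent) / 100.0


def budget_remaining_minutes(
    slo_percent: float, bad_minutes: float, window_days: int = 30
) -> float:
    """A negative result means the service is over budget."""
    return allowed_bad_minutes(slo_percent, window_days) - bad_minutes


# Results are rounded because binary floats cannot represent 99.9 exactly.
print(round(error_budget_percent(99.9), 3))          # 0.1
print(round(allowed_bad_minutes(99.9), 1))           # 43.2
print(round(budget_remaining_minutes(99.9, 50), 1))  # -6.8
```

Under these assumptions, a "three nines" service can be failing for about 43 minutes in a 30-day window before its budget runs out.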

What happens when something goes over budget? Mukherji said the team meets, with a one-pager overview of the episode, to determine what — if anything — should be addressed. Oftentimes an exhausted error budget is the result of an unforeseeable outage. (There’s a reason 99.9 percent isn’t 100 percent.) Or perhaps a failure in one of the service’s dependencies cascaded — a downed API domino-ing the web application, for instance. In other words, not much to be done. But other situations might call for more intervention. If a rate-limit error budget is routinely exceeded, for example, that might be a sign of an inappropriately set target.

“Sometimes running out of error budget is actionable and sometimes it’s not,” Mukherji said.

Do SREs regularly adjust those numbers? It seems to be evenly split. Fifty percent of respondents in Catchpoint’s survey said that they “continually refine” SLOs. (What constitutes “continual” was open to interpretation.) 

Mukherji, for one, prefers a wait-and-watch approach. Outside of a bump-up demand from leadership, SREs at Pinterest tend to investigate the appropriateness of an SLO percentage when a service depletes its error budget. Generally speaking, that approach is driven by the fact that SLO measurement is a tricky science. More than two-thirds of respondents in the Catchpoint survey said it’s not easy to either choose SLOs or to find the right data to support one’s SLOs.

SLOs tend to remain internal, though a minority of companies opt to make some external too. A third of respondents in the Catchpoint report said they publish SLOs for clients and users. The report didn’t state whether that included publishing whether SLOs were achieved or not, but it’s easy to see why a company might not want to broadcast a missed objective. 

“If a tree falls in the woods and no one’s around to hear it, then did it really fall?” said Vasilou, interpreting the mindset behind not publicizing.

At the same time, those numbers may simply not be of interest to most users, particularly outside of B2B contexts. That said, Mukherji said she could see a rationale for certain situations, such as SLOs related to external-facing APIs or for taking part in larger engineering community discussions.

 

Service Level Agreements

The third foundational pillar of site reliability engineering is service level agreements (SLAs). Simply put, SLAs share the SLO concept but are legally binding. These are the contractual agreements that stipulate to what degree a service must be available and performant, and what penalties are levied if those expectations are not met. Penalties might include a client service credit or refund, be it full or a percentage of the bill.

SREs typically aren’t involved in drafting service level agreements. That falls to legal teams, perhaps with input from finance and business procurement departments, and higher-level IT functions. But needless to say, SLAs impact SREs, who are responsible for monitoring reliability. 

That distance, between those who draft SLAs and those internally who are much affected by them, has been known to cause some headaches. One way to help minimize those is to set SLO thresholds stricter than the corresponding SLA terms, so the SLO trips first. That way, SLOs serve as a canary in the coal mine, warning SREs that an SLA breach may soon be in the offing.
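The canary idea can be sketched as a simple three-way check. The thresholds below (a 99.9 percent SLO sitting above a 99.5 percent SLA) and the status labels are hypothetical values chosen for the example.

```python
def reliability_status(
    sli_percent: float, slo_percent: float, sla_percent: float
) -> str:
    """Classify a measured SLI against an internal SLO and a contractual SLA."""
    if slo_percent <= sla_percent:
        # The early-warning pattern only works if the SLO trips first.
        raise ValueError("the internal SLO should be stricter than the SLA")
    if sli_percent < sla_percent:
        return "sla_breach"
    if sli_percent < slo_percent:
        return "slo_warning"
    return "ok"


print(reliability_status(99.95, slo_percent=99.9, sla_percent=99.5))  # ok
print(reliability_status(99.7, slo_percent=99.9, sla_percent=99.5))   # slo_warning
print(reliability_status(99.2, slo_percent=99.9, sla_percent=99.5))   # sla_breach
```

The gap between the two thresholds is the room the team has to react before any contractual penalty applies.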

Taken together, these service indicators and targets underscore site reliability engineering’s central concept: reliability as a measurable data point — “not just a gut feeling or a personal perspective,” as Turner said — and the concrete steps taken to improve that point.

 

A Day in the Life of a Site Reliability Engineer

So how, exactly, does that manifest day to day? For laypeople, “site reliability” likely conjures up images of putting out digital fires, fixing outages — or, preferably, precursors to outages. But Turner spends some 70 percent of his time writing code — much of it automation code.

“To take a trivial example, if a computer needs to be restarted, one way is to hit the virtual restart button,” Turner said. “Another way is to write something that does it from your computer. And then another way is to do something that automatically monitors the machine, to see if it needs to be restarted, and just restart it without any humans. So it’s that level of writing automation tooling.”
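The third level Turner describes, a monitor that restarts a machine with no human involved, might look something like the sketch below. This is not Squarespace's tooling: `needs_restart` and `restart` are hypothetical callables standing in for a real health check and a real restart command, and the loop runs a fixed number of checks so the example terminates.

```python
import time
from typing import Callable


def watchdog(
    needs_restart: Callable[[], bool],
    restart: Callable[[], None],
    checks: int,
    interval_seconds: float = 0.0,
) -> int:
    """Poll a machine and restart it whenever the check says so.

    Returns how many restarts were performed.
    """
    restarts = 0
    for _ in range(checks):
        if needs_restart():
            restart()
            restarts += 1
        time.sleep(interval_seconds)
    return restarts


# Stand-in machine for demonstration: unhealthy on the second check only.
health_readings = iter([False, True, False])
log: list[str] = []

count = watchdog(
    needs_restart=lambda: next(health_readings),
    restart=lambda: log.append("restarted"),
    checks=3,
)
print(count, log)  # 1 ['restarted']
```

Passing the check and the action in as functions keeps the loop testable without touching a real server.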

Each day also typically brings a one-on-one, either with a colleague or an earlier-career developer.

One-on-ones with other engineers can be particularly fruitful for clearing technical roadblocks. It’s a time to ask: “Have you seen a similar [issue]? How did you fix it? Do you know other people who are doing this across the organization that I can talk to?”

It’s also an opportunity to gather some high-level career advice. “It might be: ‘Hey, I’m getting really interested in distributed systems. What should I be doing to work on these more?’” Turner said.

Likewise, check-ins with more entry-level contributors range from double-checking work through pair code reviews to lending advice on technical interests.

“A lot of people are still very interested in Kubernetes,” said Turner, whose team works closely with container technology. “It’s certainly not the newest tech on the market anymore, but it’s still very exciting, complex and will likely be a building block for important systems down the road.”

The better young programmers grasp the technical details of containers, the more valuable they’ll be, he added.

 

Squarespace headquarters. | Photo: Squarespace

Cross-Departmental Collaboration

That points to the importance of leadership. SREs tend to work across departments more than a traditional software engineer. As Turner put it, “it turns out that reliability is often a fairly complex problem,” and being able to drive the conversation can be key. “You have to be able to zoom in on any particular developer’s portion of the product, and then see how that might affect the reliability of other parts of the products,” he said.

That leadership aspect manifests away from the codebase too. At the beginning or end of quarters, Turner and other senior engineers and contributors are either running postmortems or sussing out the technical details of new projects. “What do the solutions look like? What are the dependencies between different types of projects?” he said.

Turner also spends time contributing to architecture reviews, or “making sure [product] designs are technically sound and that all contingencies have been thought through.” Members from across the engineering team contribute; Turner represents the reliability/infrastructure side, “especially if it has to do with containers or Kubernetes, which is sort of my expertise,” he said.

That said, the day-to-day changes complexion when Turner is on PagerDuty call. You don’t want to be occupied with anything too demanding when the prospect of interruption is forever looming.

“When I’m not being paged, I’ll do smaller tasks like writing documentation, helping people with technical guidance, or finding ways to make the next person’s on-call shift a little bit easier: writing automation, changing alerting windows, making visualizations of system properties look a little bit nicer, whatever I can do to make the next person’s life a little easier,” he said.

Shifts are 24/7 and run for one week, always with a secondary on-call, in case the designated PagerDuty holder has an emergency. Squarespace operates what’s known as a tiered escalation policy, a failsafe that notifies a broader range of team members if the designated responders are unable to respond.
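A tiered escalation policy can be pictured as an ordered list of groups, each wider than the last. The sketch below is a simplified model rather than PagerDuty's API or Squarespace's actual policy: the tier names are invented, and `acknowledges` is a hypothetical callback that stands in for paging someone and waiting for an acknowledgment.

```python
from typing import Callable, Optional


def escalate(
    tiers: list[list[str]],
    acknowledges: Callable[[str], bool],
) -> tuple[Optional[str], list[str]]:
    """Notify each tier in order; stop at the first tier where someone responds.

    Returns the acknowledging responder (or None) and everyone who was paged.
    """
    paged: list[str] = []
    for tier in tiers:
        paged.extend(tier)  # the whole tier is notified at once
        for responder in tier:
            if acknowledges(responder):
                return responder, paged
    return None, paged


policy = [["primary"], ["secondary"], ["team-lead", "engineering-manager"]]
unavailable = {"primary"}

responder, paged = escalate(policy, lambda name: name not in unavailable)
print(responder, paged)  # secondary ['primary', 'secondary']
```

Real systems add timeouts between tiers; that detail is left out here to keep the example short.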

Rules are also in place to avoid burnout. If you have to respond to alerts during the night, you can sleep in in the morning; if you’re awakened by alerts on three consecutive nights, you’re off rotation. “Health and life concerns are really important to acknowledge,” Turner said. “People are not an expendable resource.”
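The three-consecutive-nights rule is easy to express as a check over a shift's history. This is only an illustration of the rule as described; the function name and the list-of-booleans input format are assumptions.

```python
def should_leave_rotation(woken_by_night: list[bool], limit: int = 3) -> bool:
    """True if the engineer was woken by alerts on `limit` consecutive nights."""
    streak = 0
    for woken in woken_by_night:
        streak = streak + 1 if woken else 0  # a quiet night resets the streak
        if streak >= limit:
            return True
    return False


print(should_leave_rotation([True, True, False, True]))  # False
print(should_leave_rotation([False, True, True, True]))  # True
```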

 

Who Should Become an SRE?

Because site reliability engineering is so cross-functional, one of the most important skills is a soft one: communication. SRE work is a bit odd in that you don’t exactly own what you oversee, which means being thoughtful and diplomatic when working with others is paramount.

“You own sort of the meta level — the reliability, or velocity, or whatever the SRE contract is — but you don’t actually own the code … and that can be difficult. It means you need to communicate requirements and best practices in a way that doesn’t seem like a burden to service owners,” Mukherji said.

Needless to say, plenty of the necessary skills are hard skills too. For SREs, commonly used technologies and programs include:

  • Cloud servers (AWS)
  • Programming languages (Python, Java, C)
  • Containers (Kubernetes, Docker)
  • CI/CD (Jenkins, Travis)
  • Automation (Terraform, Chef, Ansible)
  • Monitoring and observability (PagerDuty, New Relic, Datadog)

Site reliability engineering is still relatively new, which means there’s no single, prescribed path to the role. Both Turner and Mukherji came to it through software engineering, but Mukherji noted that several colleagues started from different backgrounds. She herself worked within the engineering team but focused on backend performance, “an SRE-similar domain,” she said.

Whatever the route, grace under uptime pressures, a knack for streamlining out mundane, repetitive tasks, and a diplomatic streak are all prerequisites.

 

Common area at Squarespace headquarters. | Photo: Squarespace

Wait, Isn’t This Just Like DevOps?

From a bird’s-eye view, much of SRE is about automation and streamlining interactions between infrastructure and operations. That probably sounds similar to another dev-world practice that graduated from buzz phrase to broad adoption: DevOps.

To be sure, in Seeking SRE, a Microsoft cloud advocate plainly states that “reliability engineering and DevOps aim to solve the same problem set” (keeping digital services up, while adding improvements), while a Deswik Mining release engineer opined that “SRE and DevOps have a wide scope of overlap, but they are distinct ideas.”

One of the most prominent thought leaders in the field, former Google SRE Liz Fong-Jones, once likened SRE to a concrete class that implements the DevOps interface, to borrow programming-language terms — “a prescriptive way of accomplishing that [DevOps] philosophy.” In other words, where you draw the line may differ.

For Turner, it boils down to an aversion to silos. “It all starts from the notion that there’s a foundation of trust between teams and the best way to work is cross-functional,” he said. “There aren’t necessarily formalized handoffs between product and operation and security. Instead, you have all these teams working together, sort of all at once,” he said.

“There’s a lot of cross-pollination there,” between SRE and DevOps, he said.

That includes the container orchestration systems in which Turner specializes at Squarespace. From containers, our conversation veers a bit into programming languages and ecosystem dependency. Go offers an across-the-board ease-of-use, but it’s not very expressive, he mused. Haskell is more interesting, though maybe not a great fit to use professionally, he said. Rust, on the other hand, seems just a killer app away from seeing greater adoption for “both systems-level programming and higher-level application programming.”

“I’ve always been a bit of a programming-language nerd,” he said — a long way from primitive spell-checkers and MAX/MSP patches but still as opinionated and passionate.
