How a Brilliant Platform Engineer Saved Twilio From ‘Absolute Disaster’
In 2013 Twilio was in rapid growth mode. We’d gone from about $1 million in annual revenues in 2010 to more than $30 million in 2012. We’d raised four rounds of venture capital, totaling $103 million, and grown from three founders to more than a hundred employees, more than half of whom were software developers building our products.
But we had a problem. Our “build system” — the software infrastructure that those roughly 50 developers used to submit their code to our repository, run tests on it, package it up in a deployment, and deploy it to our main production servers — was showing its age. I had built that system in 2008 when we started the company, and it was never designed to support 50 engineers all submitting code all day long and then deploying it to hundreds of servers.
When I built it, I could commit my code and have it running on a server in five minutes. By 2013, because of the growth of the codebase and the complexity of the tests and builds, the process was sometimes taking as long as 12 hours! Not only that, but the build would actually fail a substantial number of times — at worst, up to 50 percent of the time — and the developer would have to start over again. We regularly lost days of productivity just getting code out. This was the opposite of moving fast.
Writing the code wasn’t the hard part. Wrangling our antiquated systems was. Talk about a self-inflicted wound. As a result, our best engineers started quitting, frustrated at the inability to do their jobs. At first it was a few, and before we knew it, nearly half of our engineers had quit. Half! It was an absolute disaster, and it almost tanked the company.
And so we embarked on a rapid, and painful, plan to rebuild our developer platforms to support our growth. Our first move was to hire a guy named Jason Hudak to head the platform team.
Jason had worked at Yahoo for more than a decade, building the infrastructure to support its thousands of engineers. Jason is probably not what you imagine when you think of a software engineer. He’s a ruddy-faced Texan and a former Marine. He went to Texas Tech and studied business, not computer science. He’s more or less self-taught, having learned to write code after landing a job at a tech company in the 1990s and studying alongside engineers who recognized his potential. Jason spends his free time snorkeling, cycling, and hunting wild boars in Texas. He’s also an accomplished abstract painter. He’s gifted me two pieces of his art, which hang proudly in my office. He comes to work wearing T-shirts, flip-flops, and trucker caps. But beneath his easygoing manner there’s an intensity and discipline that he learned way back in Marine boot camp.
This combination was crucial as we started to embrace DevOps in building our developer platform.
A cynic might say DevOps has become kind of the “flavor of the month” for software development, the way Agile and Lean Startup did before. Amazon lists more than a thousand books on the topic. You could spend years learning everything about DevOps, but for our purposes I’m going to provide an extremely simplified explanation:
Once upon a time, software development organizations broke the process of producing a piece of code into multiple roles. Tasks like coding, building, testing, packaging, releasing, configuring, and monitoring were handled by separate people. Developers wrote code, then handed it off to quality engineers, who found the bugs. Release engineers got the code ready for production. Once people were actually using the program, site reliability engineers (SREs) were tasked with keeping it running. SREs were the ones who “wore the pager,” meaning they were on call at night, or on weekends, and were expected to drop everything and fix the code when a program bonked.
Breaking work into specialized roles had certain advantages, but it also slowed things down. Developers would toss code over the wall to the quality engineers, who would bash away on it and send it back for fixes. That process would go back and forth, and through several different kinds of testing. Then the code would go to the release engineers, who might toss it back, and then to site reliability engineers, who also might toss it back. (You can tell I’m not a fan of throwing things over walls.)
At each step there could be delays as a developer waited for a test engineer or release engineer to finish other projects and then get to theirs. Multiply all those potential delays by the number of steps, and you can see how things could get bogged down.
DevOps, first conceived about a decade ago, represents an attempt to speed things up by having one developer handle all of the steps. The concept is reflected in the name itself: instead of having “developers” who write code and “operators” who do everything else, you combine all of the duties in one person. In a DevOps environment, the same developer writes the code, tests the code, packages it, monitors it, and remains responsible for it after it goes into production.
That last sentence conveys one of the most important elements of modern software development and something that we at Twilio consider to be almost a sacred value: the person who writes the code also “wears the pager” for that code after it goes into production.
It’s your code. If it crashes, you fix it. We like this idea because it pushes developers to deliver higher-quality code. The dread of taking those middle-of-the-night phone calls provides a little extra incentive to take another pass through your work before you ship.
It’s not as though we permit teams to ship code that’s constantly crashing, even if they’re the ones waking up to fix it. Customers would still be impacted. So Jason and his team created a checklist of best practices called the Operational Maturity Model (OMM). It consists of six categories of excellence: documentation, security, supportability, resiliency, testability, and privacy. In total, there are 41 steps.
And here’s the catch: In order for teams to consider their product generally available (GA), meaning it’s ready for mission-critical customers, they have to demonstrate excellence in each category. Achieving a perfect score across the board is the highest level of achievement. We call it “Iron Man.”
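One way to picture the OMM is as a scoring model over those six categories. Here is a minimal, hypothetical sketch in Python: the six category names come from the text, but the thresholds, step counts, and pass/fail logic are illustrative assumptions, not Twilio’s actual model.

```python
# Hypothetical OMM-style readiness check. The category names are from the
# text; the 0.8 "excellence" threshold is an invented stand-in.

OMM_CATEGORIES = ["documentation", "security", "supportability",
                  "resiliency", "testability", "privacy"]

def readiness(scores: dict) -> str:
    """scores maps each category to the fraction of its steps completed."""
    if any(c not in scores for c in OMM_CATEGORIES):
        return "not assessed"
    if all(scores[c] == 1.0 for c in OMM_CATEGORIES):
        return "Iron Man"   # perfect score across the board
    if all(scores[c] >= 0.8 for c in OMM_CATEGORIES):
        return "GA"         # excellence demonstrated in every category
    return "not GA"
```

The key property the sketch captures is that GA is gated on *every* category, not an average: one weak category blocks the whole product.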
In the traditional model, developers perform only some of those practices. Maybe they write some tests, but not full end-to-end tests. Maybe they document their code, but don’t enable the support team. Maybe they have good security practices, but not privacy practices. It’s not that they don’t care; they’re just not versed in what excellence looks like. The best way to get good at these things, of course, is to automate them. Yet if every team had to become domain experts and build its own automation for each of these categories, it would take forever. That’s where Jason’s team comes in.
Jason defines his job, and that of the platform team — a group of about 100 engineers across 13 small teams — as “to provide software that will enable a traditional software developer to be successful in a DevOps culture without having a deep background in all of these specialized disciplines.” They don’t develop software that ships to customers. They make software that developers use to write, test, deploy, and monitor software. If anything about our process resembles an assembly line, this is probably the closest thing. Platform engineers are the people who design and optimize the “assembly line” that speeds innovation.
We wanted to make it easier and faster for developers to write code that achieves operational maturity with as little work as possible. Our solution was to build a platform that provides all of those functions in one place. Jason likens it to a big stained-glass window, a single pane of glass with many elements. Developers can access all of the tools they need through that single pane of glass.
Their standards are high. “Software engineers are the most cynical, critical, curmudgeonly bunch on Earth,” Jason says. “I can say that because I am one of them. They’re intellectually honest, but you get the most brutal feedback. The reason I’m building platforms is if you can build software that makes other software engineers happy, you can build software for anything.”
When Jason joined Twilio, he drew up a list of principles and values to inform the way he builds and runs the platform. He had to walk a tricky line, striking a balance between giving developers freedom and autonomy and persuading them to adhere to a set of standard ways of doing things. The standards give us cohesion across almost all parts of the codebase. But we don’t want to be so rigid that we stifle innovation. We’re constantly trying to get that balance right.
Here are the principles he landed on:
The Paved Path
The Admiral developer platform includes all of the tools a developer needs. But developers don’t have to use them. If you love a particular testing tool and it’s not in the platform, you can still use it. Jason calls this “off-roading” versus the “paved path,” meaning if you want to use the tools we’ve chosen, your life will be easy, like driving on a paved road. However, you’re free to go off-road, driving through the brush and down dirt roads. You’ll still get where you need to be, but it might take longer. If it’s really that important to you, or if that special tool gives you some advantage, by all means go for it. One of Jason’s favorite expressions is “We don’t have rules — we have guardrails.” But if you go off-roading, you’re still on the hook for things like security and resiliency, which makes the paved path look all the more attractive.
Choose Your Language
Another example: We don’t force developers to use only one language. Instead, we support four languages — Python, Java, Scala, and Go. A developer can use any one of those four and still get a fully supported platform. As with tools, developers have permission to choose other languages too, but again, it’s about driving on the paved path versus going off-roading. “If you want to build something in C or some other language, by all means do it, because we’re not here to tell you what you can or can’t do,” Jason explains. “Just know that you may have some heavy lifting to do, because you won’t be able to use all of these tools in the platform.”
The goal is to provide developers with a menu and let them pick and choose what they want, whenever they want, without having to go through any gatekeepers. They also don’t need to know how those processes work. They just choose what they want. It’s like pressing a number on the vending machine and getting a Diet Coke. You don’t care how the machine does that. “Developers just tell us what they need done, and we don’t want them to care about how it gets done. You just tell us what you want, and we’ll take care of that for you.”
Opt in to Complexity
Admiral is set up so that each tool has a specified way of doing things — “an opinionated workflow,” Jason calls it, meaning the platform engineers have certain opinions about the best way to use this tool. But, once again, developers don’t have to follow those rules. “We allow developers to configure the software to perform more complex activities, or even to use the software to do things we hadn’t considered when we were building it. Our mantra is ‘The common should be easy and the complex should be possible.’”
Behave Compassionately but Prioritize Ruthlessly
“We never like to say no,” Jason says. “But if one team has a request for something that would be cool to do, and another team has a project that will unlock $90 million in recurring revenue for the company, we’re going to solve that one first and put the other request on our backlog.”
Composable Over Monolithic
Our software is based on a microservices architecture composed of hundreds of microservices. Each microservice performs a single function or capability. The advantage of microservices is that we can route around or absorb a failure. If one service fails, it won’t bring down the entire Twilio voice system, for example. The services are all loosely coupled. They’re all built by different teams, which can work independently. One microservice might be version one or two, and another might be on version five. But as long as they all “speak” to the API that connects them, that’s fine.
Platforms: The Software That Makes the Software
At Twilio we’ve spent years incrementally building the “machine” that produces our software — the Admiral platform that Jason Hudak and his team designed — saving a little bit of time here, a little bit there. I’m going to try not to get too far down into the weeds here, but I want to spend time describing the way this process works because it’s so important to any modern software organization. A good platform will radically slash the time it takes for developers to get new code into production, letting fewer developers produce more code in less time.
The Admiral platform is based on the concept of “pipelines” — the process that kicks off when a developer commits new code. Every team can customize its pipeline based on the unique aspects of its product and its working style — thus enabling autonomy. But there are several default, preconfigured pipelines that teams can start from. These represent the most heavily paved paths for standard workflows, such as websites, microservices, or database clusters.
A typical pipeline starts by running unit tests — the most basic kind of code tests that developers write. Then it runs more sophisticated tests, like integration tests, which test how the software interacts with other services it depends on. Passing those, the code runs through “failure injection testing,” simulating real-world scenarios in which computers fail, such as network outages or hard disk failures. Then come the load tests, testing what happens when the volume of requests spikes, as well as durability testing, simulating sustained high loads to find memory leaks or other issues that arise only after a long period of stress.
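The staged gauntlet above can be sketched as a simple runner that executes each stage in order and stops at the first failure. This is an illustration of the concept, not Admiral’s actual implementation; the `run_stage` callback stands in for whatever test harness a team plugs in.

```python
# Minimal pipeline sketch: stages run in order; the first failure halts
# the build. Stage names mirror the description in the text.

DEFAULT_STAGES = ["unit", "integration", "failure-injection",
                  "load", "durability"]

def run_pipeline(commit, run_stage, stages=DEFAULT_STAGES):
    """Run each stage on a commit; stop at the first failure.

    run_stage(commit, stage) -> bool is supplied by the team, which is
    one way a pipeline can be customized per product.
    """
    for stage in stages:
        if not run_stage(commit, stage):
            return ("failed", stage)   # build stops; developer investigates
    return ("passed", None)
```

For example, `run_pipeline("abc123", lambda c, s: s != "load")` simulates a commit that survives everything until the load tests.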
Passing all of these, the code moves into the “staging” environment for another set of tests — this is a complete copy of our real-world system, but used only for internal testing. Finally, if all is going well, the code is moved to the “production” cluster — the systems our customers actually use. The rollout to production, though, isn’t instant.
Typically the code is phased in via a “canary deployment,” as in “canary in the coal mine.” A small percentage of requests are sent to the new software, and that percentage is slowly ramped up over time if no issues arise, until the new code is handling 100 percent of the production requests. If at any point issues are detected, the old code is rotated back in and engineers are notified so they can investigate the problem.
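The canary ramp-up just described might be sketched like this. The routing logic, the ramp schedule, and the `healthy` check are all hypothetical stand-ins for real traffic management and error-rate monitoring.

```python
import random

# Illustrative canary rollout: send a growing fraction of requests to the
# new version, and roll back at the first sign of trouble. The schedule
# percentages are invented for the example.

def canary_router(canary_fraction: float) -> str:
    """Decide which version serves one incoming request."""
    return "new" if random.random() < canary_fraction else "old"

def ramp_up(healthy, schedule=(0.01, 0.05, 0.25, 0.50, 1.0)):
    """Walk the ramp schedule; abort on the first failed health check.

    healthy(fraction) -> bool stands in for monitoring the new code's
    error rate while it serves that fraction of traffic.
    """
    for fraction in schedule:
        if not healthy(fraction):
            return ("rolled back", fraction)   # old code rotated back in
    return ("fully deployed", 1.0)
```

A rollout that degrades once the canary takes a quarter of the traffic, for instance, would be caught and rolled back at that step rather than reaching 100 percent.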
For most teams, this entire process is now automated. As you can imagine, doing this work manually would be excruciatingly slow, tedious, and error prone. In reality, when the process isn’t automated, most teams simply omit many of the steps, which introduces risk. The paved path is a powerful idea: because so much of this infrastructure is ready and waiting, doing it right is also relatively easy. That allows teams to move quickly and confidently.
However, as awesome as I’ve made Admiral sound, teams are not required to use it. Small-team autonomy means that they’re not forced to use a particular tool if they don’t want to. Instead, they choose to use it. So Jason, like anybody “selling” a product, has to win over his customers: the internal developers at Twilio. That’s where his principles really come into play.
With preconfigured pipelines, Admiral makes it easy for standard types of services to be built and deployed. However, for teams to adopt the tool, they have to be able to dive in and make changes if needed. Otherwise, they’d have to build their own tooling outside of Admiral, and lose the benefits. That’s where one of Jason’s other principles, opt in to complexity, comes into play. While teams can take the default settings, they can also dive into the bowels of Admiral, and rewire it for the particulars of their project. Don’t like the default unit testing framework? Developers can plug in their own, while still keeping all the benefits of Admiral and the rest of the pipeline. The same goes for all components.
This gives teams autonomy to pick their tools, while making the defaults easy and attractive, helping to encourage adoption of Admiral. As of today, 55 percent of all deployments use the full pipeline functionality of Admiral. Most of the rest use parts of Admiral, but not the entire thing. And those numbers are growing all the time.
The False Dichotomy: Fast Versus Good
The cadence of software innovation is faster than ever before. Turning customer insights into products happens at lightning speed in this digital era. Yet there’s often a question of whether teams should move quickly to capture opportunities and respond to customer needs, or move more cautiously, ensuring that everything works properly, scales well, and is bug-free.
However, at really good software companies, this is a false dichotomy. Platforms like Admiral are what enable developers to quickly develop high-quality code, and move it to production with confidence that they aren’t breaking the customer experience with every code deployment.
Jason’s number one mission is to speed up every engineer at Twilio, while ensuring they meet the demands for quality, security, and scalability. Instead of six months, can we deliver a new feature in six weeks? Six days? Six hours? Jason reckons the platform does 80 percent of the work that a developer previously had to do. Some processes that previously took weeks or even months now can be done “with a few clicks in a few minutes.” Today, Twilio releases new code to production over 160,000 times per year — that’s nearly 550 times every single working day.
Bite the Bullet
Some people might push back on the idea of spending money on infrastructure teams. We’ve had this argument nearly every year in our budgeting process. It’s easy to get pulled into the trap of hiring more and more developers who work on customer-facing products, because that return feels more immediate — and their work translates more apparently into revenue. But infrastructure engineers make your entire development team more efficient. “Platforms are a force multiplier,” Jason says. “It’s like a fulcrum. For every dollar I put in I can return five dollars.”
Here’s an example. In 2018, it took our developers 40 days to develop a new Java service. We wanted to speed things up. In theory, we could hire twice as many engineers, and they would produce twice as many services per year, right? (In fact, doubling our developer head count would not double our productivity, but for the sake of argument let’s pretend it would.) But that would mean hiring hundreds of new developers.
Instead, Jason grabbed two platform engineers, and they automated a bunch of steps in our development process. Their work slashed development time in half — from 40 days to 20 days. The impact gets magnified because we develop about 200 new Java services per year. Yes, we spent money on those two platform engineers. But their work saved us 4,000 person-days per year. That’s the argument for spending money on infrastructure instead of hiring more product developers.
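The savings arithmetic in that example works out as follows:

```python
# Back-of-the-envelope math from the Java-service example above.
services_per_year = 200
days_before = 40   # development time per new Java service in 2018
days_after = 20    # after two platform engineers automated the process

person_days_saved = services_per_year * (days_before - days_after)
print(person_days_saved)  # 4000 person-days per year
```

Two engineers’ worth of platform work returning 4,000 person-days a year is the “force multiplier” Jason describes.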
Instead of focusing on how much it costs to build a platform team, focus on the return those platform engineers can deliver. But also realize that these investments take time to pay off. Not only do you have to build the team and have them build the infrastructure; the other teams then need to adopt it. This cycle takes time, but over a multiyear horizon it pays back in spades. It truly becomes a source of competitive advantage.
When we do hire new developers, they come up to speed much faster thanks to Admiral. “A few years ago it took us four months to get new engineers trained to the point where they could be a contributing part of a team,” Jason says. “Today we can have them developing in a week.” Again, it’s all about the return on investment. Platform engineers punch way above their weight.
As huge as our gains have been, though, we think the platform can make even more dramatic improvements in speed. Jason wants to get the Java deployment process that got cut from 40 days to 20 days down to one day — or even just a few hours. One of his 13 teams is focused solely on optimizing the platform itself in these ways. They study how developers use the product, searching for places where developers get stuck or slowed down, and eradicating them. To measure the time developers spend fiddling with tools, Jason created a metric called Time Spent Outside Code (TSOC). Our average TSOC might never get to zero, but the goal is to get as close as possible.
“The future of platforms will be allowing software developers to focus only on their features and their customers,” Jason says, “and not about all the underlying systems that are required to bring software from somebody’s head to the cloud to a device and to an experience for a customer.”
* * *
From the book Ask Your Developer: How to Harness the Power of Software Developers and Win in the 21st Century. Copyright © 2021 by Jeffrey Lawson. Reprinted by permission of Harper Business, an imprint of HarperCollins Publishers.