The Cloud’s Kernel of Optionality
One of the most interesting enterprise cloud deals in 2020 has got to be Zoom’s cloud expansion with Oracle Cloud Infrastructure, in which the three major “hypercloud” providers (AWS, Azure, and GCP) were completely bypassed in favor of an unlikely underdog. Naturally, speculation on why Zoom made this choice takes you in a multitude of different directions. Kevin Xu wrote a great article detailing three separate reasons why Zoom might find OCI beneficial. Those reasons all make a lot of sense.
Yet, in my opinion, the best explanation is one metric mentioned by Corey Quinn: outbound network bandwidth costs. Oracle’s bandwidth costs are roughly one-tenth of AWS’s, and Zoom’s stronger negotiating position with Oracle than with AWS may mean an even deeper discount on that price. The subjective feelings developers may have about Oracle could boil down to this: “Using OCI costs more developer man-hours than using AWS.” While this may be true to some degree, I doubt it outweighs millions of dollars in savings per month that could make or break a company’s ability to pay wages and rent while remaining profitable.
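To make the scale of that difference concrete, here is a back-of-the-envelope sketch. The traffic volume and per-gigabyte prices are made-up placeholders, not quoted rates from any vendor; only the tenfold ratio comes from the claim above. It also ignores free tiers, volume discounts, and negotiated pricing.

```python
def monthly_egress_cost(terabytes_out: float, price_per_gb: float) -> float:
    """Monthly bill for outbound traffic at a flat per-GB price."""
    gigabytes_out = terabytes_out * 1000  # decimal terabytes to gigabytes
    return gigabytes_out * price_per_gb


# Hypothetical inputs, for illustration only (not real list prices).
TRAFFIC_TB = 50_000        # assumed outbound traffic per month
PRICE_A = 0.05             # assumed $/GB at the pricier vendor
PRICE_B = PRICE_A / 10     # the tenfold difference described above

cost_a = monthly_egress_cost(TRAFFIC_TB, PRICE_A)
cost_b = monthly_egress_cost(TRAFFIC_TB, PRICE_B)
savings = cost_a - cost_b

print(f"Vendor A: ${cost_a:,.0f} per month")
print(f"Vendor B: ${cost_b:,.0f} per month")
print(f"Savings:  ${savings:,.0f} per month")
```

With these assumed numbers the gap lands in the millions per month, which is the order of magnitude the argument depends on. Plug in your own traffic and quoted prices to see where you stand.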
Zoom’s ability to get good deals rests on its ability to consider good deals. The more vendors you can choose from, the more negotiating leverage you have, since you can simply walk away from a deal you don’t like and choose a different vendor. Since every cloud vendor offers a different value proposition, including different cloud services, building a solution deployable on multiple different clouds involves thoughtful, extensive preparation at the system design level. That’s why it’s so important to be mindful of what cloud services you use when building your tech stack.
Using cloud services can present tricky tradeoffs. Managed offerings can save precious labor-hours during product discovery and pivots. Unfortunately, they may also present black boxes that affect not just how you get billed but also what choices you have about future billing, among other concerns. Take a look at this conversation around Google Kubernetes Engine (GKE) pricing on GCP as an example, in which one user laments the impact of pricing on the optimal system architecture.
I’ve found several ground rules helpful. First, the lower on the stack you design for, the higher the commonality between vendor offerings. Many more companies target a specific computer architecture than target an application-specific API because, at some point, everybody has to use the same primitives. By way of example, if you’re happy eating any kind of red meat, you can buy your groceries at pretty much any supermarket. If you must have imported wagyu for dinner, however, your options will be much more limited. Second, since everyone has to build services from primitives, the earlier a vendor released a service, the closer it tends to stick to the underlying primitives available to everyone. As a result, there is a higher probability that other cloud vendors have built an equivalent product.
One final exercise is asking what the on-premises alternative to a given solution might be. In my experience, the cloud-based stack that best preserves optionality involves managing your own network layer, VMs, and object stores, which on AWS maps to VPC + EC2 + S3. Here’s what I’ve found valuable to know about each of these services.
My impression is that the big benefit of managing your own VPC is that you get to control your own security in a way that’s replicable outside of any one cloud provider. As a first-order approximation, you can set up a service in a private subnet whose route table sends outbound traffic through a NAT gateway. Because the subnet has no route to an internet gateway and its instances have no public IP addresses, no connection initiated from the internet can reach them, while the instances themselves can still open outbound connections. This type of setup is pretty powerful, especially for deploying internal services that don’t need to listen to the internet but still have to fetch updates from remote servers. Since you never want to roll your own security, setting up a VPC could be a useful alternative to paid-for and vendor-specific security products, though application-level security logic and choices may differ.
A VPC is essentially a virtualized packet switch, so the open-source alternative might be something like Open vSwitch, which originated at Nicira (now part of VMware). Cloud-based virtual network services are extremely common and highly standardized. In an industry rife with monikers, AWS and GCP both call their network virtualization service “VPC,” while Azure calls its equivalent a Virtual Network (VNet).
I manage my VPCs on AWS using AWS CloudFormation, and I found this article very enlightening in terms of beginner-level execution.
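As a rough illustration of the private-subnet setup described above, here is a minimal sketch that builds a CloudFormation template as a Python dictionary and prints it as JSON. The logical resource names and CIDR blocks are my own assumptions, the template has not been deployed or validated against a live account, and a real stack would also need availability zones, security groups, and tags.

```python
import json


def ref(name):
    """CloudFormation intrinsic pointing at another resource in the template."""
    return {"Ref": name}


def subnet(cidr):
    return {"Type": "AWS::EC2::Subnet",
            "Properties": {"VpcId": ref("Vpc"), "CidrBlock": cidr}}


def route_table():
    return {"Type": "AWS::EC2::RouteTable", "Properties": {"VpcId": ref("Vpc")}}


def default_route(table, target_key, target):
    """Send all non-local traffic (0.0.0.0/0) from `table` to `target`."""
    return {"Type": "AWS::EC2::Route",
            "Properties": {"RouteTableId": ref(table),
                           "DestinationCidrBlock": "0.0.0.0/0",
                           target_key: ref(target)}}


def association(subnet_name, table):
    return {"Type": "AWS::EC2::SubnetRouteTableAssociation",
            "Properties": {"SubnetId": ref(subnet_name), "RouteTableId": ref(table)}}


def build_template():
    public_route = default_route("PublicRouteTable", "GatewayId", "InternetGateway")
    public_route["DependsOn"] = "GatewayAttachment"  # gateway must be attached first
    resources = {
        "Vpc": {"Type": "AWS::EC2::VPC", "Properties": {"CidrBlock": "10.0.0.0/16"}},
        "InternetGateway": {"Type": "AWS::EC2::InternetGateway"},
        "GatewayAttachment": {
            "Type": "AWS::EC2::VPCGatewayAttachment",
            "Properties": {"VpcId": ref("Vpc"),
                           "InternetGatewayId": ref("InternetGateway")}},
        # The public subnet exists only to host the NAT gateway.
        "PublicSubnet": subnet("10.0.0.0/24"),
        "PublicRouteTable": route_table(),
        "PublicDefaultRoute": public_route,
        "PublicAssociation": association("PublicSubnet", "PublicRouteTable"),
        "NatAddress": {"Type": "AWS::EC2::EIP", "Properties": {"Domain": "vpc"}},
        "NatGateway": {
            "Type": "AWS::EC2::NatGateway",
            "Properties": {
                "AllocationId": {"Fn::GetAtt": ["NatAddress", "AllocationId"]},
                "SubnetId": ref("PublicSubnet")}},
        # The private subnet's only way out is the NAT gateway; nothing routes in.
        "PrivateSubnet": subnet("10.0.1.0/24"),
        "PrivateRouteTable": route_table(),
        "PrivateDefaultRoute": default_route("PrivateRouteTable", "NatGatewayId", "NatGateway"),
        "PrivateAssociation": association("PrivateSubnet", "PrivateRouteTable"),
    }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


template = build_template()
print(json.dumps(template, indent=2))
```

The part that matters for security is the private route table: its default route targets the NAT gateway rather than the internet gateway, so internal services can fetch updates without being reachable from outside.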
The VM is still the key to cloud because it not only decouples the number of compute instances you can run from the number of bare-metal machines underneath, it also lets you run any kind of software on the underlying hardware. This massively increases your optionality, because if you define your system design at the EC2 level, all you need to do to move to a different platform is update your build pipelines to target that platform. You can target enterprise customers integrating with Windows Server 2003 for large enterprise deals, or you can get the latest kernel patches from upstream Linux for the best OS-level security guarantees.
The key to virtual machines is the hypervisor, which is the virtualization layer that sits between a VM and the host’s hardware or operating system. The hypervisor virtualizes the underlying hardware devices that the guest’s drivers talk to. If you need specialized devices, like a GPU or FPGA for video transcoding or AI/ML, looking into hypervisors may be a good bet for finding hard-won performance or reliability gains and avoiding costly managed AI/ML solutions. There are many different hypervisors, and some of the best are open-source. KVM is one commonly used example: it is built into the Linux kernel itself and enables the kernel to act as a hypervisor. I personally don’t see much danger of vendor lock-in at the hypervisor level. In fact, AWS recently moved a portion of its EC2 service to a KVM-based hypervisor.
One tool I’ve regretted not using more during my DevOps journey is Packer, which was described in Docker on AWS by Justin Menga. Packer defines “Amazon Machine Images,” or AMIs, using data like JSON template files, with steps that run at different points during the image build (pre- and post-processing). You can pre-define an AMI and ship an image just as you can an ISO file, publish it to a registry, and pull it down at EC2 creation time. Knowing you don’t even need internet access to ensure your VMs are up-to-date will save you a lot of headaches. By contrast, I just define AMI configurations directly within CloudFormation, which currently doesn’t allow direct relative configuration file imports and can only check AMI status at runtime while bundled with other cloud resources.
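For a sense of what “an AMI defined as data” looks like, here is a minimal sketch that emits a Packer template in the JSON format using the `amazon-ebs` builder and a `shell` provisioner. The region, source AMI ID, SSH username, instance type, image name, and shell commands are all placeholder assumptions. Newer Packer releases favor HCL templates over JSON, so treat this as a sketch of the idea rather than a recommended setup.

```python
import json


def packer_template(region: str, source_ami: str, ssh_username: str) -> dict:
    """Build a Packer JSON template that bakes OS updates into a new AMI."""
    return {
        "builders": [{
            "type": "amazon-ebs",              # build an EBS-backed AMI
            "region": region,
            "source_ami": source_ami,          # base image to start from
            "instance_type": "t3.micro",       # assumed build instance size
            "ssh_username": ssh_username,
            # {{timestamp}} is a Packer template function; it keeps names unique.
            "ami_name": "baked-app-{{timestamp}}",
        }],
        "provisioners": [{
            "type": "shell",
            # Runs once at build time, so instances launched from the image
            # are already patched without reaching out at boot.
            "inline": [
                "sudo apt-get update",
                "sudo apt-get -y upgrade",
            ],
        }],
    }


# "ami-REPLACE_ME" is a placeholder, not a real image ID.
template = packer_template("us-east-1", "ami-REPLACE_ME", "ubuntu")
print(json.dumps(template, indent=2))
```

Saved to a file, a template like this would be passed to `packer build`. Because the updates are baked in during the build, an instance launched from the image does not have to download them at boot.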
S3 is the most “cloudy” of this trio. It’s a vendor-specific managed offering with no clear on-premises alternative. But there’s a good reason why you should still adopt it, or something like it, as part of your stack. First, UNIX-based systems treat files differently from processes. Second, files in general grant more flexibility than live processes do; you can imagine replicating a file is less fraught than replicating a process, with its state requirements. Lastly, and perhaps most importantly, resilient data remains critical to a company’s ability to remain in business, and it’s worth paying a premium for resiliency guarantees. Engineering this guarantee comes with different design requirements and much stricter standards for metrics like durability and availability. The amount of work it takes to achieve those metrics isn’t feasible for your average SMB. I know that I can’t do it by myself.
The good news is that there’s no shortage of object storage services available, which means prices have to remain competitive. S3’s prices are some of the lowest out of all AWS services, not because it’s a loss leader, but because it’s built efficiently and because the supply of vendors forces down market prices.
Even better, as an HTTP-based service, S3 is only the data origin of what can become a larger pipeline. I personally use S3 as the retrieval source for CDNs like AWS CloudFront, serving any number of front-end web clients. Not only are there many different CDN offerings on the market, but serving files through a CDN is also far cheaper than running a dedicated server to host them, since each CDN multiplexes many requests from many different users, which amortizes server costs. You get the best of both worlds: cloud-scale availability, and the ability to commoditize your complements.
I’m not sure if there’s really a viable open-source alternative to S3. The closest solution I know of might be OpenStack Swift, but if that’s even 1 percent of the work AWS puts into S3, I’d be very surprised.
All that being said, I think it’s important to mention some services I would avoid or replace when possible. I personally prefer servers over serverless, because I prefer to know clearly what my bill will look like ahead of time instead of being billed per function run, and because I maintain the environment where my stuff runs. For more complicated workflows, like deploying your own database, serverless isn’t yet an option I can consider, since those workflows involve stateful tasks like provisioning storage. There’s a place for serverless; I use it for static site authentication, and I like how AWS Secrets Manager uses Lambda functions to automatically rotate RDS credentials. For the most part, though, servers beat serverless when it comes to having more options. Consider deploying Firecracker for a self-managed functions-as-a-service offering.
One surprising concern that I had was running containers in the cloud. I like containers because they bundle application code with a complete runtime, which is really important for things like consistently shipping projects that depend on underlying cross-language bindings. From my ops experience, though, containers did add some degree of indirection. As somebody who likes the security of no code at all, I also see containers as another layer of abstraction on top of VMs where things can go wrong. Finally, containers may not have the same level of driver support that VMs do, and they present a different set of tradeoffs in areas like application security. I use containers for convenience’s sake, but I recognize that they may reduce my options.
Lastly, I think the debate between infrastructure-as-code platforms like Terraform and CloudFormation is a bit overblown. I personally use CloudFormation because I’m all-in on AWS at the moment, it has a nice GUI for managing stacks in production, and it integrates well with AWS developer support. In the end, though, a template is just data, and open data is far easier to manage than any specific cloud service. The two also aren’t mutually exclusive; if need be, Terraform can deploy CloudFormation stacks.
Most software engineers serve the needs of others, and the more flexibility we can bring to the discussion table, the more alignment we can hope to see. My experience with DevOps has taught me system design is a critical part of gaining that flexibility. Sticking with well-understood services instead of going with an easier option results in more upfront pain but can help realize possibilities that last. I like to think I’m somebody who places a premium on the values of independence and hard work. To that end, I find gaining this optionality very much worth looking into, and I don’t think I’m alone in that regard.