I help lead platform scaling as a web architect at Nearpod, an interactive instructional platform. During this past back-to-school period, we saw unprecedented growth. As the world urgently sought ways to bridge the virtual learning divide, teachers and students flocked to us, and we had to respond quickly to the sudden demand.
Like many web applications, we use Redis, the well-known in-memory datastore. We use it fairly extensively, mostly in the form of a sizable Redis Cluster that manages the millions of virtual classroom sessions that take place throughout the day. Our Redis Cluster runs within ElastiCache, the managed service from Amazon Web Services.
In case you are unfamiliar: when an application connects to Redis Cluster, it first needs to locate a working node in the cluster. From that node, it requests the slot mapping, which describes which hash slots (and therefore which keys) are served by which nodes across the cluster. What’s important to note is that the call to retrieve this mapping gets slower as the cluster gets larger.
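To make the slot mapping concrete, here is a minimal Python sketch of how a cluster-aware client decides which node owns a key. Redis Cluster hashes each key with CRC16 and takes the result modulo 16384 to get a hash slot; the client then looks that slot up in the mapping it fetched. The three-node `SLOT_MAP` and its addresses are made up for illustration, and the function names are my own, not part of any client library.

```python
import binascii

TOTAL_SLOTS = 16384  # Redis Cluster always divides the keyspace into 16384 hash slots


def key_slot(key: str) -> int:
    """Return the hash slot for a key: CRC16 (XMODEM variant) modulo 16384."""
    data = key.encode()
    # Hash tags: if the key contains a non-empty {...} section, only that part is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16 variant Redis uses.
    return binascii.crc_hqx(data, 0) % TOTAL_SLOTS


# Hypothetical slot map in the spirit of a CLUSTER SLOTS reply:
# (first slot, last slot, node address). Addresses are invented.
SLOT_MAP = [
    (0, 5460, "10.0.0.1:6379"),
    (5461, 10922, "10.0.0.2:6379"),
    (10923, 16383, "10.0.0.3:6379"),
]


def node_for_key(key: str) -> str:
    """Find the node whose slot range covers the key's hash slot."""
    slot = key_slot(key)
    for first, last, node in SLOT_MAP:
        if first <= slot <= last:
            return node
    raise LookupError(f"no node covers slot {slot}")
```

With dozens of shards, the real mapping has one entry per slot range plus replica details, which is why fetching it repeatedly is not free.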
Faced with all this pandemic-fueled growth, we immediately found ourselves scaling up our Redis Cluster to match the load of millions of new students joining classes every day. Before long, we had scaled all the way up to the default limit on the number of nodes that ElastiCache allows.
Scaling Up to Meet Growth
As student activity continued to grow beyond anyone’s wildest expectations, I gathered the team to discuss how we could stay ahead of the game. Out of an abundance of caution, we opened discussions with Amazon to increase the maximum number of nodes they’d allow in a cluster. While Amazon agreed, it took several tries for them to determine how to fully enable this, which told us what we already suspected: We were running one of the largest Redis Clusters, if not the largest, in ElastiCache and AWS.
After the limit increase was applied by AWS, we noticed something very peculiar. The node hosting shard 46 was experiencing significantly less load than the first 45, despite serving the same amount of data as the others.
Puzzled as to why this would be, I drilled into the problem, poring over the performance data provided by Redis Cluster. Ultimately, it came down to two issues:
- One very detailed (and potentially costly) nuance specific to the open-source code we use to access Redis.
- And one bug in how ElastiCache handles clusters beyond the default limit, which helped expose the first issue.
Identifying the Problem
As we investigated, the first thing I did was look to see if there was a difference in the workload that was hitting the new nodes. The key and slot metrics looked identical to the other nodes. We ran the classic Redis MONITOR and INFO commands, but we found no indication as to what was occurring through the usual tools.
Someone on the team asked: “Is there anything like the classic Linux top command for Redis?”
This was the start of an aha moment. Redis does indeed have some additional internal CPU metrics that aren’t exposed by default. Those metrics are accessible with Redis’ INFO COMMANDSTATS command, which shows how many calls have been made to each type of Redis command, along with the total and average times.
When I ran this command, shock set in. I could suddenly see that, across the first 45 nodes, the cluster slot commands accounted for an absolutely astounding 87 percent of the CPU time, versus virtually zero on the new nodes.
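If you want to run the same kind of check, here is a rough Python sketch that parses INFO COMMANDSTATS output and reports each command’s share of total execution time. The sample text and its numbers are invented for illustration and are not our production data. Depending on your Redis version, cluster commands may appear under a single `cmdstat_cluster` line or, in newer versions as far as I know, per subcommand; treat the exact names as something to confirm on your own nodes.

```python
def parse_commandstats(info_text: str) -> dict[str, dict[str, float]]:
    """Turn 'cmdstat_<name>:k=v,k=v,...' lines into {name: {k: v}}."""
    stats: dict[str, dict[str, float]] = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line.startswith("cmdstat_"):
            continue  # skip the section header and blank lines
        name, _, fields = line.partition(":")
        stats[name[len("cmdstat_"):]] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields.split(","))
        }
    return stats


def time_share(stats: dict[str, dict[str, float]]) -> dict[str, float]:
    """Return each command's percentage of the total microseconds spent."""
    total = sum(entry["usec"] for entry in stats.values())
    if total == 0:
        return {command: 0.0 for command in stats}
    return {command: 100.0 * entry["usec"] / total for command, entry in stats.items()}


# Invented sample output, shaped like an INFO COMMANDSTATS reply.
SAMPLE = """\
# Commandstats
cmdstat_get:calls=3000,usec=15000,usec_per_call=5.00
cmdstat_set:calls=1000,usec=25000,usec_per_call=25.00
cmdstat_cluster:calls=100,usec=60000,usec_per_call=600.00
"""

if __name__ == "__main__":
    shares = time_share(parse_commandstats(SAMPLE))
    for command, percent in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{command:10s} {percent:5.1f}%")
```

Running this per node and comparing the results side by side is what makes an imbalance like ours stand out.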
Finding a Root Cause
I knew we had two questions to answer next:
- Why are our nodes spending more time on the cluster commands than on doing actual application work?
- Why weren’t shards 46 and beyond doing the same?
After a closer look, the answer to the first question turned out to be quite simple, but hidden deep within the nuance of the open-source client we use to interact with Redis Cluster.
While we intentionally use persistent connections to Redis, the default behavior of the PHP client, PhpRedis, is that it doesn’t actually keep a copy of the slot mapping between different instances of the client. That meant that, every time we created a new client, it would also ask for all the cluster slots again. The solution is a configuration setting to keep a local copy of the slots that persists across connections, alleviating the need for all the extra calls.
With that enabled, the CPU usage on our Redis nodes dropped almost by half. And, thanks to eliminating the wait for the additional command, we saw a welcome decrease in our application response times.
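In PhpRedis, the setting in question is, to the best of my knowledge, the `redis.clusters.cache_slots` ini option; check the documentation for your client version. The idea behind it can be sketched in a few lines of Python: keep the slot map in a cache that outlives any single client instance, so only the first client pays for the lookup. Everything here (`load_slot_map`, `count_fetches`, the TTL, the fake fetch function) is hypothetical and only illustrates the technique, not how PhpRedis implements it.

```python
import time
from typing import Callable

SlotMap = list[tuple[int, int, str]]

# Process-wide cache: seed address -> (time fetched, slot map).
_slot_cache: dict[str, tuple[float, SlotMap]] = {}


def load_slot_map(
    seed: str,
    fetch_slots: Callable[[str], SlotMap],
    use_cache: bool = True,
    ttl_seconds: float = 60.0,
) -> SlotMap:
    """Return the slot map, asking the cluster only when no fresh cached copy exists."""
    now = time.monotonic()
    if use_cache:
        cached = _slot_cache.get(seed)
        if cached is not None and now - cached[0] < ttl_seconds:
            return cached[1]
    slot_map = fetch_slots(seed)  # stands in for the expensive slot lookup
    if use_cache:
        _slot_cache[seed] = (now, slot_map)
    return slot_map


def count_fetches(n_clients: int, use_cache: bool) -> int:
    """Simulate creating n_clients clients and count how many slot lookups occur."""
    _slot_cache.clear()
    calls = 0

    def fake_fetch(seed: str) -> SlotMap:
        nonlocal calls
        calls += 1
        return [(0, 16383, seed)]  # made-up single-node map

    for _ in range(n_clients):
        load_slot_map("10.0.0.1:6379", fake_fetch, use_cache=use_cache)
    return calls
```

In this simulation, `count_fetches(1000, use_cache=False)` performs one lookup per client, while `count_fetches(1000, use_cache=True)` performs a single lookup. A production cache would also need to refresh itself when the cluster topology changes, which this sketch leaves out.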
The second question required digging deeper into the intricacies of our cloud infrastructure.
There are many ways that you can do service discovery to locate a working node in order to bootstrap your client. ElastiCache’s approach is to attempt to return all nodes in a single DNS response, as one very long list of A records.
The problem with this approach is that there is a limit to how big a DNS reply can or should be, and eventually the list gets cut off. As you might have guessed, in our case, this meant nodes 46 and above never appeared in the list, so clients never picked them as their initial connection point, which in turn meant they never received Redis Cluster slot commands. Ultimately, this caused the significant difference in load experienced by those nodes.
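To get a feel for why a long list of A records runs out of room, here is a back-of-the-envelope Python sketch. It assumes the classic 512-byte limit for DNS over UDP and that every answer uses name compression, which makes each A record 16 bytes. Those are my assumptions for illustration; the limit that actually applied in our case may differ (larger payloads are possible with EDNS), so this is not an attempt to reproduce the exact cutoff we observed.

```python
DNS_HEADER_BYTES = 12
# Compressed A record: 2 (name pointer) + 2 (type) + 2 (class)
# + 4 (TTL) + 2 (data length) + 4 (IPv4 address).
A_RECORD_BYTES = 16


def encoded_name_length(hostname: str) -> int:
    """Bytes needed for a name on the wire: a length byte per label plus a final zero byte."""
    labels = hostname.rstrip(".").split(".")
    return sum(1 + len(label) for label in labels) + 1


def max_a_records(hostname: str, payload_limit: int = 512) -> int:
    """Estimate how many A records fit in one reply of payload_limit bytes."""
    question_bytes = encoded_name_length(hostname) + 4  # name + query type + query class
    remaining = payload_limit - DNS_HEADER_BYTES - question_bytes
    return max(remaining // A_RECORD_BYTES, 0)


if __name__ == "__main__":
    # "example.com" is a placeholder; substitute your own cluster endpoint.
    for limit in (512, 1232, 4096):
        print(limit, max_a_records("example.com", limit))
```

Whatever the real limit is in your environment, the takeaway is the same: a discovery scheme that lists every node in one reply has a ceiling, and nodes past that ceiling become invisible to clients bootstrapping through DNS.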
For us, solving the first part of the problem drastically reduced the CPU time stolen by the costly and frequent cluster slot commands. That meant we could aggressively begin reducing our cluster size, which makes the node discovery issue moot, at least for now. Still, understanding the second pitfall, which lies ahead as we continue to scale, certainly makes us feel more comfortable.
How to Avoid Cluster Roadblocks
If you are running Redis Cluster at scale, here is my advice for avoiding the challenges we faced while pushing our cluster limits:
- Ensure your Redis client — particularly if you are using PHP — is caching the mapping of slots within your cluster. Otherwise, the hidden performance and CPU penalties you could pay are daunting.
- If you are using ElastiCache for your Redis Cluster, which is very likely if you run your applications on AWS, then be aware of how it handles service discovery, and make sure you aren’t subject to its limits.
Today, my team and I are continuing to scale the platform and serve an ever-increasing number of teachers and students, while managing our Redis Cluster with much more ease now that these hurdles are behind us. It’s my hope that, equipped with the knowledge above, you too will be able to push the limits of scaling Redis Cluster, while avoiding the pitfalls we found along the way.