As technology continues to evolve, users have less and less patience for products that perform poorly. So, to deliver the best possible product, engineers need to follow sound staging environment practices.
Any team that uses a staging environment benefits from a set of best practices. But what works best? For three companies, the approaches differ.
Engineers at luxury retailer Nordstrom follow a continuous delivery system that allows them to test and deliver multiple configurations in development at once, Director of Engineering Josh Maletz said. At hearing care company Hear.com, engineers use isolation to test and break lower environments and get the most utility out of them, DevOps Engineer Jack Cusick said. At healthtech company GRAIL, staging and production environments are treated equally, and, therefore, the methods for deployment are the same, said Zach Pallin, a senior DevOps engineer.
“If your staging environment has a unique or more informal release method, your team will be unprepared for issues in automation and operations when releasing to production,” Pallin said.
Each approach has its benefits when it comes to developing staging environments. Built In chatted with three engineering leaders to better understand the staging environment best practices their teams implement.
Staging Environment Best Practices
- Remember to use staging in the first place; it's often forgotten
- Production and staging deployments should match—what works for staging works for production
- Continuous delivery is your friend
- Like anything else, communicate with your team
GRAIL
Zach Pallin
SENIOR DEVOPS ENGINEER
What’s a critical best practice your team follows when developing staging environments?
The way a team deploys to staging and production must match.
Everyone knows that staging should match production — that the data, third-party software versions and any related infrastructure should all be roughly the same. Unfortunately, I’ve noticed that it’s easy to forget that deployment is part of that equation too. And if your staging environment has a unique or more informal release method, your team will be unprepared for issues in automation and operations when releasing to production.
At GRAIL, our DevOps team is currently working on the release of Galleri. For this project, all of our containers are deployed to Kubernetes by our Buildkite job runner, with Helm powering the deployment. All environments are deployed in the same way, and our software engineers use the same procedures when deploying. If the pipeline works for staging, we know it will work for production.
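The idea of a single code path per deploy is easy to sketch. Below is a minimal illustration, not GRAIL's actual pipeline: a CI step that runs the identical Helm command for every environment, with hypothetical release, chart and values-file names.

```python
# Minimal sketch of an environment-parameterized deploy step; release name,
# chart path and values files are hypothetical, not GRAIL's actual pipeline.
import subprocess
import sys

def deploy(environment: str) -> None:
    """Run the identical Helm command for every environment; only the
    values file and namespace differ."""
    subprocess.run(
        [
            "helm", "upgrade", "--install",
            "my-service",                        # hypothetical release name
            "./chart",                           # hypothetical chart path
            "--namespace", environment,
            "-f", f"values-{environment}.yaml",  # per-env overrides only
            "--wait",                            # block until rollout succeeds
        ],
        check=True,  # surface a failed deploy to the CI job runner
    )

if __name__ == "__main__":
    # A CI step (e.g. in Buildkite) passes "staging" or "production";
    # the code path is identical either way.
    deploy(sys.argv[1])
```

If this exact script works for staging, the only thing left to trust for production is a values file.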
Alerts and graphs need to be tested and demonstrated to work — what better place to do it than an environment that is intended to match production?”
What processes does your team have in place for monitoring and maintaining the staging environment?
Prometheus and Grafana are now the accepted standard in the Kubernetes world and that’s what we use, too. Our services all export Prometheus metrics, which are then gathered and read into Grafana Cloud. From there, teams are responsible for setting up graphs and alerts for all clusters, including staging and production. Software and DevOps engineers collaborate to create monitoring and alerts that serve our needs to maintain high availability for our services.
Without observability, it's impossible to guarantee the stability of our critical infrastructure, and that naturally extends to our staging environment. If staging is down, it not only impacts our ability to deploy but can also cost our company time and effort to troubleshoot. Furthermore, the alerts and graphs themselves must be tested and demonstrated to work — what better place to do it than an environment that is intended to match production?
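As a small sketch of the pattern Pallin describes, a service can expose metrics for Prometheus to scrape; the metric names, port and workload below are hypothetical, and the example assumes the open-source prometheus_client package rather than GRAIL's actual services.

```python
# Sketch of a service exporting Prometheus metrics; names and port are
# illustrative. Requires: pip install prometheus-client
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

@LATENCY.time()  # records how long each call takes
def handle_request() -> None:
    REQUESTS.inc()
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus to scrape
    while True:
        handle_request()
```

Prometheus scrapes the /metrics endpoint the same way in staging and production, so the same Grafana graphs and alert rules can be pointed at both.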
What’s a common mistake engineering teams make when it comes to staging environments?
I would like to believe that most engineering teams have resolved some of the yesteryear complications, such as data not matching production or out-of-date third-party software versions. However, a lesser-acknowledged issue is that most engineering teams forget to actually use their staging environment.
It's easy to forget that staging is there! Most developers spend their time in development environments or troubleshooting production. Often it's just the managers, QA and operations teams who really look at staging, since it's most often used to validate software versions for production deployment. Engineers need to hop in and stay with their services throughout. If you have access to monitoring dashboards for your clusters, make sure you stay up to date with the state of your software in the staging environment, too. Check your logs after you deploy and don't wait for people to come hounding you for answers when something breaks!
Hear.com
Jack Cusick
DEVOPS ENGINEER
The engineers at Hear.com, which provides customers with hearing care, use an isolation method when developing staging environments. One of the results? “Isolation gives our engineers the confidence that any outages or stress testing occurring on staging won’t affect our end customers,” DevOps Engineer Jack Cusick said.
What’s a critical best practice your team follows when developing staging environments?
Isolation is a best practice that our team follows for our staging environment. We use different Kubernetes clusters and AWS accounts for our staging and production environments. There are myriad benefits of this practice, but one worth highlighting is a reduced blast radius. Engineers need to feel comfortable testing and breaking lower environments to get the most utility out of them. Isolation gives our engineers the confidence that any outages or stress testing occurring on staging won’t affect our end customers.
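One way to picture the reduced blast radius: when credentials themselves are scoped per account, a script pointed at staging physically cannot touch production. Here is a minimal sketch assuming hypothetical AWS CLI profiles named for each environment; it is illustrative, not Hear.com's actual setup.

```python
# Sketch of account-level isolation via per-environment AWS profiles.
# Profile names are hypothetical. Requires: pip install boto3
import boto3

def session_for(environment: str) -> boto3.Session:
    """Return credentials scoped to one environment's AWS account, so a
    staging script cannot reach production resources even by accident."""
    if environment not in ("staging", "production"):
        raise ValueError(f"unknown environment: {environment}")
    return boto3.Session(profile_name=environment)

# e.g. stress-test cleanup that can only ever see the staging account
staging_s3 = session_for("staging").client("s3")
```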
What processes does your team have in place for monitoring and maintaining the staging environment?
Our team monitors our staging environment with an elegant combination of Kubernetes, New Relic and Slack. However, the tools themselves aren’t terribly important. What’s most important is that we monitor staging in the same way that we monitor production. Our monitoring, alerting and (on our best days) our gusto for troubleshooting staging outages is identical to production. This ensures we catch as many bugs as possible before production.
What’s most important is that we monitor staging in the same way that we monitor production.”
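As a rough illustration of that principle (not Hear.com's actual tooling), an alerting helper can treat every environment identically, with only a label distinguishing them; the webhook URL and message below are placeholders.

```python
# Sketch of environment-agnostic alerting via a Slack incoming webhook.
# The webhook URL is a placeholder.
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def alert(environment: str, message: str) -> None:
    """Fire the same alert path for staging and production; only the
    environment label differs, so staging outages get equal visibility."""
    payload = {"text": f"[{environment}] {message}"}
    request = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

alert("staging", "checkout-service error rate above threshold")  # example
```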
What’s a common mistake engineering teams make when it comes to staging environments?
A common mistake teams make is incorporating their staging environment too late and too briefly in the release cycle. Oftentimes, code rolls out late in the sprint under tight deadlines. This can lead to corners being cut during QA, smoke testing and so on. In these situations, a staging environment is not used to its fullest and may even do more harm than good. If you're rolling quickly from testing to production, it may be time to revisit why you have staging and what you use it for. If necessary, look into development processes that pair well with light staging environments, like continuous deployment.
Nordstrom
Josh Maletz
DIRECTOR OF ENGINEERING
For Nordstrom, a luxury fashion retailer that has more than 100 stores and a sizeable e-commerce footprint, there are a lot of systems that need support. As a result, the engineering team at the company follows a continuous delivery system that allows them to test multiple types of configurations in development at once, Director of Engineering Josh Maletz said.
What’s a critical best practice your team follows when developing staging environments?
Due to the variety of systems that we support, we stick to the basics and believe that our most critical practice in developing staging environments is continuous delivery. We support the credit services organization at Nordstrom, which means a wide range of systems. We have staging environments for 20-year-old monolithic applications with hard-wired dependencies on vendor test environments. We also have serverless applications with infrastructure as code deployed to the cloud, where we employ robust service virtualization techniques to exercise our systems against simulated erratic behavior.
When battling these types of dependencies, we need to test in isolation while also running deep integration tests. We need flexible staging environments to support various configurations and ways to simulate vendor behavior. The practices supporting continuous delivery — deployment pipelines, automated testing, continuous integration, etc. — provide the optionality we need to build our staging environments for our different types of applications. Having our teams adopt the mindset of continuous delivery — focusing on getting fast feedback on new changes to the systems — has given us clarity to focus on learning how those changes affect our environments.
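As a hedged sketch of the service virtualization Maletz mentions (not Nordstrom's actual tooling), a tiny stub can stand in for a vendor endpoint and deliberately inject latency and failures; the port and payload here are hypothetical.

```python
# Sketch of a service-virtualization stub: a fake vendor endpoint that
# injects random latency and errors for integration testing.
import json
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class ErraticVendorStub(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(random.uniform(0.0, 2.0))   # random latency up to 2s
        if random.random() < 0.2:              # 20% of calls fail outright
            self.send_response(500)
            self.end_headers()
            return
        body = json.dumps({"status": "approved"}).encode()  # fake payload
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Point the system under test at localhost:8080 instead of the vendor.
    HTTPServer(("localhost", 8080), ErraticVendorStub).serve_forever()
```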
What processes does your team have in place for monitoring and maintaining the staging environment?
We use the same logging, monitoring and telemetry tools in our staging environment that we use in production. This allows us to track how new changes affect the performance of the systems, both in isolation and as part of larger workflows. We can see if we just introduced a new bottleneck, whether the validity of the system degrades and whether systems halt properly when dependencies are unavailable, and we can ensure the proper alerting occurs when needed. We use our staging environments for learning and getting fast feedback on our latest changes, which includes knowing that our integrations and our health-monitoring systems are working, too. There is an extra cost associated, but the peace of mind is well worth it: you know your systems will perform for your customers.
Having our teams adopt the mindset of continuous delivery has given us clarity to focus on learning how those changes affect our environments.”
What’s a common mistake engineering teams make when it comes to staging environments?
Our staging environments tend to be shared with other teams: business, product, analysts, compliance and so on. These teams may be using the staging environments to evaluate the latest changes or may be getting a demo from the team. We've had issues where we built feature toggles to support the release of a feature, or had methods for supporting zero-downtime deployments, but didn't use them in staging. This is a big mistake.
When pushing changes this fast, we need to ensure the same tools we use to avoid disrupting the customer experience in production are also employed in staging, where we have possible "customers" using the system. Not doing so only creates confusion and undermines our partners' confidence in our systems. If you need to test changes in isolation, that is achievable using per-developer sandboxes or other methods. If you can get to a place where you can build a brand-new environment on demand for a single developer, you can build a new environment on demand for anyone, including for staging. As with all things, communication is key, and the tools we already have for moving fast in production should be used for staging as well.
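As a minimal illustration of using the same toggles everywhere (flag names and the config source are hypothetical), the point is that staging exercises the exact toggle mechanics production will rely on.

```python
# Sketch of a feature toggle evaluated identically in every environment.
# Flag names and the in-memory config are hypothetical.
FLAGS = {
    "new-checkout-flow": {"staging": True, "production": False},
}

def is_enabled(flag: str, environment: str) -> bool:
    """One evaluation path for all environments; unknown flags are off."""
    return FLAGS.get(flag, {}).get(environment, False)

if is_enabled("new-checkout-flow", "staging"):
    ...  # render the new flow for staging "customers" too
```

Exercising the toggle in staging surfaces a misconfigured flag before any real customer sees it.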