
Does your staging environment mirror your production environment? If not, why not?
At healthtech company GRAIL, staging and production environments are treated equally, so the deployment methods are the same, said Zach Pallin, a senior DevOps engineer.
“If your staging environment has a unique or more informal release method, your team will be unprepared for issues in automation and operations when releasing to production,” Pallin said.
Built In SF chatted with Pallin to better understand the staging environment best practices he and his team follow.
What’s a critical best practice your team follows when developing staging environments?
The way teams deploy to staging and production must match.
Everyone knows that staging should match production — that the data, third-party software versions and any related infrastructure should all be roughly the same. Unfortunately, I’ve noticed that it’s easy to forget that deployment is part of that equation too. And if your staging environment has a unique or more informal release method, your team will be unprepared for issues in automation and operations when releasing to production.
Our DevOps team is currently working on the release of Galleri. For this project, all of our containers are deployed to Kubernetes by our Buildkite job runner, with Helm managing the releases. All environments are deployed in the same way, and our software engineers use the same procedures when deploying. If the pipeline works for staging, we know it will work for production.
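As a rough sketch of what a single deploy path can look like, the wrapper below drives helm upgrade with the environment as its only parameter. The chart path, release name and values-file names are illustrative placeholders, not details of GRAIL's actual Buildkite pipeline.

    # Minimal sketch, assuming a Helm chart in ./chart and per-environment
    # values files (values-staging.yaml, values-production.yaml). The chart
    # path, release name and file names are placeholders.
    import subprocess
    import sys

    ENVIRONMENTS = {"staging", "production"}

    def deploy(environment: str) -> None:
        """Run the exact same Helm release command for any environment."""
        if environment not in ENVIRONMENTS:
            sys.exit(f"unknown environment: {environment}")
        subprocess.run(
            [
                "helm", "upgrade", "--install", "my-service", "./chart",
                "--namespace", environment,
                "--values", f"values-{environment}.yaml",
                "--wait",
            ],
            check=True,  # fail the pipeline run if the rollout fails
        )

    if __name__ == "__main__":
        deploy(sys.argv[1])

Because staging and production run through the same function, any breakage in the release tooling surfaces in staging before it can reach production.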
What processes does your team have in place for monitoring and maintaining the staging environment?
Prometheus and Grafana are now the accepted standard in the Kubernetes world and that’s what we use, too. Our services all export Prometheus metrics, which are then gathered and read into Grafana Cloud. From there, teams are responsible for setting up graphs and alerts for all clusters, including staging and production. Software and DevOps engineers collaborate to create monitoring and alerts that serve our needs to maintain high availability for our services.
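For readers unfamiliar with the export side, the snippet below is a minimal sketch of a service exposing Prometheus metrics with the prometheus_client library; the metric names and port are illustrative, not GRAIL's.

    # Minimal sketch of a service exporting Prometheus metrics with the
    # prometheus_client library; metric names and port are illustrative.
    import random
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests handled")
    QUEUE_DEPTH = Gauge("app_queue_depth", "Items waiting to be processed")

    if __name__ == "__main__":
        start_http_server(8000)  # serves /metrics for Prometheus to scrape
        while True:
            REQUESTS.inc()                          # simulate handling a request
            QUEUE_DEPTH.set(random.randint(0, 10))  # simulate queue pressure
            time.sleep(1)

Prometheus scrapes the /metrics endpoint this opens, and dashboards and alerts are then built in Grafana on top of the stored series.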
Without observability, it’s impossible to guarantee the stability of our critical infrastructure, and that naturally extends to our staging environment. If staging is down, it not only impacts our ability to deploy but also costs the company time and effort to troubleshoot. Furthermore, the alerts and graphs themselves must be tested and demonstrated to work, and what better place to do that than an environment intended to match production?
What’s a common mistake engineering teams make when it comes to staging environments?
I would like to believe that most engineering teams have resolved some of the yesteryear complications, such as data not matching production or out-of-date third-party software versions. However, a lesser-acknowledged issue is that most engineering teams forget to actually use their staging environment.
It’s easy to forget that staging is there! Most developers spend their time in development environments or troubleshooting production. Often it’s only the managers, QA and operations teams who really look at staging, since it’s mostly used to validate software versions before a production deployment. Engineers need to hop in and stay with their services throughout. If you have access to monitoring dashboards for your clusters, make sure you stay up to date with the state of your software in the staging environment, too. Check your logs after you deploy and don’t wait for people to come hounding you for answers when something breaks!
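In that spirit, a post-deploy log check can be as small as the sketch below, which assumes kubectl access to the staging cluster; the deployment name, namespace and error pattern are placeholders, not details from the interview.

    # Hedged sketch of a quick post-deploy log check, assuming kubectl access
    # to the staging cluster; deployment name and namespace are placeholders.
    import subprocess

    def recent_errors(deployment: str, namespace: str = "staging",
                      window: str = "10m") -> list[str]:
        """Return recent log lines containing ERROR (kubectl picks one pod
        when given a deployment/ target)."""
        logs = subprocess.run(
            ["kubectl", "logs", f"deployment/{deployment}",
             "--namespace", namespace, f"--since={window}"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [line for line in logs.splitlines() if "ERROR" in line]

    if __name__ == "__main__":
        errors = recent_errors("my-service")
        print(f"{len(errors)} error lines in the last 10 minutes")
        for line in errors[:20]:
            print(line)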