Site Reliability Engineer

Posted 12 Days Ago
Hiring Remotely in Canada
Remote
97K-149K Annually
Senior level
Artificial Intelligence • Software • Analytics • Utilities
The Role
The role focuses on building a reliability practice, defining SLIs, managing incidents, mentoring engineers, and establishing engineering standards for AI workloads. It emphasizes proactive reliability and strategic AWS usage.
The Opportunity

This is your chance to build a reliability practice from the ground up and establish the engineering standards—including SLOs, error budgets, and observability—that will protect our platform as we scale for enterprise customers and expand our AI capabilities. You’ll have the autonomy to set the strategy and the trust to execute it, ensuring that our AI workloads (Evals, RAG, and LLM processing) meet the highest reliability standards. If you are a proactive problem solver who treats toil as an engineering challenge and wants the agency to decide which technologies to adopt and when, you will find this to be a career-defining role.

What You'll Do

As a Staff or Senior Staff SRE, you’ll hit the ground running by partnering with the engineers currently managing reliability to transition the organization from reactive firefighting to a proactive, disciplined reliability practice. You will lead the deliberate evolution of our infrastructure, recognizing the inflection point for new tooling and migrating away from manual scripts and templates only when the replacement tooling has earned its keep. Whether you are architecting incident response structures or solving novel reliability problems for AI agents, your work will act as a multiplier that empowers the entire engineering team.

By bringing a consulting mindset to every challenge, you’ll propose technical trade-offs based on evidence rather than reflex, ensuring our roadmap for multi-region or service mesh adoption is built for tomorrow. You won't just be handed tasks; you will own the strategy for production-readiness and deploy safety, building the organizational trust needed to make reliability a core differentiator of our product.

The Skills You'll Have

Deep SRE Expertise

  • Define SLIs and SLOs for critical user journeys and use them to drive proactive engineering decisions.

  • Run live production incident response as an Incident Commander and lead blameless postmortems that result in shipped follow-up actions.

  • Build observability that tells a story: dashboards that explain a system's behavior to someone seeing it for the first time, paired with actionable alerts.

  • Take an organization from reactive firefighting to a working reliability practice with measurable improvements in paging volume.

  • Design error-budget policies and use them to make data-driven trade-offs between shipping features and maintaining reliability.

Deep Technical Expertise in AWS

  • Design and operate services on AWS competently, including VPC, IAM, compute (ECS/EC2/Lambda), managed data services, and load balancing.

  • Navigate our current setup of CloudFormation and bash scripts via GitHub Actions effectively without reaching for Terraform reflexively.

  • Debug production AWS issues at the network and IAM level without escalating to AWS support as a first step.

  • Design and roll out production workloads across multiple regions and countries while accounting for data residency and regional failure modes.

  • Lead high-stakes tooling migrations into established environments and manage the long-term consequences of those architectural choices.

Impact, Leadership & Team Enablement

  • Mentor engineers through pair debugging, postmortem coaching, and runbook reviews to leave the team more capable.

  • Define alerts for impactful metrics and write the clear, actionable runbooks that go with them.

  • Work with engineering teams to gather requirements for new infrastructure and conduct constructive production-readiness reviews.

  • Teach teams how to build their own observability dashboards, raising the technical floor across the entire organization.

  • Use AI tooling aggressively, including coding agents and log analysis, to accelerate the delivery of impactful changes.

Communication & Influence

  • Communicate scheduled downtime and infrastructure changes to stakeholders proactively with clear timing and expected impact.

  • Write postmortems that both engineers and non-engineers can read, understand, and learn from.

  • Act as the recognized Subject Matter Expert for AWS-related questions across the engineering organization.

  • Influence product and engineering roadmap decisions by using data and evidence rather than opinion when reliability is a factor.

  • Build organizational trust so that teams seek out the SRE practice early in the development cycle to make their work better.

Within 90 Days, You'll

  • Fully onboard and partner with the engineers currently managing reliability to review and revise the existing operational plan.

  • Operationalize high-leverage items to transition the team out of reactive firefighting and into a more stable, proactive state.

  • Establish a baseline for current system behavior by identifying the most critical user journeys that require immediate SLI/SLO definitions.

Within 180 Days, You'll

  • Independently drive the revised reliability plan, ensuring SLIs/SLOs are in place and actively used to guide engineering decisions.

  • Standardize the incident response structure, including severity definitions, Incident Commander roles, and a cadence for blameless postmortems.

  • Measurably reduce paging volume and ensure that every alert that pages an engineer is backed by a clear, effective runbook.

Within 365 Days, You'll

  • Establish a mature reliability practice where production-readiness reviews and error-budget conversations are default parts of the development lifecycle.

  • Define a clear, evidence-based tooling roadmap for the next phase of our evolution, such as Terraform, service mesh, or multi-region expansion.

  • Serve as an organizational multiplier, having built the observability and culture necessary for other engineers to reason about reliability without constant supervision.

Skills Required

  • Deep expertise in Site Reliability Engineering
  • Proficiency in AWS services and architecture
  • Experience with incident response and blameless postmortems
  • Ability to mentor and enable engineering teams
  • Strong communication and influence in reliability practices

The Company
Montclair, NJ
301 Employees
Year Founded: 2013

What We Do

Our Mission: Power the successful deployment of critical infrastructure.

Sitetracker, Inc. is the global standard for deploying, operating and servicing critical infrastructure and technology. The Sitetracker Platform enables growth-focused innovators to optimize the entire asset lifecycle through native platform inclusions like AI, automation, and actionable analytics. From the field to the C-suite, Sitetracker enables stakeholders to optimize how they plan, deploy, maintain, and grow their capital asset portfolios. Market leaders in the telecommunications, alternative energy, and utility industries, such as Ericsson, Fortis, Google, British Telecom, and Vodafone, rely on Sitetracker to manage millions of sites and projects representing over $25 billion of portfolio holdings globally.


