Sitetracker

Site Reliability Engineer

Posted 12 Days Ago

Hiring Remotely in Canada

Remote

97K-149K Annually

Senior level

Artificial Intelligence • Software • Analytics • Utilities

The Role

The role focuses on building a reliability practice, defining SLIs, managing incidents, mentoring engineers, and establishing engineering standards for AI workloads. It emphasizes proactive reliability and strategic AWS usage.

Summary Generated by Built In

The Opportunity

This is your chance to build a reliability practice from the ground up and establish the engineering standards—including SLOs, error budgets, and observability—that will protect our platform as we scale for enterprise customers and expand our AI capabilities. You’ll have the autonomy to set the strategy and the trust to execute it, ensuring that our AI workloads (Evals, RAG, and LLM processing) meet the highest reliability standards. If you are a proactive problem solver who treats toil as an engineering challenge and wants the agency to decide which technologies to adopt and when, you will find this to be a career-defining role.

What You'll Do

As a Staff or Senior Staff SRE, you’ll hit the ground running by partnering with the engineers currently managing reliability to transition the organization from reactive firefighting to a proactive, disciplined reliability practice. You will guide the deliberate evolution of our infrastructure, recognizing the inflection point for new tooling and leading migrations away from manual scripts and templates only when the new tooling has earned its keep. Whether you are architecting incident response structures or solving novel reliability problems for AI agents, your work will act as a multiplier that empowers the entire engineering team.

By bringing a consulting mindset to every challenge, you’ll propose technical trade-offs based on evidence rather than reflex, ensuring our roadmap for multi-region or service mesh adoption is built for tomorrow. You won't just be handed tasks; you will own the strategy for production-readiness and deploy safety, building the organizational trust needed to make reliability a core differentiator of our product.

The Skills You'll Have

Deep SRE Expertise

Define SLIs and SLOs for critical user journeys and use them to drive proactive engineering decisions.
Run live production incident response as an Incident Commander and lead blameless postmortems that result in shipped follow-up actions.
Build observability that tells a story, with dashboards that explain a system's behavior to someone seeing it for the first time and alerts that are actionable.
Take an organization from reactive firefighting to a working reliability practice with measurable improvements in paging volume.
Design error-budget policies and use them to make data-driven trade-offs between shipping features and maintaining reliability.
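
For readers less familiar with the terms, the relationship between an SLO, its error budget, and a release policy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 25% threshold, and the "ship / slow down / freeze" outcomes are hypothetical and do not describe Sitetracker's actual policy.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the SLO as a fraction (e.g. 0.999 for 99.9%).
    The result is 1.0 when no budget is spent and negative when overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events  # budget, in events
    return 1.0 - bad_events / allowed_bad


def release_decision(remaining: float) -> str:
    """Map remaining budget to a policy outcome (thresholds are hypothetical)."""
    if remaining <= 0.0:
        return "freeze"      # budget exhausted: prioritize reliability work
    if remaining < 0.25:
        return "slow down"   # budget nearly spent: ship with extra caution
    return "ship"


if __name__ == "__main__":
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"remaining budget: {remaining:.2%}, decision: {release_decision(remaining)}")
```

The point of such a policy is that the trade-off between shipping features and maintaining reliability is decided by a measured number agreed on in advance, not by opinion in the moment.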

Deep Technical Expertise in AWS

Design and operate services on AWS competently: VPC, IAM, compute (ECS/EC2/Lambda), managed data services, and load balancing.
Navigate our current setup of CloudFormation and bash scripts via GitHub Actions effectively without reaching for Terraform reflexively.
Debug production AWS issues at the network and IAM level without escalating to AWS support as a first step.
Design and roll out production workloads across multiple regions and countries while accounting for data residency and regional failure modes.
Lead high-stakes tooling migrations into established environments and manage the long-term consequences of those architectural choices.

Impact, Leadership & Team Enablement

Mentor engineers through pair debugging, postmortem coaching, and runbook reviews to leave the team more capable.
Define alerts for impactful metrics and write the clear, actionable runbooks that go with them.
Work with engineering teams to gather requirements for new infrastructure and conduct constructive production-readiness reviews.
Teach teams how to build their own observability dashboards, raising the technical floor across the entire organization.
Use AI tooling aggressively, including coding agents and log analysis, to accelerate the delivery of impactful changes.

Communication & Influence

Communicate scheduled downtime and infrastructure changes to stakeholders proactively with clear timing and expected impact.
Write postmortems that both engineers and non-engineers can read, understand, and learn from.
Act as the recognized Subject Matter Expert for AWS-related questions across the engineering organization.
Influence product and engineering roadmap decisions by using data and evidence rather than opinion when reliability is a factor.
Build organizational trust so that teams seek out the SRE practice early in the development cycle to make their work better.

Within 90 Days, You'll

Fully onboard and partner with the engineers currently managing reliability to review and revise the existing operational plan.

Operationalize high-leverage items to transition the team out of reactive firefighting and into a more stable, proactive state.

Establish a baseline for current system behavior by identifying the most critical user journeys that require immediate SLI/SLO definitions.

Within 180 Days, You'll

Independently drive the revised reliability plan, ensuring SLIs/SLOs are in place and actively used to guide engineering decisions.

Standardize the incident response structure, including severity definitions, Incident Commander roles, and a cadence for blameless postmortems.

Measurably reduce paging volume and ensure that every alert that pages an engineer is backed by a clear, effective runbook.

Within 365 Days, You'll

Establish a mature reliability practice where production-readiness reviews and error-budget conversations are default parts of the development lifecycle.

Define a clear, evidence-based tooling roadmap for the next phase of our evolution, such as Terraform, service mesh, or multi-region expansion.

Serve as an organizational multiplier, having built the observability and culture necessary for other engineers to reason about reliability without constant supervision.

Skills Required

Deep expertise in Site Reliability Engineering
Proficiency in AWS services and architecture
Experience with incident response and blameless postmortems
Ability to mentor and enable engineering teams
Strong communication and influence in reliability practices

The Company

Montclair, NJ

301 Employees

Year Founded: 2013

What We Do

Our Mission: Power the successful deployment of critical infrastructure.

Sitetracker, Inc. is the global standard for deploying, operating and servicing critical infrastructure and technology. The Sitetracker Platform enables growth-focused innovators to optimize the entire asset lifecycle through native platform inclusions like AI, automation, and actionable analytics. From the field to the C-suite, Sitetracker enables stakeholders to optimize how they plan, deploy, maintain, and grow their capital asset portfolios. Market leaders in the telecommunications, alternative energy, and utility industries, such as Ericsson, Fortis, Google, British Telecom, and Vodafone, rely on Sitetracker to manage millions of sites and projects representing over $25 billion of portfolio holdings globally.