Lead Site Reliability Engineer - Infrastructure

Sorry, this job was removed at 02:10 p.m. (CST) on Thursday, Jun 04, 2026
Hiring Remotely in United States
Remote or Hybrid
160K-180K Annually
Artificial Intelligence • Security • Software • Analytics • Big Data Analytics
Learn from the past. Understand the present. Predict the future.
The Role
JOB DESCRIPTION
We are seeking a Lead Site Reliability Engineer (Infrastructure) to act as technical lead for our Infrastructure SRE team in a fast-moving VSaaS engineering organization. In this role, you will own the team's technical direction and execution across reliability, scalability, and operability of our shared platform and production systems, combining hands-on technical leadership with responsibility for team outcomes.
You will define SRE strategy and guide architecture across our GCP and Kubernetes ecosystem, setting standards for reliability, scalability, GitOps, and observability. You will also mentor senior and staff engineers, and lead incident response and high-impact operational work, contributing hands-on when needed.
Role Overview
Site Reliability Engineer - Infrastructure
In this role, you will translate product and business needs into scalable infrastructure and clear technical direction. With a system-wide view of the platform, you will guide architectural decisions, surface non-obvious risks, and drive long-term improvements to system reliability and operability.
Working closely with product and platform teams, you will shape the developer experience and ensure engineering teams can ship with speed and confidence. You will set engineering standards and continuously evolve our GitOps and observability practices.
This role requires strong expertise in cloud infrastructure, distributed systems, and CI/CD, along with hands-on experience in Golang and/or Python to support automation and long-term system reliability.
Responsibilities
As a Lead Site Reliability Engineer, you will:
  • Team Leadership & Execution Ownership: Own technical direction and execution of the Infrastructure SRE team. Translate platform goals into actionable plans, ensuring alignment on priorities, reliability outcomes, and operational excellence across production systems.

  • Production Operations & Incident Management: Operate and evolve large-scale distributed systems in production, proactively identifying failure modes and mitigating risk. Own day-to-day operations including monitoring, alerting, incident response, coordination, post-incident analysis, and continuous improvement.

  • Architecture, Standards & Platform Governance: Provide architectural leadership across platform and infrastructure changes, identifying scalability constraints, system design risks, and long-term reliability gaps. Define and enforce engineering standards for GCP, Kubernetes, and ArgoCD, ensuring consistent, secure, GitOps-based delivery.

  • Reliability Engineering & Observability: Lead strategy for monitoring, alerting, and system observability, driving a shift from reactive incidents to proactive reliability engineering.

  • Enablement, CI/CD & Collaboration: Guide CI/CD and cloud-native delivery practices at scale to ensure safe, scalable releases. Mentor senior and staff engineers, conduct high-impact design and code reviews (Golang/Python), and partner with product and engineering teams to embed system-level thinking across development.

  • Hands-on Technical Contribution: Provide hands-on technical contribution where needed, including debugging production issues, reviewing and contributing to code, and supporting critical incident resolution to ensure system reliability and team effectiveness.
  • Other duties as assigned are absorbed into the above ownership and operational responsibilities.

Minimum Qualifications
  • Leadership & Experience: 10+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure Engineering, including demonstrated experience leading technical engineering teams, driving roadmaps, and owning delivery of large-scale production systems.

  • Cloud & Distributed Systems Expertise: Deep experience with cloud-native architectures and distributed systems at scale, particularly in GCP and Kubernetes environments. Ability to reason about system design, identify failure modes, and evaluate scalability and reliability risks.

  • GitOps & Delivery Engineering: Strong experience with GitOps-based delivery workflows, particularly ArgoCD, and CI/CD pipeline design. Ability to ensure safe, repeatable, and observable production deployments.

  • Infrastructure & Automation: Strong hands-on background in infrastructure-as-code (Terraform preferred), automation, and operational tooling. Proficiency in Golang and/or Python for building and reviewing production systems. Strong Linux systems knowledge and production troubleshooting experience.

  • Observability & Reliability Engineering: Experience designing or operating observability systems (logging, monitoring, alerting) and applying SRE principles such as SLOs, incident management, postmortems, and reliability engineering practices.

  • Technical Oversight & Engineering Quality: Ability to review and critique system design and production code, ensuring engineering quality across backend systems and infrastructure components.

  • Communication & Leadership Influence: Ability to influence technical direction, communicate trade-offs to stakeholders, and drive alignment across product and engineering teams on reliability and platform priorities.

Why Milestone?
Milestone offers not only great benefits but also a great culture. Employees here have flexible work environments, opportunities for further education, and the ability to effect change in our organization directly.
The annual salary for this position ranges from $160,000 to $180,000. Pay is based on the level, location, complexity, responsibility, and job duties of the specific position and is just one component of Milestone's total compensation package. Additionally, we offer an attractive benefits package that includes medical/dental benefits, FSA or HSA, 401(k) with 6% Safe Harbor employer match, paid parental leave, generous PTO (20 days' vacation, 10 days' paid sick time, and 12 company holidays), fully paid short-term disability policy, fully paid long-term disability policy, and life insurance. If you are selected for an interview, please feel welcome to speak to our Talent Partner about our compensation philosophy.
All employees must complete a background check. Employees in fiscal roles are also required to undergo a credit check. All information obtained during these checks is handled confidentially and shared only with authorized personnel.
Milestone is committed to creating a diverse and inclusive workplace and is proud to be an equal opportunity employer.
Contact and application
Please apply at our website: www.milestonesys.com
We look forward to receiving your application.

Milestone Systems Compensation & Benefits Highlights

  • Affordable Benefits: Health coverage costs are kept low, with the employer covering most premiums for medical, dental, and vision and providing employer-paid disability and life insurance. An Employee Assistance Program further supports access to care.
  • Retirement Support: Retirement savings are bolstered by a dollar-for-dollar 401(k) match on a meaningful portion of pay with immediate vesting. Auto-enrollment helps participation start early.
  • Leave & Time Off Breadth: Paid time off is substantial, with separate vacation, sick time, a robust holiday schedule, and paid parental leave. Feedback suggests the time-off policies are viewed as generous.

The Company
Lake Oswego, OR
1,500 Employees
Year Founded: 1998

What We Do

At Milestone Systems, we are dedicated to making the world see. As a leading provider of data-driven video technology software, we empower people, businesses, and societies with innovative solutions that enhance security, efficiency, and insight.

Why Work With Us

We’re proud to foster a working environment that supports well-being and growth opportunities. At the organizational level, we celebrate our team members and value their personal expertise. Everyone has access to personal growth programs and health initiatives, along with the freedom to govern their work-life balance.

Milestone Systems Offices

Hybrid Workspace

Employees engage in a combination of remote and on-site work.

If you live within a reasonable distance of our Lake Oswego, OR office, this role will be hybrid, with 3 days a week in the office.

Typical time on-site: 3 days a week
