Top Site Reliability Engineer Jobs

Okta

Staff Site Reliability Engineer, Kubernetes w/ active TS/SCI

Reposted 3 Days Ago

In-Office

Washington, DC, USA

188K-259K Annually

Senior level

Cloud

The Staff Site Reliability Engineer will lead the design of AWS solutions, manage incident responses, and mentor junior engineers, ensuring reliability and security in federal environments.

Top Skills: AWS, Databricks, Go, Helm, Kubernetes, Redshift, Snowflake, Terraform

Watershed Informatics

Site Reliability Engineer

Reposted 3 Days Ago

In-Office

Cambridge, MA, USA

Mid level

Cloud • Information Technology • Biotech

The Site Reliability Engineer will build and deploy Linux servers, research technologies, monitor system performance, and resolve technical incidents.

Top Skills: Infrastructure-as-Code, Linux, Networking, Virtualization

Zoox

Staff Software Engineer - SRE, GitHub & CI/CD Infrastructure

Reposted 3 Days Ago

Hybrid

Foster City, CA, USA

250K-300K Annually

Senior level

Artificial Intelligence • Machine Learning • Robotics • Software • Transportation • Design • Manufacturing

The Staff Site Reliability Engineer will lead source control strategy, manage Git-based monorepo operations, improve developer productivity, and oversee migrations to GitHub Cloud.

Top Skills: Bazel, Buck, Buildkite, Gerrit, GitHub Actions, GitHub Cloud, GitHub Enterprise, GitLab CI, Jenkins, Pulumi, Reviewable, Terraform

IntelliPro Group Inc.

Cloud SRE Engineer - Mandarin Bilingual

Reposted 3 Days Ago

In-Office

Palo Alto, CA, USA

70-100 Hourly

Entry level

HR Tech • Information Technology

Looking for a Cloud SRE Engineer to ensure reliability and stability of cloud services, perform troubleshooting, and collaborate across teams. Bilingual in Mandarin preferred.

Top Skills: AWS, Azure, CloudWatch, GCP, Go, Grafana, Kubernetes, Prometheus, Python, Shell

Socure

Senior Software Engineer - SRE

Reposted 3 Days Ago

Remote or Hybrid

4 Locations

160K-180K Annually

Senior level

Artificial Intelligence • Machine Learning • Software • Analytics

The role involves end-to-end ownership of AWS infrastructure, managing Kubernetes platforms, and ensuring system reliability through observability and automation. Responsibilities include incident response and maintaining CI/CD systems.

Top Skills: Argo CD, AWS, Datadog, Git, Go, Kubernetes, Python, Terraform

SpaceX

Site Reliability Engineer (Application Software)

Reposted 3 Days Ago

In-Office

Hawthorne, CA, USA

125K-175K Annually

Mid level

Aerospace • Other

The Site Reliability Engineer will manage and maintain mission-critical applications, improve software development processes, and provide end-user support, emphasizing safety and performance optimization.

Top Skills: Ansible, Bazel, Buck, C#, C++, ClickHouse, Docker, JavaScript, Kubernetes, Linux, Make, MySQL, Postgres, Puppet, Python, Terraform

Encora

Senior Application Support Engineer (SRE)

Reposted 3 Days Ago

Remote

United States

Mid level

Software • Consulting

The Senior Application Support Engineer leads efforts to ensure application reliability, manages incidents, collaborates with teams, and monitors performance, providing 24/7 support.

Top Skills: AppDynamics, AWS, Datadog, Linux, MuleSoft, OpenTelemetry, Python, ServiceNow, Splunk

Pico (pico.net)

Site Reliability Engineer

Reposted 3 Days Ago

In-Office

New York, NY, USA

115K-125K Annually

Mid level

Fintech • Payments • Financial Services

The Site Reliability Engineer will assist clients with Redline products, manage production environments, troubleshoot issues, and ensure automation and customer satisfaction.

Top Skills: C/C++, Java, Linux, Python

T. Rowe Price

Principal Site Reliability Engineer, Infrastructure Observability

Reposted 3 Days Ago

In-Office

Owings Mills, MD, USA

159K-339K Annually

Senior level

Financial Services

As a Principal Site Reliability Engineer, you'll lead a team focusing on observability and automating solutions for cloud and on-prem infrastructures, enhancing reliability and incident response across T. Rowe Price's tech ecosystem.

Top Skills: .NET Core, Amazon AWS, Ansible, Elastic Stack, Go, Grafana, Java, MySQL, New Relic, Node.js, Postgres, Prometheus, Python, SolarWinds DPA, Splunk, SQL Server, Terraform, Vagrant, Vault

Learning Technologies Group plc

Site Reliability Engineer (Rustici) US, Franklin, Remote

Reposted 3 Days Ago

In-Office or Remote

Franklin, TN, USA

Mid level

Edtech

The Site Reliability Engineer enhances application deployment in AWS, monitors systems, improves automation, and collaborates with teams on security and performance.

Top Skills: Ansible, AWS, CloudFormation, CSS, Docker, GitHub Actions, Go, HTML, Infrastructure as Code, Java, JavaScript, Jenkins, Kubernetes, Python, Terraform, TypeScript

Cox Automotive Inc.

Lead Site Reliability Engineer

4 Days Ago

In-Office

Austin, TX, USA

167K-204K Annually

Senior level

Automotive • Information Technology • Logistics • Software

Lead Site Reliability Engineer implements IaC and automation, builds observability (SLIs/SLOs, dashboards, alerting), manages incident response, runbooks, gamedays, postmortems, and drives SRE/DevOps best practices, AppSec integration, testing, and CI/CD improvements across teams.

Top Skills: AppSec, AWS, AWS CloudFormation, C#, CI/CD, CloudSploit, CloudWatch, Data Theorem, Datadog, Grafana, IaC, Infrastructure as Code, Java, New Relic, Python, Terraform, Veracode

Green Dot Corporation

Lead Site Reliability Engineer

4 Days Ago

In-Office

Los Angeles, CA, USA

140K-199K Annually

Senior level

Fintech • Financial Services

Lead SRE responsible for reliability, scalability, and performance of systems. Design automated deployments, build and govern monitoring/observability, define SLIs/KPIs, collaborate across teams to improve release and delivery processes, and participate in on-call incident response.

Top Skills: Alerting, AWS, Azure, Bash, GCP, Logging, Messaging/Event Bus, Metrics, Monitoring, Observability, PowerShell, Python

Early Warning

Staff Site Reliability Engineer - Paze

4 Days Ago

In-Office

3 Locations

116K-174K Annually

Senior level

Fintech

Lead SRE work in partnership with development teams to design and implement availability, scalability, observability, and automation for production systems. Build tooling, manage incident response and RCAs, optimize capacity and performance, mentor engineers, maintain runbooks, and participate in a 24x7 on-call rotation.

Top Skills: Aurora, AWS, Chef, CI/CD, Docker, DynamoDB, Git, Go, IP, Java, JavaScript, Jenkins, JMS, Kafka, Kubernetes, Linux, Maven, Memcached, Microservices, Observability, Oracle, Python, Redis, Ruby, SQS, Swarm, TCP, UDP

Ad Hoc

Site Reliability Engineer

4 Days Ago

In-Office

McLean, VA, USA

125K-135K Annually

Senior level

Software

Ensure availability, performance, and reliability of a federal cloud platform. Monitor platform health and SLOs, build observability (metrics, logging, alerting, dashboards), participate in on-call and incident response, run postmortems, automate operational toil, support capacity planning and performance tuning on AWS/EKS, implement infrastructure as code with Terraform, and collaborate with application teams and government partners.

Top Skills: Amazon EKS, AWS, Kubernetes, Terraform

SEI

Senior DevOps/SRE Engineer

4 Days Ago

In-Office

2 Locations

140K-170K Annually

Senior level

Financial Services

Design, build, and operate reliable cloud infrastructure and networking (multi-account AWS, VPC, IAM). Implement IaC, CI/CD pipelines, observability (logging/metrics/alerting), automation, and reliability guardrails. Provide production support and incident response, perform root cause analysis, and collaborate with application teams to co-own system design and continuous improvement, using AI-assisted tools where appropriate.

Top Skills: .NET, AI-Assisted Tools (Claude Code, GitHub Copilot, Windsurf), AWS, AWS Organizations, Bash, CI/CD, CloudFormation, Elastic Stack, Git, IAM, Infrastructure as Code, Java, Jenkins, Node.js, Observability, OpenSearch, PowerShell, Python, Terraform, VPC

Fortinet

Site Reliability Engineer

4 Days Ago

In-Office

Sunnyvale, CA, USA

170K-200K Annually

Senior level

Security • Software • Cybersecurity

Hands-on Site Reliability Engineer responsible for building and maintaining cloud infrastructure, CI/CD pipelines, observability (logging/monitoring/tracing), automation, and security best practices. Manage datacenter resources, troubleshoot clusters and services, collaborate with engineering teams for deployments, and participate in on-call incident response to ensure high availability and performance.

Top Skills: Ansible, Argo CD, Bash, Chef, Datadog, ELK, GitLab CI, Go, Grafana, Jenkins, Kubernetes, Linux, Prometheus, Python, Rancher

LSEG (London Stock Exchange Group)

Director of SRE

Reposted 4 Days Ago

In-Office

2 Locations

Expert/Leader

Fintech • Analytics

The Director of SRE leads a global SRE organization, driving operational excellence, incident management, automation, and reliability across financial systems while mentoring teams and improving collaboration with stakeholders.

Top Skills: Agentic AI, Ansible, API Gateway, Aurora, AWS, C#.NET, CI/CD, CloudWatch, Datadog, DynamoDB, ECS, EKS, ELK, Git, Go, Grafana, Java, Lambda, Linux, OpenTelemetry, Postgres, Prometheus, Python, SQL Server, Sybase, Terraform, Unix

Cisco ThousandEyes

Senior Site Reliability Engineer (FedRAMP) - ThousandEyes

Reposted 9 Days Ago

Hybrid

3 Locations

147K-278K Annually

Senior level

Cloud • Software

Responsible for maintaining FedRAMP-compliant infrastructure, collaborating with software engineers, and ensuring system availability and security. Duties include infrastructure design, automation, monitoring, and incident response.

Top Skills: AWS, Go, Kubernetes, Puppet, Python, Terraform

ADT

Site Reliability Engineer

4 Days Ago

In-Office

Blue Bell, PA, USA

Mid level

Security

Maintain and improve reliability, scalability, and performance of distributed systems. Build and manage infrastructure as code, support cloud and Kubernetes environments, implement observability and monitoring, participate in incident response and on-call rotations, and collaborate with cross-functional teams to drive operational excellence.

Top Skills: Ansible, AWS, Bash, CI/CD, Dynatrace, GCP, Java, Kubernetes, Prometheus, Python, Terraform

Andromeda (andromeda.ai)

Site Reliability Engineer - AI Infrastructure

Reposted 4 Days Ago

In-Office or Remote

8 Locations

Senior level

Artificial Intelligence • Cloud • Information Technology • Software

The Site Reliability Engineer will provision and manage Kubernetes clusters, build automation tools, debug customer issues, and improve infrastructure reliability.

Top Skills: Ansible, Bash, Datadog, Go, Grafana, Helm, Kubernetes, Loki, Prometheus, Python, Terraform

AutoZone

Systems Engineer – SRE Enablement

5 Days Ago

In-Office

Memphis, TN, USA

Senior level

Automotive • eCommerce • Retail • Sales

Lead SRE enablement by defining SLI/SLO frameworks, production readiness, and reliability playbooks. Build and standardize observability (Dynatrace), provide alerting/dashboard/runbook templates, coach teams on SRE practices, run training, participate in incident post-mortems, and report enterprise reliability metrics while advising on architecture for hybrid GCP and on-prem environments.

Top Skills: Ansible, APM, Dynatrace, Go, Google Cloud Platform (GCP), Java, Kubernetes, Observability, Python, Terraform

Finalsite

Staff Site Reliability Engineer

5 Days Ago

In-Office or Remote

The Center, IN, USA

180K-250K Annually

Expert/Leader

Edtech • Information Technology • Software

Lead infrastructure, reliability, and observability across multi-cloud environments. Improve CI/CD, IaC standards, staging parity, Kubernetes operations, monitoring and SLOs, incident response, and platform modernization while partnering with engineering teams.

Top Skills: AI-Assisted Development Tools (Claude Code, Codex), Autoscaling, AWS, CI/CD Pipelines, Event-Driven Architectures, GCP, Incident Management, Infrastructure-as-Code, Kubernetes, Monitoring, Observability, Python, Queue-Based Architectures, Ruby on Rails, SLO Frameworks, Terraform

Pragmatike

SRE / Network Engineer (MAAS) - Remote US

5 Days Ago

Remote or Hybrid

Chicago, IL, USA

Senior level

Information Technology • Software

Seeking an SRE/Network Engineer with deep MAAS and bare-metal automation expertise to manage hundreds of nodes across distributed sites. Responsibilities include Linux administration, hardware-level diagnostics (BIOS/IPMI/RAID), network design (VLANs/L2-L3/VPN/UniFi), infrastructure automation (Ansible, Bash/Python, Git), observability (Prometheus/Grafana, ELK/Graylog/Loki), PXE/MAAS-based OS provisioning, API integrations, virtualization (OpenStack/Kolla-Ansible, Proxmox, VMware), and container workload support.

Top Skills: Ansible, Bash, BIOS, cloud-init, Cloudflare API, Debian, ELK, Git, GitOps, Grafana, Graylog, IPMI, Ironic, Kolla-Ansible, L2 Routing, L3 Routing, Linux, Loki, MAAS, OpenStack, Preseed, Prometheus, Proxmox VE, PXE, Python, RAID, Ubuntu, UniFi, VLAN, VMware ESXi, VPN

Nebius

Staff Network Site Reliability Engineer

5 Days Ago

Remote

United States

180K-224K Annually

Senior level

Artificial Intelligence • Information Technology • Consulting

Build and operate Nebius's network infrastructure: define SLIs/SLOs, improve site and inter-site reliability, lead incident response and postmortems, develop observability and alerting, automate change workflows, and collaborate with network and platform teams to embed operability.

Top Skills: CI/CD, Container Platforms, Go, Infrastructure as Code, Linux, Python

Akamai Technologies

Site Reliability Engineer

5 Days Ago

In-Office or Remote

2 Locations

76K-136K Annually

Mid level

Cloud • Security • Software • Cybersecurity

Design, develop, test, and operate scalable infrastructure and services for Akamai Cloud. Implement and manage Infrastructure-as-Code (Terraform and similar tools), CI/CD, and observability. Automate reliability improvements, mentor engineers, collaborate on incident response and root-cause remediation, and participate in on-call rotations.

Top Skills: Ansible, Chef, CI/CD, Infrastructure as Code, Linux, Observability (Monitoring, Logging, Alerting), Puppet, SaltStack, Terraform