Top Site Reliability Engineer Jobs

3 Days Ago
In-Office
McLean, VA, USA
135K-150K Annually
Senior level
Software
Lead reliability for a large federal cloud platform: define SLOs, build observability, run incident response and postmortems, automate toil, design AWS/EKS infrastructure, mentor engineers, and present reliability designs to stakeholders.
Top Skills: Amazon EKS, AWS, AWS Certification, CKA, CKS, FedRAMP, Kubernetes, NIST 800-53, Observability (Metrics/Logging/Tracing/Alerting), SLOs, Terraform, Zero-Trust
3 Days Ago
Hybrid
Philadelphia, PA, USA
Senior level
Payments
Senior SRE responsible for ensuring high availability and resiliency of a global payments platform by building observability, automations, AI-driven remediation, incident response, and self-healing workflows; participates in an on-call rotation and works a hybrid schedule based in Philadelphia.
Top Skills: AIOps, AKS, Anthropic (Claude), APM, Azure, Azure AI (Foundry), Azure SRE Agent, CI/CD, Datadog, DNS, Dynatrace, HTTP, HTTPS, IIS, Kubernetes, Load Balancing, New Relic, OpenAI (Codex), PagerDuty Process Automation, PowerShell, Python, Rundeck, SQL, T-SQL, TCP/IP, VMware, Windows Server
3 Days Ago
In-Office
3 Locations
153K-192K Annually
Senior level
Big Data • Fintech • Mobile • Payments • Financial Services • Data Privacy
Senior SRE responsible for designing and maturing cloud reliability on GCP/Azure: build observability, define SLIs/SLOs, create Terraform modules and CI/CD automation, lead incident/root-cause investigations, partner with security/governance, mentor engineers, and drive platform resiliency and production readiness.
Top Skills: Azure Log Analytics, Azure Resource Graph, CI/CD, DevSecOps, DNS, Dynatrace, Firewalls, GenAI, Google Cloud Platform (GCP), IAM, Load Balancing, Azure, Policy-as-Code, Terraform, Terraform Enterprise, VPC
Reposted 3 Days Ago
Hybrid
2 Locations
Senior level
Healthtech
As a Senior Site Reliability Engineer, you will ensure the reliability and performance of our Azure-based healthcare platform, implementing SRE practices, driving incident management, and automating operational tasks.
Top Skills: Azure, Azure Monitor, Bash, Datadog, PowerShell, Python, Terraform
Reposted 3 Days Ago
In-Office or Remote
8 Locations
Senior level
Artificial Intelligence • Cloud • Information Technology • Software
Design and operate large-scale GPU infrastructure for distributed AI training, ensuring reliability, performance, and efficient customer partnerships.
Top Skills: Ansible, CUDA, DeepSpeed, FSDP, GPU, Helm, InfiniBand, Kubernetes, Linux, Megatron, NCCL, NVIDIA A100, NVIDIA B200, NVIDIA H100, NVLink, PyTorch, RoCE, Terraform
4 Days Ago
In-Office
3 Locations
106K-156K Annually
Senior level
Fintech
Design, build, and maintain scalable, reliable application infrastructure. Automate deployments and configuration, implement observability and monitoring, troubleshoot performance, advise development teams on SDLC and microservice best practices, create runbooks, participate in 24x7 on-call rotation, and ensure security and disaster recovery readiness.
Top Skills: AWS, CI/CD, Docker, Git, Go, IP, Java, JavaScript, Kubernetes, Linux, Monitoring, Observability, Python, Ruby, Scripting Languages, Security Encryption Protocols, Swarm, TCP, UDP
4 Days Ago
Hybrid
San Francisco, CA, USA
196K-255K Annually
Senior level
Artificial Intelligence • Big Data • Software
Own and improve infrastructure for the Data Replication platform: Kubernetes, CI/CD, secrets, networking, cloud (AWS/GCP). Drive reliability, observability, AI-augmented tooling, canary rollouts, incident reduction, and runbooks, and partner with product engineers.
Top Skills: Agentic Frameworks, Airbyte, AWS, Cdks, CI/CD, Connector-Based Architectures, Datadog, GCP, Grafana, Helm, Java, Kubernetes, LLMs, Prometheus, Python, Secrets Management, Terraform
4 Days Ago
In-Office or Remote
2 Locations
121K-219K Annually
Senior level
Cloud • Security • Software • Cybersecurity
Lead reliability, automation, and observability for high-density AI hardware infrastructure. Build Python-based IaC tooling, telemetry pipelines, Prometheus/Grafana dashboards, and AI-assisted tooling. Run 24x7 incident response, coordinate vendors and field technicians, define operational readiness, and drive post-mortems to improve uptime and performance.
Top Skills: Bare-Metal, BGP, Grafana, IPv4, IPv6, LLMs, Loki, OpenTelemetry, PagerDuty, Private Cloud, Prometheus, Python, Slack, Timeseries Engines, Virtualized Environments
4 Days Ago
In-Office or Remote
2 Locations
121K-219K Annually
Senior level
Cloud • Security • Software • Cybersecurity
Design, build, and operate scalable infrastructure and CI/CD/IaC systems. Implement observability (monitoring, logging, alerting), automate reliability improvements, mentor engineers, collaborate on incident response, and participate in on-call rotations to maintain Akamai Cloud services.
Top Skills: Alerting, Ansible, Bash, Chef, CI/CD, GitHub Actions, GitLab CI/CD, Go, Infrastructure as Code, Jenkins, Logging, Monitoring, Puppet, Python, SaltStack, Telemetry, Terraform
4 Days Ago
Hybrid
San Francisco, CA, USA
Senior level
Artificial Intelligence • Software • Generative AI
Ensure reliability and performance of Plaud.ai's AI products at scale by designing and operating cloud-native systems, owning production reliability and incident response, building observability and automation, defining SLOs/SLIs, driving postmortems, and partnering with product and engineering teams to improve operational maturity.
Top Skills: AWS, Azure, GCP, Go, Java, Kubernetes, Python
4 Days Ago
Remote
USA
Senior level
Software
Drive SRE practices for VA enterprise healthcare platforms: automate infrastructure and CI/CD, define SLIs/SLOs, improve observability and reliability, support incident response, and ensure cloud-native, secure, compliant operations in AWS and containerized environments.
Top Skills: Ansible, AWS, Bash, CI/CD, CloudWatch, Docker, ECS, EKS, ELK, Go, Grafana, Infrastructure as Code, Kubernetes, Linux, OpenTelemetry, PowerShell, Prometheus, Python, Splunk, Terraform
Reposted 4 Days Ago
Remote
2 Locations
Senior level
Artificial Intelligence • Information Technology • Software • Database
As a Site Reliability Engineer, you will design, implement, and maintain scalable infrastructure, ensure system reliability, automate processes, and collaborate with engineering teams.
Top Skills: Docker, ELK Stack, Go, Grafana, Java, Kubernetes, Node.js, Prometheus, Pulumi, Python, Ruby, Terraform
Reposted 5 Days Ago
In-Office or Remote
7 Locations
Senior level
Software
The Senior Site Reliability Engineer will lead service onboarding, maintain SLAs/SLOs, design secure infrastructure, automate operational tasks, and respond to incidents while ensuring system reliability and performance.
Top Skills: AWS, CloudFormation, ELK Stack, Go, Grafana, Hadoop, Kubernetes, Python, Terraform
Reposted 5 Days Ago
In-Office
Chicago, IL, USA
106K-145K Annually
Senior level
Artificial Intelligence • Automotive • Internet of Things • Software
The Senior Site Reliability Engineer will manage system health, automate solutions, resolve incidents, and collaborate across teams to enhance performance and reliability.
Top Skills: APIs, ARM, Azure, Azure CLI, Bicep, Cloud Infrastructure, DevOps, Git, PowerShell, Terraform, Virtualization
6 Days Ago
In-Office
Hawthorne, CA, USA
165K-230K Annually
Senior level
Aerospace • Other
Build, operate, and scale mission-critical application platforms to accelerate vehicle software delivery. Manage infrastructure as code, improve observability, collaborate with developers, run on-call rotations, conduct blameless postmortems, and reduce performance bottlenecks to support Falcon, Starship, Dragon, and Starlink software lifecycles.
Top Skills: Ansible, Bazel, Buck, C#, C++, ClickHouse, Docker, JavaScript, Kubernetes, KVM, Linux, Make, MySQL, Postgres, Puppet, Python, QEMU, Terraform, vSphere
Reposted 6 Days Ago
Remote
USA
117K-181K Annually
Senior level
Other • Social Impact
As a Senior Site Reliability Engineer, you will design, develop, and maintain reliable infrastructure for Wikimedia's API services, ensuring performance and availability while driving reliability engineering practices and improving developer experience.
Top Skills: Ansible, ArgoCD, AWS, Azure, GCP, GitLab, Go, Kubernetes, OpenTelemetry, Prometheus, Python, Terraform
Reposted 6 Days Ago
Remote
USA
113K-176K Annually
Senior level
Other • Social Impact
The Senior Site Reliability Engineer is responsible for maintaining Wikimedia's infrastructure, improving reliability, automating processes, and collaborating with teams. The role involves troubleshooting, managing deployments, and leading incident responses while working remotely.
Top Skills: Ansible, Bash, Cassandra, Debian, Go, Grafana, HHVM, Kubernetes, MariaDB, Memcached, PHP, Prometheus, Puppet, Python, Redis, Ruby, Shell
7 Days Ago
Remote
US
110K-137K Annually
Senior level
Financial Services
Prototype, write, test, document, and deploy release automation across environments. Build and maintain pipelines, collaborate with engineers and product teams, troubleshoot issues, participate in on-call rotation, and improve software delivery, configuration, monitoring, and operations.
Top Skills: Ansible, Bash, Docker, GitLab, Jenkins, Kubernetes, MSSQL, Postgres, PowerShell, Python, Redis, TeamCity
7 Days Ago
In-Office
Raleigh, NC, USA
Senior level
Edtech
Maintain and improve site performance, uptime, and scalability. Build monitoring, alerting, runbooks, deployment tooling, and scalable architecture. Troubleshoot across the stack and partner with application teams to deliver reliable production systems.
Top Skills: AWS, Bash, C, C++, Docker, GCP, Java, Kubernetes, Perl, Python
7 Days Ago
Remote or Hybrid
Redmond, WA, USA
120K-150K Annually
Senior level
Healthtech • Software • Analytics • Business Intelligence
Lead and own reliability for critical backend and distributed systems: design, launch, on-call, incident leadership, SLO/SLI/error budget definition, automation to remove toil, observability improvement, resilience testing, mentoring, and cross-team reliability initiatives for production healthcare workflows.
Top Skills: AWS, Azure, Docker, GCP, GitHub Actions, Go, Grafana, Java, Kubernetes, OpenTelemetry, Prometheus, Python, Terraform, TypeScript
8 Days Ago
In-Office
Irvine, CA, USA
140K-180K Annually
Senior level
Hardware • Manufacturing
Lead implementation and operation of microservices on Kubernetes across multi-cloud environments. Build observability, run load/chaos tests, define SLOs/SLAs/SLIs, automate with scripts, ensure security/compliance, lead incident response, perform DR planning, mentor teammates, and participate in on-call rotation.
Top Skills: Application Security, AWS, Azure, Bash, Data Protection, GCP, GDPR, Go, HPA, Identity and Access Management (IAM), ISO 27001, Java, JVM, Kubernetes, Microservices, Network Security, Observability, OCI, PowerShell, Python, SOC 2
8 Days Ago
In-Office
El Segundo, CA, USA
153K-185K Annually
Senior level
Aerospace • Hardware • Software • Biotech • Pharmaceutical • Manufacturing
Lead the design, build, and operation of mission-critical infrastructure across cloud, on-prem, and spacecraft contexts. Implement IaC, CI/CD, observability, and scalable Kubernetes-based systems; respond to incidents, perform root cause analysis, optimize performance, and collaborate with software and hardware teams. Participate in on-call rotations and occasional travel.
Top Skills: Ansible, ArgoCD, Azure, Bash, CI/CD, containerd, Databases, Docker, Firewalls, GitOps, GPU Workloads, Grafana, HPC, InfluxDB, Kubernetes, Linux, PowerShell, Prometheus, Python, Salt, Slurm, Subnets, Terraform, VPC, VPNs
8 Days Ago
Remote
US
Senior level
Healthtech • Social Impact • Software
Own the operational lifecycle of cloud-native data infrastructure: design and automate reliable deployments, observability, incident response, SLIs/SLOs, autoscaling and IaC, and improve platform efficiency and data freshness across GKE and Cloud Run.
Top Skills: Bash, BigQuery, Cloud Build, Cloud Monitoring, Cloud Run, Datadog, Docker, GCP, GitHub Actions, GKE, Go, Grafana, Jira, Kubernetes, Prometheus, Pulumi, Python, Sentry, Slack, Snyk, SonarQube, Terraform
8 Days Ago
In-Office or Remote
15 Locations
100K-210K Annually
Senior level
Information Technology • Legal Tech • Analytics
Design, build, and operate highly available AWS systems. Write and maintain Terraform, improve observability (Grafana, Pingdom, Uptrends), run on-call incident response, define SLOs/SLIs, build CI/CD with Azure DevOps/GitHub, automate operational work, document in Confluence, and mentor engineers.
Top Skills: AWS, Azure DevOps, CI/CD, Confluence, Docker, Git, Grafana, Jira, Kubernetes, Linux, Pingdom, ServiceNow, Terraform, Uptrends
8 Days Ago
In-Office or Remote
9 Locations
105K-198K Annually
Senior level
Information Technology • Legal Tech • Analytics
Design, deploy, and maintain highly available Kubernetes clusters on AWS EKS; manage and optimize cloud infrastructure; develop IaC and automation; implement CI/CD (GitHub Actions); monitor multi-region systems, troubleshoot incidents, perform root cause analysis; document best practices; and mentor junior engineers.
Top Skills: AWS, AWS EKS, CI/CD, Containers, GitHub Actions, Infrastructure as Code, Kubernetes, New Relic, Python, RBAC