Top Site Reliability Engineer Jobs
Software
Lead reliability for a large federal cloud platform: define SLOs, build observability, run incident response and postmortems, automate toil, design AWS/EKS infrastructure, mentor engineers, and present reliability designs to stakeholders.
Top Skills:
Amazon Eks, AWS, Aws Certification, Cka, Cks, Fedramp, Kubernetes, Nist 800-53, Observability (Metrics/Logging/Tracing/Alerting), Slos, Terraform, Zero-Trust
Payments
Senior SRE responsible for ensuring high availability and resiliency of a global payments platform by building observability, automations, AI-driven remediation, incident response, and self-healing workflows; participates in on-call rotation and hybrid Philadelphia-based work.
Top Skills:
Aiops, Aks, Anthropic (Claude), Apm, Azure, Azure Ai (Foundry), Azure Sre Agent, Ci/Cd, Datadog, Dns, Dynatrace, HTTP, Https, Iis, Kubernetes, Load Balancing, New Relic, Openai (Codex), Pagerduty Process Automation, Powershell, Python, Rundeck, SQL, T-Sql, Tcp/Ip, VMware, Windows Server
Big Data • Fintech • Mobile • Payments • Financial Services • Data Privacy
Senior SRE responsible for designing and maturing cloud reliability on GCP/Azure: build observability, define SLIs/SLOs, create Terraform modules and CI/CD automation, lead incident/root-cause investigations, partner with security/governance, mentor engineers, and drive platform resiliency and production readiness.
Top Skills:
Azure Log Analytics, Azure Resource Graph, Ci/Cd, Devsecops, Dns, Dynatrace, Firewalls, Genai, Google Cloud Platform (Gcp), Iam, Load Balancing, Azure, Policy-As-Code, Terraform, Terraform Enterprise, Vpc
Healthtech
As a Senior Site Reliability Engineer, you will ensure the reliability and performance of our Azure-based healthcare platform, implementing SRE practices, driving incident management, and automating operational tasks.
Top Skills:
Azure, Azure Monitor, Bash, Datadog, Powershell, Python, Terraform
Reposted 3 Days Ago
Artificial Intelligence • Cloud • Information Technology • Software
Design and operate large-scale GPU infrastructure for distributed AI training, ensuring reliability, performance, and efficient customer partnerships.
Top Skills:
Ansible, Cuda, Deepspeed, Fsdp, Gpu, Helm, Infiniband, Kubernetes, Linux, Megatron, Nccl, Nvidia A100, Nvidia B200, Nvidia H100, Nvlink, PyTorch, Roce, Terraform
Fintech
Design, build, and maintain scalable, reliable application infrastructure. Automate deployments and configuration, implement observability and monitoring, troubleshoot performance, advise development teams on SDLC and microservice best practices, create runbooks, participate in 24x7 on-call rotation, and ensure security and disaster recovery readiness.
Top Skills:
AWS, Ci/Cd, Docker, Git, Go, Ip, Java, JavaScript, Kubernetes, Linux, Monitoring, Observability, Python, Ruby, Scripting Languages, Security Encryption Protocols, Swarm, Tcp, Udp
Artificial Intelligence • Big Data • Software
Own and improve infrastructure for the Data Replication platform: Kubernetes, CI/CD, secrets, networking, cloud (AWS/GCP). Drive reliability, observability, AI-augmented tooling, canary rollouts, incident reduction, runbooks, and partner with product engineers.
Top Skills:
Agentic Frameworks, Airbyte, AWS, Cdks, Ci/Cd, Connector-Based Architectures, Datadog, GCP, Grafana, Helm, Java, Kubernetes, Llms, Prometheus, Python, Secrets Management, Terraform
Cloud • Security • Software • Cybersecurity
Lead reliability, automation, and observability for high-density AI hardware infrastructure. Build Python-based IaC tooling, telemetry pipelines, Prometheus/Grafana dashboards, and AI-assisted tooling. Run 24x7 incident response, coordinate vendors and field technicians, define operational readiness, and drive post-mortems to improve uptime and performance.
Top Skills:
Bare-Metal, Bgp, Grafana, Ipv4, Ipv6, Llms, Loki, Opentelemetry, Pagerduty, Private Cloud, Prometheus, Python, Slack, Timeseries Engines, Virtualized Environments
Cloud • Security • Software • Cybersecurity
Design, build, and operate scalable infrastructure and CI/CD/IaC systems. Implement observability (monitoring, logging, alerting), automate reliability improvements, mentor engineers, collaborate on incident response, and participate in on-call rotations to maintain Akamai Cloud services.
Top Skills:
Alerting, Ansible, Bash, Chef, Ci/Cd, Github Actions, Gitlab Ci/Cd, Go, Infrastructure As Code, Jenkins, Logging, Monitoring, Puppet, Python, Saltstack, Telemetry, Terraform
Artificial Intelligence • Software • Generative AI
Ensure reliability and performance of Plaud.ai's AI products at scale by designing and operating cloud-native systems, owning production reliability and incident response, building observability and automation, defining SLOs/SLIs, driving postmortems, and partnering with product and engineering teams to improve operational maturity.
Top Skills:
AWS, Azure, GCP, Go, Java, Kubernetes, Python
Software
Drive SRE practices for VA enterprise healthcare platforms: automate infrastructure and CI/CD, define SLIs/SLOs, improve observability and reliability, support incident response, and ensure cloud-native, secure, compliant operations in AWS and containerized environments.
Top Skills:
Ansible, AWS, Bash, Ci/Cd, Cloudwatch, Docker, Ecs, Eks, Elk, Go, Grafana, Infrastructure As Code, Kubernetes, Linux, Opentelemetry, Powershell, Prometheus, Python, Splunk, Terraform
Artificial Intelligence • Information Technology • Software • Database
As a Site Reliability Engineer, you will design, implement, and maintain scalable infrastructure, ensure system reliability, automate processes, and collaborate with engineering teams.
Top Skills:
Docker, Elk Stack, Go, Grafana, Java, Kubernetes, Node.js, Prometheus, Pulumi, Python, Ruby, Terraform
Software
The Senior Site Reliability Engineer will lead service onboarding, maintain SLAs/SLOs, design secure infrastructure, automate operational tasks, and respond to incidents while ensuring system reliability and performance.
Top Skills:
AWS, CloudFormation, Elk Stack, Go, Grafana, Hadoop, Kubernetes, Python, Terraform
Artificial Intelligence • Automotive • Internet of Things • Software
The Senior Site Reliability Engineer will manage system health, automate solutions, resolve incidents, and collaborate across teams to enhance performance and reliability.
Top Skills:
APIs, Arm, Azure, Azure Cli, Bicep, Cloud Infrastructure, DevOps, Git, Powershell, Terraform, Virtualization
Aerospace • Other
Build, operate, and scale mission-critical application platforms to accelerate vehicle software delivery. Manage infrastructure as code, improve observability, collaborate with developers, run on-call rotations, conduct blameless postmortems, and reduce performance bottlenecks to support Falcon, Starship, Dragon, and Starlink software lifecycles.
Top Skills:
Ansible, Bazel, Buck, C#, C++, Clickhouse, Docker, JavaScript, Kubernetes, Kvm, Linux, Make, MySQL, Postgres, Puppet, Python, Qemu, Terraform, Vsphere
Reposted 6 Days Ago
Other • Social Impact
As a Senior Site Reliability Engineer, you will design, develop, and maintain reliable infrastructure for Wikimedia's API services, ensuring performance and availability while driving reliability engineering practices and improving developer experience.
Top Skills:
Ansible, Argocd, AWS, Azure, GCP, Gitlab, Go, Kubernetes, Opentelemetry, Prometheus, Python, Terraform
Other • Social Impact
The Senior Site Reliability Engineer is responsible for maintaining Wikimedia's infrastructure, improving reliability, automating processes, and collaborating with teams. The role involves troubleshooting, managing deployments, and leading incident responses while working remotely.
Top Skills:
Ansible, Bash, Cassandra, Debian, Go, Grafana, Hhvm, Kubernetes, Mariadb, Memcached, PHP, Prometheus, Puppet, Python, Redis, Ruby, Shell
Financial Services
Prototype, write, test, document, and deploy release automation across environments. Build and maintain pipelines, collaborate with engineers and product teams, troubleshoot issues, participate in on-call rotation, and improve software delivery, configuration, monitoring, and operations.
Top Skills:
Ansible, Bash, Docker, Gitlab, Jenkins, Kubernetes, Mssql, Postgres, Powershell, Python, Redis, Teamcity
Edtech
Maintain and improve site performance, uptime, and scalability. Build monitoring, alerting, runbooks, deployment tooling, and scalable architecture. Troubleshoot across the stack and partner with application teams to deliver reliable production systems.
Top Skills:
AWS, Bash, C, C++, Docker, GCP, Java, Kubernetes, Perl, Python
Healthtech • Software • Analytics • Business Intelligence
Lead and own reliability for critical backend and distributed systems: design, launch, on-call, incident leadership, SLO/SLI/error budget definition, automation to remove toil, observability improvement, resilience testing, mentoring, and cross-team reliability initiatives for production healthcare workflows.
Top Skills:
AWS, Azure, Docker, GCP, Github Actions, Go, Grafana, Java, Kubernetes, Opentelemetry, Prometheus, Python, Terraform, Typescript
Hardware • Manufacturing
Lead implementation and operation of microservices on Kubernetes across multi-cloud environments. Build observability, run load/chaos tests, define SLOs/SLAs/SLIs, automate with scripts, ensure security/compliance, lead incident response, perform DR planning, mentor teammates, and participate in on-call rotation.
Top Skills:
Application Security, AWS, Azure, Bash, Data Protection, GCP, Gdpr, Go, Hpa, Identity And Access Management (Iam), Iso27001, Java, Jvm, Kubernetes, Microservices, Network Security, Observability, Oci, Powershell, Python, Soc2
Aerospace • Hardware • Software • Biotech • Pharmaceutical • Manufacturing
Lead the design, build, and operation of mission-critical infrastructure across cloud, on-prem, and spacecraft contexts. Implement IaC, CI/CD, observability, and scalable Kubernetes-based systems; respond to incidents, perform root cause analysis, optimize performance, and collaborate with software and hardware teams. Participate in on-call rotations and occasional travel.
Top Skills:
Ansible, Argocd, Azure, Bash, Ci/Cd, Containerd, Databases, Docker, Firewalls, Gitops, Gpu Workloads, Grafana, Hpc, Influxdb, Kubernetes, Linux, Powershell, Prometheus, Python, Salt, Slurm, Subnets, Terraform, Vpc, Vpns
Healthtech • Social Impact • Software
Own the operational lifecycle of cloud-native data infrastructure: design and automate reliable deployments, observability, incident response, SLIs/SLOs, autoscaling and IaC, and improve platform efficiency and data freshness across GKE and Cloud Run.
Top Skills:
Bash, BigQuery, Cloud Build, Cloud Monitoring, Cloud Run, Datadog, Docker, GCP, Github Actions, Gke, Go, Grafana, JIRA, Kubernetes, Prometheus, Pulumi, Python, Sentry, Slack, Snyk, Sonarqube, Terraform
Information Technology • Legal Tech • Analytics
Design, build, and operate highly available AWS systems. Write and maintain Terraform, improve observability (Grafana, Pingdom, Uptrends), run on-call incident response, define SLOs/SLIs, build CI/CD with Azure DevOps/GitHub, automate operational work, document in Confluence, and mentor engineers.
Top Skills:
AWS, Azure Devops, Ci/Cd, Confluence, Docker, Git, Grafana, JIRA, Kubernetes, Linux, Pingdom, Servicenow, Terraform, Uptrends
Information Technology • Legal Tech • Analytics
Design, deploy, and maintain highly available Kubernetes clusters on AWS EKS; manage and optimize cloud infrastructure; develop IaC and automation; implement CI/CD (GitHub Actions); monitor multi-region systems, troubleshoot incidents, perform root cause analysis; document best practices; and mentor junior engineers.
Top Skills:
AWS, Aws Eks, Ci/Cd, Containers, Github Actions, Infrastructure As Code, Kubernetes, Newrelic, Python, Rbac