Top Site Reliability Engineer Jobs

10 Days Ago
In-Office
Birmingham, AL, USA
Mid level
Automotive • Hardware • Logistics
Builds and supports large-scale, distributed, fault-tolerant systems to improve reliability and automation. Administers networks and databases, monitors system health, configures load and data communications, coordinates equipment and vendor orders, and participates in change management to reduce incidents and support cloud transformations.
Top Skills: Cloud, Distributed Computing, Internet Security, Monitoring Tools, Oracle ERP, Unix, Version Control Systems, Windows 2000, Windows 98, Windows NT
10 Days Ago
In-Office
McLean, VA, USA
87K-198K Annually
Senior level
Information Technology
Design and build resilient infrastructure, implement monitoring and SLIs/SLOs, automate operations and self-healing, reduce toil with scripting, support enterprise-scale application reliability, act as subject matter expert for engineering teams, and meet government vetting and U.S. citizenship requirements.
Top Skills: AWS, CI/CD, Cloud-Native, CloudTrail, CloudWatch, Git, GitHub Actions, GitLab Runners, ITSI, Jenkins, Linux, Microservices, PaaS, PagerDuty, SaaS, Splunk, Unix
10 Days Ago
Hybrid
Rockville, MD, USA
112K-150K Annually
Mid level
Artificial Intelligence • Cloud • Software • Cybersecurity
Operate and tune AWS environments to meet SLAs, build observability and alerts, automate infrastructure with IaC and CI/CD, define SLIs/SLOs, support security/compliance within a FISMA Moderate boundary, design resilience and DR plans, and own incident response and post-mortems.
Top Skills: Ansible, AWS, AWS CloudWatch, AWS Trusted Advisor, CI/CD, CloudFormation, Docker, GitLab CI, Jenkins, New Relic, Python, Splunk, Terraform
Reposted 10 Days Ago
In-Office
3 Locations
Junior
Fintech
Build production-quality software to improve reliability, reduce operational toil, and scale systems. Own end-to-end features, participate in on-call rotations, analyze incidents, implement observability, and build automations using Node.js/TypeScript, Python, and AI-assisted tools.
Top Skills: AI-Assisted Development Tools, AWS, Azure, CI/CD, GitHub Copilot, JavaScript, Node.js, PowerShell, Python, SQL, TypeScript, VS Code
Reposted 10 Days Ago
Remote
6 Locations
Mid level
Information Technology • Software • Consulting
Join Solvd as an Infrastructure/SRE Engineer to design and manage cloud infrastructure, build CI/CD pipelines, automate deployments, and ensure system reliability through observability and performance tuning.
Top Skills: Argo CD, AWS, Azure, Bash, Datadog, Docker, Flux, GCP, GitHub Actions, GitLab CI, Go, Grafana, Jenkins, Kubernetes, Memcached, New Relic, OpenTofu, Postgres, Prometheus, Python, RDS, Redis, Terraform
Reposted 10 Days Ago
Remote
United States
Senior level
Digital Media • Social Media • Software • Sports
Lead the technical architecture and execution of migration to AWS, drive developer enablement, and automate infrastructure using code-first principles.
Top Skills: AWS EKS, Datadog, GitHub Actions, Go, Istio, k6, Kubernetes, Node.js, Terraform
Reposted 10 Days Ago
In-Office
Miami, FL, USA
Senior level
Healthtech
The Senior Software Engineer will enhance system reliability, manage Kubernetes and AWS environments, oversee incident responses, and implement observability measures.
Top Skills: AWS, CloudWatch, ELB, GitHub Actions, Kubernetes, Observability Tooling, Terraform, VPC
Reposted 10 Days Ago
In-Office
St. Louis, MO, USA
100K-120K Annually
Senior level
Fintech • Analytics
As a Senior Site Reliability Engineer, you'll lead incident recovery, enhance production stability, automate processes, and collaborate with development teams to improve operational efficiency.
Top Skills: AWS, Azure, BigPanda, Cloud-Native Applications, Datadog, DNS, Docker, Git, HTTP, Kubernetes, Shell Scripting, TCP/IP, Unix
Reposted 10 Days Ago
In-Office
St. Louis, MO, USA
Senior level
Fintech • Analytics
The Site Reliability Engineer will support and automate critical real-time applications, ensuring service availability and quality across cloud and on-premises deployments, while collaborating with various teams on operational documentation and incident management.
Top Skills: AWS, Azure, Datadog, Docker, Git, Kubernetes, Python, Unix/Linux
Reposted 10 Days Ago
In-Office
Seattle, WA, USA
Senior level
Cloud • Software • Database
The Site Reliability Engineer will optimize and scale managed services across cloud providers, automate infrastructure, enhance monitoring, and ensure system reliability.
Top Skills: AWS, Azure, Bash, GCP, Grafana, Kubernetes, Loki, Mimir, Prometheus, Python
Reposted 10 Days Ago
Hybrid
3 Locations
245K-270K Annually
Senior level
Information Technology • Consulting
As a Senior Staff Site Reliability Engineer, you will lead the SRE team, advocate best practices, ensure resilience in cloud architecture, and mentor team members.
Top Skills: Argo CD, CircleCI, Google Cloud Platform, Kubernetes, Pulumi, Terraform, TypeScript
Reposted 10 Days Ago
Remote
USA
156K-288K Annually
Mid level
Computer Vision • Machine Learning • Software
As a Site Reliability Engineer, ensure the reliability, performance, and scalability of Ditto's cloud infrastructure by developing observability solutions, leading incident management, and collaborating with product engineering teams.
Top Skills: AWS, Azure, C, Datadog, GCP, Go, Grafana, Helm, Java, Kubernetes, Prometheus, Rust, Terraform
Reposted 10 Days Ago
In-Office
San Francisco, CA, USA
Senior level
Artificial Intelligence • Software
As a Site Reliability Engineer at Mercor, you will ensure production reliability, develop the SRE function, and collaborate with engineering teams to maintain system performance.
Top Skills: AWS, Kubernetes, Spacelift, Terraform
Reposted 10 Days Ago
In-Office
San Francisco, CA, USA
Mid level
Enterprise Web • Information Technology • Software
As a Platform Engineer, you will enhance reliability and performance, design operational processes, and build monitoring systems while collaborating with a talented team.
Top Skills: AI, Assistants, Backend, Developer Tools, Frontend, Infrastructure, MCPs, Monitoring Systems, Skills
Reposted 10 Days Ago
In-Office
Wacker, IL, USA
132K-220K Annually
Expert/Leader
Financial Services
The Staff Site Reliability Engineer will lead Platform Engineering's SRE efforts by defining technical strategy, overseeing architecture, and enhancing operational excellence through mentorship and governance.
Top Skills: Argo CD, GCP, GKE, Go, Kafka, Node.js, Python, Terraform
Reposted 10 Days Ago
In-Office
Burlingame, CA, USA
170K-197K Annually
Mid level
Aerospace • Artificial Intelligence
The Site Reliability Engineer will architect and manage ground infrastructure for satellite systems, ensuring high availability, automating deployments, and optimizing data management systems.
Top Skills: Ansible, AWS, Azure, C++, CloudFormation, EKS, ELK, GCP, Grafana, Helm, Kubernetes, Prometheus, Python, Terraform
Reposted 10 Days Ago
In-Office
San Francisco, CA, USA
Mid level
Software
Join a passionate team to enhance reliability and performance of the AI control plane, manage deployments, and respond to production incidents while ensuring service quality for customers.
Top Skills: AI Control Plane, Developer Tools, Infrastructure
Reposted 10 Days Ago
In-Office
Houston, TX, USA
Mid level
Other • Energy
The Site Reliability Engineer will build and maintain reliable systems on Google Cloud Platform, automate operations, and improve system performance and reliability.
Top Skills: Airflow, BigQuery, Cloud Monitoring, Dataflow, Datastream, Docker, GitHub Actions, GitLab CI, Go, Google Cloud Platform, Grafana, IAM, Java, Kubernetes, Prometheus, Python, Terraform
Reposted 10 Days Ago
Hybrid
3 Locations
100K-115K Annually
Mid level
AdTech • Big Data • Marketing Tech • Software
Responsible for owning and optimizing the Internal Developer Platform, improving reliability, scalability, and usability while supporting engineering teams and standardizing operational processes through automation and best practices.
Top Skills: ARM, AWS, Azure, Bash, CloudFormation, Consul, Docker, GitHub Actions, HashiCorp, Jenkins, Kubernetes, Linux, Nomad, PowerShell, Python, Splunk, Sumo Logic, Terraform, Vault, Windows
Reposted 10 Days Ago
Hybrid
Atlanta, GA, USA
Mid level
Fintech • Payments • Financial Services
Build, operate, and scale AWS-based infrastructure using IaC (Terraform), manage EKS and serverless environments, create CI/CD pipelines, implement observability (OpenTelemetry/Prometheus/New Relic), support Postgres/RDS (Aurora), lead incident response, and define SRE practices (SLIs/SLOs/error budgets).
Top Skills: Aurora, AWS, AWS RDS, Azure, CloudFormation, ECS, EKS, GitHub Actions, GitLab, Go, GCP, Java, Kubernetes, New Relic, OpenTelemetry, OpenTofu, Postgres, Prometheus, Python, Ruby, Serverless, Terraform, Terragrunt
11 Days Ago
Remote or Hybrid
United States
150K-225K Annually
Senior level
Artificial Intelligence • Fintech • Machine Learning • Natural Language Processing • Business Intelligence
Lead architecture and implementation of reliability platforms and SRE practices for a production SaaS. Build self-service reliability tooling, drive AIOps automation, advance observability (monitoring, tracing, profiling), lead incident response and postmortems, mentor engineers, and embed production readiness across teams to achieve 99.99% uptime.
Top Skills: AWS, Azure, Continuous Profiling, Datadog, DNS, ELK, GCP, Go, Grafana, HTTP/S, Kubernetes, Load Balancing, OpenTelemetry, Prometheus, Python, TCP/IP
11 Days Ago
In-Office or Remote
Basel, KS, USA
125K-145K Annually
Mid level
Software
Operate and improve Accela's cloud-based SaaS platform to ensure availability, performance, security, and scalability. Build automation and tooling, monitor observability and SLOs, participate in incident response and RCA, support deployments and change management, and help maintain compliance for regulated environments.
Top Skills: Ansible, Argo CD, Bash, Claude Code, Flux, Git, GitHub Copilot, Kubernetes, Linux, Azure, OpenTelemetry, PowerShell, Python, Terraform
11 Days Ago
In-Office or Remote
3 Locations
176K-176K Annually
Entry level
AdTech • Beauty • Marketing Tech • Retail • Pharmaceutical
Lead incident response and root cause analysis, maintain platform reliability and performance, implement and improve observability solutions, collaborate with vendor teams, and contribute to continuous improvement of incident management and operational processes.
Top Skills: Databricks, Grafana, Prometheus, Spyglass
11 Days Ago
Remote
Illinois, USA
150K-224K Annually
Senior level
Legal Tech • Software
Lead Site Reliability Engineer responsible for platform availability and reliability of RelativityOne. Drive SRE best practices, build tools, lead projects, coach SREs, work with stakeholders, support incidents, run postmortems, and improve monitoring, automation, and operational efficiency.
Top Skills: CI/CD, DevOps, Jenkins, Jira, Kubernetes, Azure, Monitoring and Alerting, New Relic, NoSQL, PowerShell, Relativity Server, RelativityOne, SQL, Tableau
Reposted 16 Days Ago
Hybrid
Denver, CO, USA
110K-145K Annually
Senior level
Information Technology • Insurance • Software
Responsible for the reliability and performance of production services, managing SLIs and SLOs, and leading incident responses while collaborating with various teams.
Top Skills: .NET, AWS, C#, CI/CD, Java, Kubernetes, Linux, Python, React, Windows