Top Site Reliability Engineer Jobs

Reposted 3 Days Ago
In-Office
Washington, DC, USA
188K-259K Annually
Senior level
Cloud
The Staff Site Reliability Engineer will lead the design of AWS solutions, manage incident responses, and mentor junior engineers, ensuring reliability and security in federal environments.
Top Skills: AWS, Databricks, Go, Helm, Kubernetes, Redshift, Snowflake, Terraform
Reposted 3 Days Ago
In-Office
Cambridge, MA, USA
Mid level
Cloud • Information Technology • Biotech
The Site Reliability Engineer will build and deploy Linux servers, research technologies, monitor system performance, and resolve technical incidents.
Top Skills: Infrastructure-As-Code, Linux, Networking, Virtualization
Reposted 3 Days Ago
Hybrid
Foster City, CA, USA
250K-300K Annually
Senior level
Artificial Intelligence • Machine Learning • Robotics • Software • Transportation • Design • Manufacturing
The Staff Site Reliability Engineer will lead source control strategy, manage Git-based monorepo operations, improve developer productivity, and oversee migrations to GitHub Cloud.
Top Skills: Bazel, Buck, Buildkite, Gerrit, GitHub Actions, GitHub Cloud, GitHub Enterprise, GitLab CI, Jenkins, Pulumi, Reviewable, Terraform
Reposted 3 Days Ago
In-Office
Palo Alto, CA, USA
70-100 Hourly
Entry level
HR Tech • Information Technology
Looking for a Cloud SRE Engineer to ensure reliability and stability of cloud services, perform troubleshooting, and collaborate across teams. Bilingual in Mandarin preferred.
Top Skills: AWS, Azure, CloudWatch, GCP, Go, Grafana, Kubernetes, Prometheus, Python, Shell
Reposted 3 Days Ago
Remote or Hybrid
4 Locations
160K-180K Annually
Senior level
Artificial Intelligence • Machine Learning • Software • Analytics
The role involves end-to-end ownership of AWS infrastructure, managing Kubernetes platforms, and ensuring system reliability through observability and automation. Responsibilities include incident response and maintaining CI/CD systems.
Top Skills: Argo CD, AWS, Datadog, Git, Go, Kubernetes, Python, Terraform
Reposted 3 Days Ago
In-Office
Hawthorne, CA, USA
125K-175K Annually
Mid level
Aerospace • Other
The Site Reliability Engineer will manage and maintain mission-critical applications, improve software development processes, and provide end-user support, emphasizing safety and performance optimization.
Top Skills: Ansible, Bazel, Buck, C#, C++, ClickHouse, Docker, JavaScript, Kubernetes, Linux, Make, MySQL, Postgres, Puppet, Python, Terraform
Reposted 3 Days Ago
Remote
United States
Mid level
Software • Consulting
The Senior Application Support Engineer leads efforts to ensure application reliability, manages incidents, collaborates with teams, and monitors performance, providing 24/7 support.
Top Skills: AppDynamics, AWS, Datadog, Linux, MuleSoft, OpenTelemetry, Python, ServiceNow, Splunk
Reposted 3 Days Ago
In-Office
New York, NY, USA
115K-125K Annually
Mid level
Fintech • Payments • Financial Services
The Site Reliability Engineer will assist clients with Redline products, manage production environments, troubleshoot issues, and ensure automation and customer satisfaction.
Top Skills: C/C++, Java, Linux, Python
Reposted 3 Days Ago
In-Office
Owings Mills, MD, USA
159K-339K Annually
Senior level
Financial Services
As a Principal Site Reliability Engineer, you'll lead a team focusing on observability and automating solutions for cloud and on-prem infrastructures, enhancing reliability and incident response across T. Rowe Price's tech ecosystem.
Top Skills: .NET Core, Amazon AWS, Ansible, Elastic Stack, Go, Grafana, Java, MySQL, New Relic, Node.js, Postgres, Prometheus, Python, SolarWinds DPA, Splunk, SQL Server, Terraform, Vagrant, Vault
Reposted 3 Days Ago
In-Office or Remote
Franklin, TN, USA
Mid level
Edtech
The Site Reliability Engineer enhances application deployment in AWS, monitors systems, improves automation, and collaborates with teams on security and performance.
Top Skills: Ansible, AWS, CloudFormation, CSS, Docker, GitHub Actions, Go, HTML, Infrastructure As Code, Java, JavaScript, Jenkins, Kubernetes, Python, Terraform, TypeScript
4 Days Ago
In-Office
Austin, TX, USA
167K-204K Annually
Senior level
Automotive • Information Technology • Logistics • Software
Lead Site Reliability Engineer implements IaC and automation, builds observability (SLIs/SLOs, dashboards, alerting), manages incident response, runbooks, gamedays, postmortems, and drives SRE/DevOps best practices, AppSec integration, testing, and CI/CD improvements across teams.
Top Skills: AppSec, AWS, AWS CloudFormation, C#, CI/CD, CloudSploit, CloudWatch, Data Theorem, Datadog, Grafana, IaC, Infrastructure As Code, Java, New Relic, Python, Terraform, Veracode
4 Days Ago
In-Office
Los Angeles, CA, USA
140K-199K Annually
Senior level
Fintech • Financial Services
Lead SRE responsible for reliability, scalability, and performance of systems. Design automated deployments, build and govern monitoring/observability, define SLIs/KPIs, collaborate across teams to improve release and delivery processes, and participate in on-call incident response.
Top Skills: Alerting, AWS, Azure, Bash, GCP, Logging, Messaging/Event Bus, Metrics, Monitoring, Observability, PowerShell, Python
4 Days Ago
In-Office
3 Locations
116K-174K Annually
Senior level
Fintech
Lead SRE work partnering with development teams to design and implement availability, scalability, observability, and automation for production systems. Build tooling, manage incident response and RCAs, optimize capacity and performance, mentor engineers, maintain runbooks, and participate in a 24x7 on-call rotation.
Top Skills: Aurora, AWS, Chef, CI/CD, Docker, DynamoDB, Git, Go, IP, Java, JavaScript, Jenkins, JMS, Kafka, Kubernetes, Linux, Maven, Memcached, Microservices, Observability, Oracle, Python, Redis, Ruby, SQS, Swarm, TCP, UDP
4 Days Ago
In-Office
McLean, VA, USA
125K-135K Annually
Senior level
Software
Ensure availability, performance, and reliability of a federal cloud platform. Monitor platform health and SLOs, build observability (metrics, logging, alerting, dashboards), participate in on-call and incident response, run postmortems, automate operational toil, support capacity planning and performance tuning on AWS/EKS, implement infrastructure as code with Terraform, and collaborate with application teams and government partners.
Top Skills: Amazon EKS, AWS, Kubernetes, Terraform
4 Days Ago
In-Office
2 Locations
140K-170K Annually
Senior level
Financial Services
Design, build, and operate reliable cloud infrastructure and networking (multi-account AWS, VPC, IAM). Implement IaC, CI/CD pipelines, observability (logging/metrics/alerting), automation, and reliability guardrails. Provide production support and incident response, perform root cause analysis, and collaborate with application teams to co-own system design and continuous improvement, using AI-assisted tools where appropriate.
Top Skills: .NET, AI-Assisted Tools (Claude Code, GitHub Copilot, Windsurf), AWS, AWS Organizations, Bash, CI/CD, CloudFormation, Elastic Stack, Git, IAM, Infrastructure As Code, Java, Jenkins, Node.js, Observability, OpenSearch, PowerShell, Python, Terraform, VPC
4 Days Ago
In-Office
Sunnyvale, CA, USA
170K-200K Annually
Senior level
Security • Software • Cybersecurity
Hands-on Site Reliability Engineer responsible for building and maintaining cloud infrastructure, CI/CD pipelines, observability (logging/monitoring/tracing), automation, and security best practices. Manage datacenter resources, troubleshoot clusters and services, collaborate with engineering teams for deployments, and participate in on-call incident response to ensure high availability and performance.
Top Skills: Ansible, Argo CD, Bash, Chef, Datadog, ELK, GitLab CI, Go, Grafana, Jenkins, Kubernetes, Linux, Prometheus, Python, Rancher
Reposted 4 Days Ago
In-Office
2 Locations
Expert/Leader
Fintech • Analytics
The Director of SRE leads a global SRE organization, driving operational excellence, incident management, automation, and reliability across financial systems while mentoring teams and improving collaboration with stakeholders.
Top Skills: Agentic AI, Ansible, API Gateway, Aurora, AWS, C#.NET, CI/CD, CloudWatch, Datadog, DynamoDB, ECS, EKS, ELK, Git, Go, Grafana, Java, Lambda, Linux, OpenTelemetry, Postgres, Prometheus, Python, SQL Server, Sybase, Terraform, Unix
Reposted 9 Days Ago
Hybrid
3 Locations
147K-278K Annually
Senior level
Cloud • Software
Responsible for maintaining FedRAMP-compliant infrastructure, collaborating with software engineers, and ensuring system availability and security. Duties include infrastructure design, automation, monitoring, and incident response.
Top Skills: AWS, Go, Kubernetes, Puppet, Python, Terraform
4 Days Ago
In-Office
Blue Bell, PA, USA
Mid level
Security
Maintain and improve reliability, scalability, and performance of distributed systems. Build and manage infrastructure as code, support cloud and Kubernetes environments, implement observability and monitoring, participate in incident response and on-call rotations, and collaborate with cross-functional teams to drive operational excellence.
Top Skills: Ansible, AWS, Bash, CI/CD, Dynatrace, GCP, Java, Kubernetes, Prometheus, Python, Terraform
Reposted 4 Days Ago
In-Office or Remote
8 Locations
Senior level
Artificial Intelligence • Cloud • Information Technology • Software
The Site Reliability Engineer will provision and manage Kubernetes clusters, build automation tools, debug customer issues, and improve infrastructure reliability.
Top Skills: Ansible, Bash, Datadog, Go, Grafana, Helm, Kubernetes, Loki, Prometheus, Python, Terraform
5 Days Ago
In-Office
Memphis, TN, USA
Senior level
Automotive • eCommerce • Retail • Sales
Lead SRE enablement by defining SLI/SLO frameworks, production readiness, and reliability playbooks. Build and standardize observability (Dynatrace), provide alerting/dashboard/runbook templates, coach teams on SRE practices, run training, participate in incident post-mortems, and report enterprise reliability metrics while advising on architecture for hybrid GCP and on-prem environments.
Top Skills: Ansible, APM, Dynatrace, Go, Google Cloud Platform (GCP), Java, Kubernetes, Observability, Python, Terraform
5 Days Ago
In-Office or Remote
The Center, IN, USA
180K-250K Annually
Expert/Leader
Edtech • Information Technology • Software
Lead infrastructure, reliability, and observability across multi-cloud environments. Improve CI/CD, IaC standards, staging parity, Kubernetes operations, monitoring and SLOs, incident response, and platform modernization while partnering with engineering teams.
Top Skills: AI-Assisted Development Tools (Claude Code, Codex), Autoscaling, AWS, CI/CD Pipelines, Event-Driven Architectures, GCP, Incident Management, Infrastructure-As-Code, Kubernetes, Monitoring, Observability, Python, Queue-Based Architectures, Ruby On Rails, SLO Frameworks, Terraform
5 Days Ago
Remote or Hybrid
Chicago, IL, USA
Senior level
Information Technology • Software
Seeking an SRE/Network Engineer with deep MAAS and bare-metal automation expertise to manage hundreds of nodes across distributed sites. Responsibilities include Linux administration, hardware-level diagnostics (BIOS/IPMI/RAID), network design (VLANs/L2-L3/VPN/UniFi), infrastructure automation (Ansible, Bash/Python, Git), observability (Prometheus/Grafana, ELK/Graylog/Loki), PXE/MAAS-based OS provisioning, API integrations, virtualization (OpenStack/Kolla-Ansible, Proxmox, VMware), and container workload support.
Top Skills: Ansible, Bash, BIOS, Cloud-Init, Cloudflare API, Debian, ELK, Git, GitOps, Grafana, Graylog, IPMI, Ironic, Kolla-Ansible, L2 Routing, L3 Routing, Linux, Loki, MAAS, OpenStack, Preseed, Prometheus, Proxmox VE, PXE, Python, RAID, Ubuntu, UniFi, VLAN, VMware ESXi, VPN
5 Days Ago
Remote
United States
180K-224K Annually
Senior level
Artificial Intelligence • Information Technology • Consulting
Build and operate Nebius's network infrastructure: define SLIs/SLOs, improve site and inter-site reliability, lead incident response and postmortems, develop observability and alerting, automate change workflows, and collaborate with network and platform teams to embed operability.
Top Skills: CI/CD, Container Platforms, Go, Infrastructure As Code, Linux, Python
5 Days Ago
In-Office or Remote
2 Locations
76K-136K Annually
Mid level
Cloud • Security • Software • Cybersecurity
Design, develop, test, and operate scalable infrastructure and services for Akamai Cloud. Implement and manage Infrastructure-as-Code (Terraform and similar tools), CI/CD, and observability. Automate reliability improvements, mentor engineers, collaborate on incident response and root-cause remediation, and participate in on-call rotations.
Top Skills: Ansible, Chef, CI/CD, Infrastructure As Code, Linux, Observability (Monitoring, Logging, Alerting), Puppet, SaltStack, Terraform