Top Site Reliability Engineer Jobs
Cloud
The Staff Site Reliability Engineer will lead the design of AWS solutions, manage incident responses, and mentor junior engineers, ensuring reliability and security in federal environments.
Top Skills:
AWS, Databricks, Go, Helm, Kubernetes, Redshift, Snowflake, Terraform
Cloud • Information Technology • Biotech
The Site Reliability Engineer will build and deploy Linux servers, research technologies, monitor system performance, and resolve technical incidents.
Top Skills:
Infrastructure-as-Code, Linux, Networking, Virtualization
Artificial Intelligence • Machine Learning • Robotics • Software • Transportation • Design • Manufacturing
The Staff Site Reliability Engineer will lead source control strategy, manage Git-based monorepo operations, improve developer productivity, and oversee migrations to GitHub Cloud.
Top Skills:
Bazel, Buck, Buildkite, Gerrit, GitHub Actions, GitHub Cloud, GitHub Enterprise, GitLab CI, Jenkins, Pulumi, Reviewable, Terraform
HR Tech • Information Technology
Looking for a Cloud SRE Engineer to ensure reliability and stability of cloud services, perform troubleshooting, and collaborate across teams. Bilingual in Mandarin preferred.
Top Skills:
AWS, Azure, CloudWatch, GCP, Go, Grafana, Kubernetes, Prometheus, Python, Shell
Artificial Intelligence • Machine Learning • Software • Analytics
The role involves end-to-end ownership of AWS infrastructure, managing Kubernetes platforms, and ensuring system reliability through observability and automation. Responsibilities include incident response and maintaining CI/CD systems.
Top Skills:
Argo CD, AWS, Datadog, Git, Go, Kubernetes, Python, Terraform
Aerospace • Other
The Site Reliability Engineer will manage and maintain mission-critical applications, improve software development processes, and provide end-user support, emphasizing safety and performance optimization.
Top Skills:
Ansible, Bazel, Buck, C#, C++, ClickHouse, Docker, JavaScript, Kubernetes, Linux, Make, MySQL, Postgres, Puppet, Python, Terraform
Software • Consulting
The Senior Application Support Engineer leads efforts to ensure application reliability, manages incidents, collaborates with teams, and monitors performance, providing 24/7 support.
Top Skills:
AppDynamics, AWS, Datadog, Linux, MuleSoft, OpenTelemetry, Python, ServiceNow, Splunk
Fintech • Payments • Financial Services
The Site Reliability Engineer will assist clients with Redline products, manage production environments, troubleshoot issues, and ensure automation and customer satisfaction.
Top Skills:
C/C++, Java, Linux, Python
Reposted 3 Days Ago
Financial Services
As a Principal Site Reliability Engineer, you'll lead a team focusing on observability and automating solutions for cloud and on-prem infrastructures, enhancing reliability and incident response across T. Rowe Price's tech ecosystem.
Top Skills:
.NET Core, Amazon AWS, Ansible, Elastic Stack, Go, Grafana, Java, MySQL, New Relic, Node.js, Postgres, Prometheus, Python, SolarWinds DPA, Splunk, SQL Server, Terraform, Vagrant, Vault
Reposted 3 Days Ago
Edtech
The Site Reliability Engineer enhances application deployment in AWS, monitors systems, improves automation, and collaborates with teams on security and performance.
Top Skills:
Ansible, AWS, CloudFormation, CSS, Docker, GitHub Actions, Go, HTML, Infrastructure as Code, Java, JavaScript, Jenkins, Kubernetes, Python, Terraform, TypeScript
Automotive • Information Technology • Logistics • Software
Lead Site Reliability Engineer implements IaC and automation, builds observability (SLIs/SLOs, dashboards, alerting), manages incident response, runbooks, gamedays, postmortems, and drives SRE/DevOps best practices, AppSec integration, testing, and CI/CD improvements across teams.
Top Skills:
AppSec, AWS, AWS CloudFormation, C#, CI/CD, CloudSploit, CloudWatch, Data Theorem, Datadog, Grafana, IaC, Infrastructure as Code, Java, New Relic, Python, Terraform, Veracode
Fintech • Financial Services
Lead SRE responsible for reliability, scalability, and performance of systems. Design automated deployments, build and govern monitoring/observability, define SLIs/KPIs, collaborate across teams to improve release and delivery processes, and participate in on-call incident response.
Top Skills:
Alerting, AWS, Azure, Bash, GCP, Logging, Messaging/Event Bus, Metrics, Monitoring, Observability, PowerShell, Python
Fintech
Lead SRE work partnering with development teams to design and implement availability, scalability, observability, and automation for production systems. Build tooling, manage incident response and RCAs, optimize capacity and performance, mentor engineers, maintain runbooks, and participate in a 24x7 on-call rotation.
Top Skills:
Aurora, AWS, Chef, CI/CD, Docker, DynamoDB, Git, Go, IP, Java, JavaScript, Jenkins, JMS, Kafka, Kubernetes, Linux, Maven, Memcached, Microservices, Observability, Oracle, Python, Redis, Ruby, SQS, Swarm, TCP, UDP
Software
Ensure availability, performance, and reliability of a federal cloud platform. Monitor platform health and SLOs, build observability (metrics, logging, alerting, dashboards), participate in on-call and incident response, run postmortems, automate operational toil, support capacity planning and performance tuning on AWS/EKS, implement infrastructure as code with Terraform, and collaborate with application teams and government partners.
Top Skills:
Amazon EKS, AWS, Kubernetes, Terraform
Financial Services
Design, build, and operate reliable cloud infrastructure and networking (multi-account AWS, VPC, IAM). Implement IaC, CI/CD pipelines, observability (logging/metrics/alerting), automation, and reliability guardrails. Provide production support and incident response, perform root cause analysis, and collaborate with application teams to co-own system design and continuous improvement, using AI-assisted tools where appropriate.
Top Skills:
.NET, AI-Assisted Tools (Claude Code, GitHub Copilot, Windsurf), AWS, AWS Organizations, Bash, CI/CD, CloudFormation, Elastic Stack, Git, IAM, Infrastructure as Code, Java, Jenkins, Node.js, Observability, OpenSearch, PowerShell, Python, Terraform, VPC
Security • Software • Cybersecurity
Hands-on Site Reliability Engineer responsible for building and maintaining cloud infrastructure, CI/CD pipelines, observability (logging/monitoring/tracing), automation, and security best practices. Manage datacenter resources, troubleshoot clusters and services, collaborate with engineering teams for deployments, and participate in on-call incident response to ensure high availability and performance.
Top Skills:
Ansible, Argo CD, Bash, Chef, Datadog, ELK, GitLab CI, Go, Grafana, Jenkins, Kubernetes, Linux, Prometheus, Python, Rancher
Fintech • Analytics
The Director of SRE leads a global SRE organization, driving operational excellence, incident management, automation, and reliability across financial systems while mentoring teams and improving collaboration with stakeholders.
Top Skills:
Agentic AI, Ansible, API Gateway, Aurora, AWS, C#.NET, CI/CD, CloudWatch, Datadog, DynamoDB, ECS, EKS, ELK, Git, Go, Grafana, Java, Lambda, Linux, OpenTelemetry, Postgres, Prometheus, Python, SQL Server, Sybase, Terraform, Unix
Reposted 9 Days Ago
Cloud • Software
Responsible for maintaining FedRAMP-compliant infrastructure, collaborating with software engineers, and ensuring system availability and security. Duties include infrastructure design, automation, monitoring, and incident response.
Top Skills:
AWS, Go, Kubernetes, Puppet, Python, Terraform
Security
Maintain and improve reliability, scalability, and performance of distributed systems. Build and manage infrastructure as code, support cloud and Kubernetes environments, implement observability and monitoring, participate in incident response and on-call rotations, and collaborate with cross-functional teams to drive operational excellence.
Top Skills:
Ansible, AWS, Bash, CI/CD, Dynatrace, GCP, Java, Kubernetes, Prometheus, Python, Terraform
Artificial Intelligence • Cloud • Information Technology • Software
The Site Reliability Engineer will provision and manage Kubernetes clusters, build automation tools, debug customer issues, and improve infrastructure reliability.
Top Skills:
Ansible, Bash, Datadog, Go, Grafana, Helm, Kubernetes, Loki, Prometheus, Python, Terraform
Automotive • eCommerce • Retail • Sales
Lead SRE enablement by defining SLI/SLO frameworks, production readiness, and reliability playbooks. Build and standardize observability (Dynatrace), provide alerting/dashboard/runbook templates, coach teams on SRE practices, run training, participate in incident post-mortems, and report enterprise reliability metrics while advising on architecture for hybrid GCP and on-prem environments.
Top Skills:
Ansible, APM, Dynatrace, Go, Google Cloud Platform (GCP), Java, Kubernetes, Observability, Python, Terraform
Edtech • Information Technology • Software
Lead infrastructure, reliability, and observability across multi-cloud environments. Improve CI/CD, IaC standards, staging parity, Kubernetes operations, monitoring and SLOs, incident response, and platform modernization while partnering with engineering teams.
Top Skills:
AI-Assisted Development Tools (Claude Code, Codex), Autoscaling, AWS, CI/CD Pipelines, Event-Driven Architectures, GCP, Incident Management, Infrastructure-as-Code, Kubernetes, Monitoring, Observability, Python, Queue-Based Architectures, Ruby on Rails, SLO Frameworks, Terraform
Information Technology • Software
Seeking an SRE/Network Engineer with deep MAAS and bare-metal automation expertise to manage hundreds of nodes across distributed sites. Responsibilities include Linux administration, hardware-level diagnostics (BIOS/IPMI/RAID), network design (VLANs/L2-L3/VPN/UniFi), infrastructure automation (Ansible, Bash/Python, Git), observability (Prometheus/Grafana, ELK/Graylog/Loki), PXE/MAAS-based OS provisioning, API integrations, virtualization (OpenStack/Kolla-Ansible, Proxmox, VMware), and container workload support.
Top Skills:
Ansible, Bash, BIOS, cloud-init, Cloudflare API, Debian, ELK, Git, GitOps, Grafana, Graylog, IPMI, Ironic, Kolla-Ansible, L2 Routing, L3 Routing, Linux, Loki, MAAS, OpenStack, Preseed, Prometheus, Proxmox VE, PXE, Python, RAID, Ubuntu, UniFi, VLAN, VMware ESXi, VPN
Artificial Intelligence • Information Technology • Consulting
Build and operate Nebius's network infrastructure: define SLIs/SLOs, improve site and inter-site reliability, lead incident response and postmortems, develop observability and alerting, automate change workflows, and collaborate with network and platform teams to embed operability.
Top Skills:
CI/CD, Container Platforms, Go, Infrastructure as Code, Linux, Python
Cloud • Security • Software • Cybersecurity
Design, develop, test, and operate scalable infrastructure and services for Akamai Cloud. Implement and manage Infrastructure-as-Code (Terraform and similar tools), CI/CD, and observability. Automate reliability improvements, mentor engineers, collaborate on incident response and root-cause remediation, and participate in on-call rotations.
Top Skills:
Ansible, Chef, CI/CD, Infrastructure as Code, Linux, Observability (Monitoring, Logging, Alerting), Puppet, SaltStack, Terraform