Top Site Reliability Engineer Jobs
Big Data • Cloud • Digital Media • Machine Learning • Mobile • Software • Industrial
Lead reliability for production services in Autodesk GovCloud by building automation, SLO/SLI practices, observability, incident response, resilience testing, and runbooks. Deploy, operate, and improve cloud services while ensuring compliance (FedRAMP) and participating in 24x7 on-call rotations and cross-team collaboration.
Top Skills:
APIs, AWS, Aws Govcloud, Azure, Bash, Caching Technologies, Ci/Cd, Cloudwatch, Containers, Databases, Datadog, Deployment Automation, Distributed Systems, Dns, Dynatrace, Go, Infrastructure As Code, Java, Kubernetes, Load Balancing, Messaging Systems, Networking, Powershell, Python, Splunk, Storage Platforms
Legal Tech • Software
As a Senior Site Reliability Engineer, you will lead reliability initiatives, design and maintain systems, enhance CI/CD pipelines, and mentor junior engineers while ensuring system availability and performance.
Top Skills:
AWS, Bash, Cloudwatch, Ec2, Eks, Iam, Kubernetes, Lambda, Powershell, Python, S3
Healthtech
Design, scale, and operate secure AWS cloud infrastructure (EKS, IAM, RBAC); build and maintain IaC (Terraform/Terragrunt), GitHub Actions CI/CD, Datadog observability, and Python automation; document runbooks, participate in on-call rotations, postmortems, and Agile workflows to improve reliability and security.
Top Skills:
AWS, Datadog, Ec2, Eks, Fargate, Github Actions, Github Advanced Security, Helm, Iam, JIRA, Kubernetes, Lambda, Python, Rbac, Secrets Manager, Serverless, Terraform, Terragrunt, Vpc
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
Design, build, and operate global, multi-cloud HPC service platforms. Own IaC-driven provisioning, reliability, observability, capacity planning, incident response, and automation to ensure high uptime and QoS for internal customers.
Top Skills:
Aiops, AWS, Ci/Cd, Container Management, GCP, Go, Infrastructure As Code, Kubernetes, Log Collection, Lsf, Metrics, Monitoring, Observability, Oci, Perl, Python, Ruby, Slurm
Artificial Intelligence • Healthtech • Information Technology • Other • Analytics
The Senior Site Reliability Engineer will manage and optimize cloud infrastructure on AWS, design Kubernetes clusters, and automate workflows while mentoring junior members.
Top Skills:
AWS, Github Actions, Kubernetes, Newrelic, Python
Artificial Intelligence • Healthtech • Software • Automation
Design, build, and operate Optura's multi-cloud, HIPAA-aware platform: run Kubernetes across cloud and customer on-prem/air-gapped environments, create unified deployment tooling (Helm/operators/GitOps), own SLOs/capacity/incident response, drive reliability, implement identity/networking/security controls, and build IaC/GitOps patterns in partnership with product and security teams.
Top Skills:
Aks, Argo Cd, AWS, Azure, Backstage, Cluster Api, Crossplane, Distributed Tracing, Eks, GCP, Gitops, Gke, Go, Grafana, Helm, Kms, Kubernetes, Mtls, Oidc, Openshift, Opentelemetry, Operators, Prometheus, Pulumi, Python, Rancher, Replicated, Secrets Management, Service Mesh, Talos, Terraform, Vpc
Information Technology • Security • Cybersecurity
Lead design, build, and scale of Kubernetes-based, multi-tenant infrastructure and AI tooling. Own CI/CD, IaC, GitOps, and streaming analytics (Kafka/Flink/ClickHouse). Improve observability, SLOs, automated testing, progressive delivery, incident response, and mentor teams on reliability, security, and automation.
Top Skills:
Ai/Llm Tooling, Aks, Argo Cd, Bash, Clickhouse, Datadog, Eks, Flink, Github Actions, Gitlab Ci, Gitops, Gke, Go, Grafana, Helm, Jenkins, Kafka, Kubernetes, Mcp Servers, Opentelemetry, Prometheus, Pulumi, Python, Terraform
Fintech
Design, build, and maintain platform services and observability tooling to ensure reliable, scalable, and recoverable production systems. Define SLOs/SLIs, automate infrastructure-as-code, manage Kubernetes/Docker environments, perform incident management and root cause analysis, mentor engineers, and improve developer experience and reliability through automation and tooling.
Top Skills:
Amazon Rds, Ansible, AWS, Chef, Claude Code Plugins, Cloud-Native, Docker, Elasticsearch, Event-Driven Architecture, Excel, GCP, Gcp Cloudsql, Go, Java, Kotlin, Kubernetes, Kubernetes Operators, Lucene, Mcp Servers, Microsoft Outlook, Ms Sql Server, MySQL, Observability, Postgres, Puppet, Python, Redis, Solr, Spring Boot, Terraform, Word
Fintech
Design, build, and operate reliable, scalable production systems and SRE tooling. Define SLOs, implement observability, automate infrastructure-as-code, manage incident response and recoverability, mentor engineers, and maintain cloud-native services with capacity planning and failover strategies.
Top Skills:
Amazon Rds, Ansible, AWS, Chef, Docker, Elasticsearch, GCP, Gcp Cloudsql, Kubernetes, Linux, Lucene, Excel, Microsoft Outlook, Microsoft Word, Ms Sql Server, MySQL, Postgres, Puppet, Redis, Solr, Terraform, Unix
Semiconductor • Manufacturing
The role involves leading reliability initiatives, designing patterns for AI operations, managing SLOs, and mentoring junior engineers. You'll ensure platform resilience and optimize CI/CD pipelines for an AI-first intelligence platform.
Top Skills:
AWS, Bash, Cloudwatch, Datadog, Docker, Eks, Gitops, Java, Kubernetes, Lambda, Python, Spring Boot, Terraform
Artificial Intelligence • eCommerce
As a Senior Site Reliability Engineer, you'll ensure the reliability and scalability of production systems, define SLOs and SLIs, lead incident responses, and improve cloud infrastructure performance.
Top Skills:
AWS, Clickhouse, Postgres, Pulumi, Temporal, Turbopuffer
Security • Software • Cybersecurity
Design, build, and operate highly available, fault-tolerant cloud-native systems. Implement observability, automation, CI/CD, and IaC; respond to incidents, run RCAs, and drive reliability improvements.
Top Skills:
AWS, Azure, Bash, Ci/Cd, Datadog, GCP, Go, Grafana, Infrastructure As Code (Iac), Kubernetes, New Relic, Opentelemetry, Prometheus, Python, Splunk, Terraform
Information Technology • Consulting
Design, deploy, and maintain mission-critical containerized and virtualized workloads; build and operate CI/CD pipelines, monitoring, and configuration management; provision developer toolchains; reduce developer friction; lead incident response and root cause analysis; ensure scalability, availability, and government compliance in production environments.
Top Skills:
Ansible, Bash, Ci/Cd, Desired State Configuration, F5, Gitlab Ci/Cd, Kubernetes, Minio, Monitoring/Observability, Portworx, S3-Compatible Services, VMware
Other
The Senior Site Reliability Engineer at Juul Labs ensures operational stability and performance of hybrid cloud infrastructure, leads automation, and handles critical incidents.
Top Skills:
AWS, Bash, CloudFormation, GCP, Nutanix, Powershell, Python, Terraform
Cloud • Software
As a Site Reliability / Gitops Engineer, you will automate operations, develop Infrastructure as Code, maintain core services, and collaborate on service architecture.
Top Skills:
Ci/Cd, Cloud Computing, Elasticsearch, Grafana, Infrastructure As Code, Linux, Prometheus, Python
Aerospace • Other
Design, deploy, and automate on‑prem and cloud infrastructure; manage Kubernetes and core services (databases, monitoring, storage); collaborate with software engineers to build scalable, operable systems; own service lifecycle from design to deployment and refinement.
Top Skills:
Ansible, Bash, Bazel, Databases, Distributed Databases, Kubernetes, Linux, Makefiles, Monitoring, Python, Storage, Tcp/Ip, Terraform
Software
The Site Reliability Engineer will enhance reliability, observability, and incident response of You.com's production services, while collaborating with teams to implement best practices and improve operational efficiency through tooling and automation.
Top Skills:
AWS, Bash, Ci/Cd, Eks, Gha, Git, Grafana, Opentelemetry, Prometheus, Python, Terraform
Artificial Intelligence • Cloud • Social Impact • Software • Wearables
The Senior Site Reliability Engineer I will enhance Axon's observability platform, work on distributed tracing, log aggregation, and metrics infrastructure, and develop internal tools while collaborating with engineering teams.
Top Skills:
Argocd, Cdk, Cortex, Go, Grafana, Helm, Jaeger, Java, Loki, Opentelemetry, Prometheus, Python, Terraform
Cloud • Software
The Senior Site Reliability / Gitops Engineer will drive automation and collaboration within the IS team, enhancing Canonical's IT operations and services while managing infrastructure as code and cloud technologies.
Top Skills:
Cloud Computing, Docker, Elasticsearch, Gitops, Grafana, Iac, Kubernetes, Linux, Prometheus, Python
Artificial Intelligence • Information Technology • Software
The Senior SRE will manage multi-cloud infrastructure, ensuring reliability and scalability. Responsibilities include building CI/CD pipelines, defining SLOs, and implementing automation.
Top Skills:
Ai-Assisted Development, AWS, Azure, Claude Code, Datadog, GCP, Grafana, Kubernetes, Terraform
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
Operate and improve an AI Data Center AIOps platform: monitor health, own SLOs/SLIs, handle incidents, manage Kubernetes deployments, maintain IaC/CI-CD (Helm/Terraform), and produce runbooks/automation to ensure reliability and scalable telemetry processing.
Top Skills:
Bash, Ci/Cd, Clickhouse, Elastic/Elasticsearch, Flink, Grafana, Helm, Kafka, Kubernetes, Object Storage, Prometheus, Pulsar, Python, Spark, Terraform, Tsdbs
Aerospace • Other
Design, deploy, and automate infrastructure for on‑prem and cloud compute. Manage core services (databases, monitoring, storage), collaborate with software teams to build scalable, operable systems, and own the service lifecycle from design through deployment, operation, and refinement to ensure secure, reliable, and autonomous satellite software services.
Top Skills:
Ansible, Bash, Bazel, Cloud, Databases, Kubernetes, Linux, Makefiles, Monitoring, Python, Tcp/Ip, Terraform
Aerospace • Other
Design, deploy, and automate core infrastructure (on‑prem and cloud) and manage Kubernetes and Linux fleets. Collaborate with engineers to build scalable, operable services and improve lifecycle: testing, CI/CD, deployment, monitoring, and performance.
Top Skills:
Ansible, Bash, Bazel, Ci/Cd, Cloud, Databases, Distributed Databases, Kubernetes, Linux, Makefiles, Monitoring, Python, Storage, Tcp/Ip, Terraform
Aerospace • Other
Design, deploy, and automate on‑prem and cloud compute infrastructure; manage core infrastructure (databases, monitoring, storage); collaborate with software teams to build scalable, operable systems; improve service lifecycle from design through deployment, operation, and refinement.
Top Skills:
Ansible, Bash, Bazel, Databases, Kubernetes, Linux, Make, Makefiles, Monitoring, Python, Storage, Tcp/Ip, Terraform
Fintech • Software
As a Senior Site Reliability Engineer, you'll build and scale internal platform offerings, design monitoring systems, and collaborate with software engineers to ensure application performance and reliability.
Top Skills:
Ansible, AWS, CloudFormation, Datadog, Docker, Elk Stack, Grafana, Grpc, Java, Kubernetes, Postgres, Prometheus, Python, Terraform