Top Senior Site Reliability Engineer Jobs
AdTech • Marketing Tech • Analytics
As a Staff Software Engineer - SRE, you'll manage cloud infrastructure, improve application reliability, collaborate across teams, and support back-office systems.
Top Skills:
AWS, Datadog, Docker, Kafka, Kibana, Kubernetes, Linux, Postgres, Python, RDS, Redshift, Shell/Bash, Spark, Terraform
AdTech • Marketing Tech • Analytics
Manage and support customer applications, improve system reliability, collaborate with teams on infrastructure needs, and help drive architectural decisions.
Top Skills:
Auto Scaling, AWS, CDNs, Datadog, DNS, Docker, Kafka, Kibana, Kubernetes, Linux, Load Balancers, Postgres, Proxy Servers, Python, RDS, Redshift, Shell/Bash, Spark, Terraform, WAFs
AdTech • Marketing Tech • Analytics
The Staff SRE DevOps Engineer will manage customer applications, improve system reliability, collaborate on architecture discussions, and support infrastructure needs across teams.
Top Skills:
AWS, Bash, Datadog, Docker, Kafka, Kibana, Kubernetes, Linux, Postgres, Python, Redshift, Spark, Terraform
Information Technology
Lead technical strategy for observability, operational intelligence, and reliability. Architect telemetry and automation platforms, drive AIOps and large-scale IaC, lead incident response, mentor senior engineers, and standardize SLO/SLI and reliability practices across AWS cloud-native environments.
Top Skills:
ALB, AWS, VPC, Bash, CloudFormation, Datadog, DNS, DynamoDB, EC2, ECS, EKS, GitOps, Go, Grafana, IAM, KMS, Kubernetes, Linux, Multi-Account Architectures, New Relic, NLB, OpenTelemetry, Policy-as-Code, Prometheus, Python, RDS, Route 53, S3, TCP/IP, Terraform, TLS
Artificial Intelligence • Big Data • Computer Vision • Machine Learning • Natural Language Processing • Software • Cybersecurity
Maintain and improve the internal developer platform, observability stack, and AWS infrastructure (Terraform); manage Kubernetes at scale; troubleshoot distributed systems; drive security, reliability, cost and performance improvements; partner with product teams and participate in on-call support.
Top Skills:
AWS, CKA, Containers, Go, Kubernetes, LGTM Stack, Linux, OpenSearch, Python, Serverless, TCP/IP, Terraform
Cloud • Security • Software • Cybersecurity
Design and maintain reliable infrastructure solutions for a cloud data protection platform. Ensure application scalability and support through CI/CD and monitoring tools while collaborating in a global team.
Top Skills:
AppInsights, AWS CloudFormation, Azure API Management, Azure ARM Templates, Azure Cosmos DB, Azure DevOps, Azure Entra ID, Azure Functions, Azure Monitor, Azure Storage Services, Bash, Bitbucket, Elastic Stack, Git, Go, Microsoft TFS, PowerShell, Python, Serverless Framework, Terraform
Healthtech • Database
Responsible for reliability engineering, monitoring system performance, automating processes, and collaborating with development teams to enhance operational efficiency.
Top Skills:
AWS, Azure, Bash, CI/CD, CloudFormation, Docker, Dynatrace, GCP, Go, JMeter, Kubernetes, NeoLoad, Python, Splunk, Terraform
Payments • Software • Automation
Lead platform and infrastructure direction on AWS, evolve CI/CD and ephemeral environments, set observability and SLO standards, drive incident response and postmortems, mentor engineers, and build automation to reduce operational risk.
Top Skills:
AWS, CI/CD, Distributed Systems, ECS, Ephemeral Environments/Preview Deploys, Fargate, GitHub Actions, Logs, Metrics, Observability, SLOs/SLIs/Error Budgets, Tracing
Artificial Intelligence • Information Technology • Software
Lead end-to-end platform reliability: define SLIs/SLOs, harden production architecture, ensure Kubernetes runtime and queue safety, run incident command for Sev1/Sev2, own observability/on-call/runbooks, and gate risky releases while delivering a prioritized reliability roadmap.
Top Skills:
BullMQ, Koa, Kubernetes, Node.js, PostGraphile, Postgres, React, Redis, TypeScript
Aerospace • Manufacturing
As a Site Reliability Engineer, you'll build and manage observability platforms for satellite communications, define SLOs/SLIs, and collaborate on incident response and deployment automation.
Top Skills:
Argo CD, AWS, ELK, GCP, Go, Grafana, Istio, Jaeger, Kubernetes, Linkerd, Loki, OpenTelemetry, Prometheus, Python, Tempo, Terraform
Aerospace • Defense • Manufacturing
As Lead Site Reliability Engineer, you'll ensure reliability and performance of AI infrastructure, manage deployments, and mentor junior engineers.
Top Skills:
Ansible, BMC, CI/CD, CUDA, iDRAC, IPMI, Kubernetes, Linux, NVIDIA GPUs, OpenShift, Terraform
Information Technology • Software • Big Data Analytics
The Site Reliability Engineer will design, analyze, and troubleshoot large-scale distributed systems, focusing on operating systems and performance tuning.
Top Skills:
Apache, Java
Hardware • Quantum Computing
Maintain and integrate hardware and software systems for quantum controls, manage lab and test infrastructure (HIL, K8s, networking, rack servers), automate provisioning and CI/CD, implement monitoring/alerting and observability, support incident response and root-cause analysis, and define operational procedures to ensure reliability across development and production environments.
Top Skills:
Ansible, Bash, Debian, DHCP, DNS, Docker, ELK Stack, Git, GitLab CI, Go, Grafana, Hardware-in-the-Loop (HIL), Jenkins, Kubernetes, LAN, Prometheus, Python, Rack Mount Servers, Red Hat, Routers, Switches, TCP/IP, Terraform, Ubuntu, VLAN, WAN, Windows
Fintech
The Site Reliability Engineer will manage and optimize Kubernetes clusters and cloud infrastructure, focusing on reliability, monitoring, and automation processes.
Top Skills:
AWS, Azure, C/C++, CloudFormation, Docker, GCP, Helm, Java, JavaScript, Kubernetes, Linux, Postgres, Python, Ruby, Terraform
Big Data • Cloud • Healthtech • Software • Big Data Analytics
The Senior Site Reliability Engineer will ensure the reliability and scalability of enterprise applications, lead incident management, develop automation tools, mentor team members, and collaborate with cross-functional teams.
Top Skills:
Ansible, AWS, Bash, Docker, Git, Go, Hibernate, Java, Kubernetes, Linux, Maven, MySQL, Python, Ruby, Shell, Solr, Spring, Tomcat, Vagrant
Artificial Intelligence • Automotive • Internet of Things • Software
The Site Reliability Engineer will ensure application reliability, performance, and availability, emphasizing incident response and collaboration with development teams.
Top Skills:
ActiveMQ, Ansible, AppDynamics, AWS Lambda, CloudFormation, CloudWatch, EKS, Git, Java, JavaScript, Jenkins, jQuery, Kafka, Kubernetes, MSK, MySQL, Postgres, Python, RabbitMQ, REST APIs, Signals, Spinnaker, SQL, Terraform, Vue
Big Data • Cloud • Healthtech • Software • Big Data Analytics
As a Senior Software Engineer, you'll ensure the scalability and reliability of enterprise applications, leading incident management, automation, and strategic engineering efforts while mentoring team members.
Top Skills:
Ansible, AWS, Bash, Docker, Git, Go, Hibernate, Java, Kubernetes, Linux, Maven, MySQL, Python, Ruby, Shell, Solr, Spring, Tomcat, Vagrant
Big Data • Cloud • Healthtech • Software • Big Data Analytics
The Senior Software Engineer - SRE will ensure the reliability and scalability of enterprise applications, handle incident management, and mentor team members. The role requires expertise in Java and open-source technologies.
Top Skills:
Ansible, AWS, Bash, Docker, Git, Go, Hibernate, Java, Kubernetes, Linux, Maven, MySQL, Python, Ruby, Shell, Solr, Spring, Tomcat, Vagrant
Big Data • Cloud • Healthtech • Software • Big Data Analytics
As a Senior Site Reliability Engineer, ensure system reliability and scalability, lead incident management, develop automation tools, and mentor team members.
Top Skills:
Ansible, AWS, Bash, Docker, Git, Go, Hibernate, Java, Kubernetes, Linux, Maven, MySQL, Python, Ruby, Shell, Solr, Spring, Tomcat, Vagrant
Big Data • Cloud • Healthtech • Software • Big Data Analytics
As a Senior Site Reliability Engineer at Veeva, you will enhance the reliability and scalability of applications, lead incident management, and mentor team members while working with modern technologies.
Top Skills:
Ansible, AWS, Bash, Docker, Git, Go, Hibernate, Java, Kubernetes, Linux, Maven, MySQL, Python, Ruby, Shell, Solr, Spring, Tomcat, Vagrant
Big Data • Cloud • Healthtech • Software • Big Data Analytics
The role involves ensuring the scalability and reliability of enterprise applications through operational experience in Java environments, incident management, and full-stack diagnostics, in collaboration with cross-functional teams.
Top Skills:
Ansible, AWS, Bash, Docker, Git, Go, Hibernate, Java, Kubernetes, Linux, Maven, MySQL, Python, Ruby, Shell, Solr, Spring, Tomcat, Vagrant
Big Data • Cloud • Healthtech • Software • Big Data Analytics
The Senior Site Reliability Engineer will ensure the scalability and reliability of enterprise applications, manage incidents, automate operations, mentor team members, and support cross-team collaboration across the technology stack, with a primary focus on backend development.
Top Skills:
Ansible, AWS, Bash, Docker, Git, Go, Hibernate, Java, Kubernetes, Linux, Maven, MySQL, Python, Ruby, Shell, Solr, Spring, Tomcat, Vagrant
Fintech
The Staff Site Reliability Engineer role involves leading architecture, automating the GCP environment, defining SLIs and SLOs, mentoring teammates, and enhancing system reliability and performance.
Top Skills:
Argo CD, Datadog, GCP, Go, Helm, JavaScript, Kubernetes, Python, Terraform, TypeScript
Healthtech • Telehealth
Seeking a Site Reliability Engineer to ensure availability and performance of cloud infrastructure. Responsibilities include observability solutions, incident response, and collaboration with teams to improve reliability and service health.
Top Skills:
Ansible, AWS, Azure, Azure Monitor, Bash, CloudWatch, Dynatrace, Elastic, Grafana, PowerShell, Python, Terraform
Artificial Intelligence • Software • Generative AI
As a Staff Site Reliability Engineer, you will enhance system reliability and performance for WRITER's AI platform, utilizing cloud technologies, programming languages, and incident management practices.
Top Skills:
AWS, Azure, Docker, ELK Stack, GCP, Go, Grafana, Kubernetes, Prometheus, Python, Terraform