Get the job you really want.
Top Senior Site Reliability Engineer Jobs
Artificial Intelligence • Software • Generative AI
As a Principal SRE, you will lead reliability, scalability, and operational health of Gradial's platform, driving improvements and collaborating with engineering.
Top Skills:
CI/CD, Infrastructure as Code, Kubernetes, Observability, Python, TypeScript
Logistics • Software • Transportation
Lead and mentor teams in DevOps and SRE, architect scalable Azure Cloud infrastructure, implement CI/CD and IaC, ensure database reliability, and drive cross-functional collaboration.
Top Skills:
Azure Cloud, Azure DevOps, CI/CD, CosmosDB, Docker, ELK, Grafana, Kubernetes, MySQL, Postgres, Prometheus, Redis, SQL Server, Terraform
Information Technology • Legal Tech
The Senior Technology Site Reliability Engineer is responsible for maintaining and optimizing infrastructure and applications, ensuring reliability and performance while automating processes and collaborating with teams.
Top Skills:
AWS, Chef, Datadog, Go, Grafana, Java, Prometheus, Puppet, Python, Salt, Terraform
Healthtech • Software
Maintain reliability, performance, and scalability of cloud-hosted services and databases. Implement SRE best practices, define SLIs/SLOs, respond to incidents, build monitoring and automation, perform DBA tasks (backups, restores, tuning), support CI/CD and DB migrations, and document runbooks and procedures.
Top Skills:
Amazon RDS, Azure SQL Database, Bash, ECS Fargate, Flyway, GitLab, Jenkins, Kubernetes, Liquibase, Octopus Deploy, Oracle, Postgres, PowerShell, Python, Redis, SolarWinds DPA, SQL Server
Software
The role involves managing compute infrastructure for decentralized applications, requiring critical thinking, documentation skills, and experience in Kubernetes and blockchain management.
Top Skills:
Blockchain, GitOps, Infrastructure-as-Code, Kubernetes, Programming Languages
Cloud • Information Technology • Security • Software
Lead multi-team SRE, Virtualization, Networking, and AI/GPU infrastructure to deliver reliable, scalable hybrid platforms. Own roadmap, operational excellence, SLO/SLI programs, automation/GitOps, Kubernetes and OpenStack operations, AI compute reliability, and cross-functional alignment and staffing.
Top Skills:
AI/GPU Compute, Ansible, Ceph, CI/CD, Cinder, CSI, Firewalls, GitOps, GPU Scheduling, Ingress Controllers, Keystone, Kubeflow, Kubernetes, Kubernetes CNI, KVM, L4, L7, MLflow, Network Policy, Neutron, Nova, Observability, OpenShift, OpenStack, Proxmox, Pulumi, Ray, Robin, Routing, SDN, Service Mesh, SLO/SLI, Switching, Terraform, Titan-K8S, Triton Inference Server, Vanilla Kubernetes, VMware, vSAN, XCP-ng, ZFS
Information Technology • Security • Software
Manage daily operations of a classified NOC, focusing on Kubernetes services, incident response, system monitoring, and ensuring security and availability.
Top Skills:
AWS GovCloud, Azure Government, C2E, C2S, Docker, Elastic Stack, Fluentd, Flux, Grafana, Helm, Jira, JWCC, Kubernetes, osTicket, Prometheus, Terraform
Software
The Principal Site Reliability Engineer will design and improve systems for reliability in payments software, guiding development cycles and incident response, while ensuring service health and organizational efficiency.
Top Skills:
Cassandra, Go, Java, Kafka, Oracle, Postgres, Python, RabbitMQ, Shell
Software
The Principal Site Reliability Engineer will enhance system reliability, promote SRE practices, lead organizational improvements, and ensure efficient software development and incident response processes.
Top Skills:
Cassandra, Go, Java, Kafka, Oracle, Postgres, Python, RabbitMQ, Shell
Artificial Intelligence • Cloud • Social Impact • Software • Wearables
As a Site Reliability Engineer II, you'll develop automation workflows, manage cloud operations, and enhance service reliability while participating in incident response and code reviews.
Top Skills:
APM, AWS, AWS CloudFormation, Azure, C#, CI/CD, Go, Java, Kubernetes, Observability Tools, Python, Temporal, Terraform
Other • Energy
Lead SRE practices for GCP-based data platforms, automate workflows, design reliable architectures, mentor engineers, and improve operational processes.
Top Skills:
BigQuery, CI/CD, Cloud Logging, Cloud Monitoring, Cloud Storage, Compute Engine, Dataflow, Datastream, GitHub Actions, GitLab CI, GKE, Google Cloud Platform, IAM, Kubernetes, Pub/Sub, Python, Terraform
Fintech • Payments
The Senior Staff SRE leads reliability engineering initiatives, drives operational excellence, mentors staff, and influences architecture to enhance system reliability and performance.
Top Skills:
AI/ML, AWS, Azure, Docker, ELK Stack, GCP, Grafana, Kubernetes, MySQL, NoSQL, Postgres, Splunk
Retail • Sports
Lead global D2C Site Reliability and Platform Operations to ensure availability, performance, and scalability of eCommerce and omnichannel systems. Define SRE strategy, SLIs/SLOs, incident management, observability, cloud operations, FinOps, vendor management, and global on-call models while building and developing high-performing teams and operational playbooks.
Top Skills:
Alerting, CI/CD, Cloud Infrastructure, Error Budgets, FinOps, Incident Management, Monitoring, Observability, Site Reliability Engineering (SRE), SLAs, SLIs, SLOs
Artificial Intelligence • eCommerce • Retail
Lead the SRE and DevOps team, ensure infrastructure reliability, oversee cloud operations, drive automation, and collaborate cross-functionally.
Top Skills:
Azure, Bash, CI/CD, Datadog, Docker, ELK Stack, Go, Grafana, Kubernetes, PowerShell, Prometheus, Python, Terraform
Aerospace • Big Data • Greentech • Hardware • Social Impact
Design, deploy, and operate compute services for on-premises and cloud satellite imaging platforms. Build reproducible, scalable, highly available deployments, troubleshoot distributed systems, optimize constrained environments, document and automate operations, and participate in on-call rotations to ensure reliability for customer-facing and air-gapped deployments.
Top Skills:
Alloy, Ansible, Bash, CUDA, GitOps, Grafana, Helm, Jira, K3s, Kubernetes, Kustomize, OpenTelemetry, Prometheus, Proxmox, Python, RKE2, Talos, Terraform
Artificial Intelligence • Cloud • Information Technology • Mobile • Software • Consulting
The role involves designing and implementing observability solutions using OpenTelemetry, managing platform engineering tasks, and ensuring site reliability through various engineering practices.
Top Skills:
AWS, Azure, CI/CD, CloudFormation, Docker, GCP, Go, Java, Kubernetes, Node.js, OpenTelemetry, Pulumi, Python, Rust, Terraform
Artificial Intelligence • Cloud • Information Technology • Mobile • Software • Consulting
The role involves designing and implementing OpenTelemetry solutions, optimizing telemetry infrastructure, establishing SRE practices, and managing observability across cloud platforms.
Top Skills:
ArgoCD, AWS, Azure, Bash, CloudFormation, Docker, GCP, GitHub Actions, GitLab CI, Go, Java, Jenkins, Node.js, OpenTelemetry, PowerShell, Pulumi, Python, Rust, Terraform
Software
Join the SRE team to improve monitoring, alerting, observability, and reliability of Fireblocks' production systems. Triage incidents, run RCA, create runbooks and automation (Python, Lambda, shell, Ansible, ArgoCD), collaborate with R&D/support, and participate in on-call rotation.
Top Skills:
Ansible, ArgoCD, AWS, AWS Lambda, Azure, Bash, Bitbucket, C++, Chef, Coralogix, Datadog, Docker, Gerrit, Git, GitLab, GCP, Helm, JavaScript, Kubernetes, Linux, MySQL, New Relic, Nginx, Node.js, Phabricator, Prometheus, Puppet, Python, Shell, Splunk
Software
As an AI Support Engineer, you'll manage support requests, resolve user issues, optimize ML models, and contribute to product development.
Top Skills:
TensorRT
Real Estate • Financial Services • PropTech
As a Site Reliability Engineer, you will support AWS Cloud products, optimize processes, enhance automation, and ensure system reliability and performance.
Top Skills:
ArgoCD, AWS, Azure DevOps, Bash, CI/CD, CloudWatch, Docker, EKS, FluxCD, Git, Kubernetes, PowerShell, Python, SQL, Terraform
Artificial Intelligence • Legal Tech • Professional Services • Software
As a Staff Software Engineer in Site Reliability, you'll manage infrastructure for reliability and scalability, lead incident management, and automate operational tasks.
Top Skills:
AWS, Azure, Bash, CloudFormation, Datadog, GCP, Go, incident.io, PagerDuty, Pulumi, Python, Sentry, Terraform
Artificial Intelligence • Legal Tech • Professional Services • Software
As a Software Engineer in Site Reliability, you will ensure the reliability and performance of our AI platform through automation and strategic infrastructure management.
Top Skills:
AWS, Azure, Bash, CloudFormation, Datadog, GCP, Go, Kubernetes, PagerDuty, Python, Sentry, Terraform
Energy
The Site Reliability Engineer will design and implement systems, drive automation, coordinate between teams, support deployed systems, and ensure scalability for rapid growth.
Top Skills:
Active Directory, Ansible, AWS, Azure, Chef, JSON, Linux, Puppet, Python, REST, VMware, Windows Server, YAML
Software
Lead SRE to define SRE strategy, architecture, and roadmap; design and operate containerized, compliant cloud environments; build observability, incident management, automation, and developer platform capabilities; mentor SRE team and collaborate with security, compliance, and product teams to ensure reliability at scale.
Top Skills:
AWS, AWS Marketplace, Azure, Azure Marketplace, GCP, Google Cloud Marketplace, Grafana, Kubernetes, Prometheus, Terraform
Cloud • Security • Software • Cybersecurity
The Senior Site Reliability Engineer will enhance performance and reliability of distributed systems, define KPIs, and collaborate cross-functionally to improve infrastructure and operational efficiency.
Top Skills:
Adbms, Bash, Datadog, Grafana, Internet Protocols, JavaScript, Oracle SQL, Prometheus, Python