Top Site Reliability Engineer Jobs
Artificial Intelligence • Blockchain • Fintech • Financial Services • Cryptocurrency • NFT • Web3
Own reliability, monitoring, and incident response for AI infrastructure; build automation and CI/CD tooling; manage Kubernetes/Docker production workloads; partner with infrastructure, security, and compliance; improve observability and documentation; develop internal full‑stack tooling in Go or Python.
Top Skills:
Ansible, AWS, Bash, Chef, CI/CD, Docker, EC2, Git, Go, Kubernetes, Linux, Log Aggregation, Network Security, Puppet, Python, Ruby, Salt, Terraform
Blockchain • Fintech • Payments • Consulting • Cryptocurrency • Cybersecurity • Quantum Computing
Lead Site Reliability Engineer responsible for platform stability, mentoring, and improving application performance through automation, configuration management, and operational readiness.
Top Skills:
Go, Java, Python, Spring Framework
Aerospace • Hardware • Robotics • Software • Manufacturing
Design, implement, and maintain a scalable SRE/DevOps platform across cloud and on-prem sites. Ensure uptime, automate deployments with IaC, define SLOs, leverage configuration management, and partner with development and manufacturing teams to increase automation and reliability.
Top Skills:
GCP, Helm, Kubernetes, Terraform
Artificial Intelligence • Legal Tech • Software
As a Senior Site Reliability Engineer, you'll operate foundational platform services, enhance reliability standards, automate processes, and work with engineering teams to improve systems.
Top Skills:
Cloud Infrastructure, Kubernetes, Observability Tools
Blockchain • Fintech • Payments • Consulting • Cryptocurrency • Cybersecurity • Quantum Computing
Lead reliability, scalability, and production operations for a greenfield enterprise application. Influence design for production readiness, own incident response, define SLIs/SLOs, build observability and automation, enhance CI/CD, and improve developer experience across infrastructure and application stacks.
Top Skills:
AWS, ChatGPT, Claude, Copilot, Docker, Elasticsearch, GitHub Actions, Go, Grafana, Kubernetes, OpenSearch, Opsgenie, Prometheus, Spring Boot
HR Tech • Information Technology • Professional Services • Sales • Software
Own and operate production-grade Kubernetes infrastructure on AWS, build GitOps CI/CD with GitHub Actions and ArgoCD, develop AI agents and internal DevOps tooling, maintain Datadog-based observability, and manage on-call incident response while collaborating with engineering teams to improve reliability and delivery speed.
Top Skills:
AI/LLM, ArgoCD, AWS, CI/CD, Datadog, GitHub Actions, GitOps, Go, Kubernetes, Python
Artificial Intelligence • Information Technology • Software
Design and manage multi-cloud infrastructure, implement CI/CD pipelines, ensure platform reliability, scalability, and security, and optimize performance for a SaaS platform used by enterprise customers.
Top Skills:
Argo, AWS, Azure, Datadog, Docker, GCP, GitHub Actions, Go, Kubernetes, Python, Terraform
Big Data • Cloud • Software • Database
Develop and maintain Kubernetes runtime environments, support developers, resolve critical issues, and participate in on-call rotations for production systems.
Top Skills:
AWS, Azure, cert-manager, CoreDNS, CRDs, CRI, CSI, Gatekeeper, GCP, Go, Helm, Kubernetes, Kustomize, Operators, Python, Terraform
Artificial Intelligence • Fintech • Payments • Social Impact • Analytics • Financial Services • Automation
As a Senior SRE, you'll ensure reliable and scalable systems, develop observability solutions and infrastructure as code, and lead incident response efforts.
Top Skills:
AWS, CloudFormation, Datadog, ELK, Prometheus, Terraform
Agency • Marketing Tech • Software • Consulting
Lead and maintain performance, security, and reliability of client hosting environments across multi-cloud platforms. Architect resilient infrastructure, manage IaC and CI/CD, administer Windows/IIS and WP Engine environments, handle SSL/DNS/SSO, participate in on-call rotations, and engage with clients as senior escalation and trusted advisor.
Top Skills:
App Services, Application Insights, AWS, Azure, Azure DevOps, Azure SQL, CI/CD, DNS, EC2, IAM, IIS, Key Vaults, PowerShell, RDS, S3, SSL/TLS, SSO, Terraform, Vulnerability Management, Windows Server, WP Engine
Fintech • Insurance • Financial Services
Provide SRE support for platform-level applications: incident management, performance/availability monitoring, root cause analysis, automation to reduce toil, disaster recovery participation, and technical leadership for reliability improvements.
Top Skills:
AWS, Azure, SQL Server, Windows
Digital Media • Information Technology • News + Entertainment
Responsible for ensuring reliability, scalability, and performance of data platforms: monitoring, incident response, automation, performance tuning, capacity planning, security/compliance, documentation, and cross-team collaboration to support large-scale data pipelines and backend data systems.
Top Skills:
Aerospike, Ansible, AWS, AWS S3, Azure, Cassandra, CI/CD, Containerization, Docker, ELK Stack, GCP, Go, Grafana, Hadoop, HDFS, Java, Kafka, Kubernetes, Microservices, MySQL, NoSQL, Postgres, Prometheus, Python, Scala, Snowflake, Spark, Terraform
Digital Media • Information Technology • News + Entertainment
Responsible for ensuring reliability, scalability, and performance of data platforms. Design monitoring and alerting, automate deployments and recovery, optimize storage and query performance, troubleshoot incidents, plan capacity and scaling, document operations, enforce security/compliance, and collaborate with data engineering, product, and data science teams to maintain high availability of large-scale data systems.
Top Skills:
Ansible, AWS, Azure, CI/CD, Docker, ELK Stack, GCP, Go, Grafana, Java, Kubernetes, MySQL, NoSQL, Postgres, Prometheus, Python, Scala, Terraform
Cybersecurity
Ensure reliability, scalability, observability, and cost efficiency of a customer-facing SaaS security platform. Manage Kubernetes/Helm deployments, CI/CD (GitLab/ArgoCD), monitoring, and service verification. Embed with engineering teams, optimize developer CI/CD workflows, monitor and debug production on AWS/GCP, and participate in a 24/7 on-call rotation.
Top Skills:
ArgoCD, AWS, GCP, GitLab CI/CD, Grafana, Helm, Kubernetes, Microservices, Prometheus
Information Technology
Design and architect highly available OSS/BSS and mainframe systems using SRE principles. Lead reliability, observability, automation, disaster recovery, incident management, and cross-functional transformations across hybrid cloud and on‑prem environments for telecom operations.
Top Skills:
AppDynamics, CI/CD, CICS, Db2, DevOps, Dynatrace, Grafana, Hybrid Cloud, IBM Mainframe, IBM Netcool, IBM z/OS, IMS, Infrastructure as Code, Instana, JCL, Linux, Solaris, Splunk, SRE, Telcordia FACS, Telcordia SOAC, Telcordia SWITCH, Telcordia TIRKS, Telcordia WFA, VSAM
Artificial Intelligence • Big Data • Information Technology • Software • Analytics
Own reliability for a live fleet of Linux-based edge sensors and cloud infrastructure. Triage and recover field hardware, perform SSH-based diagnostics, build fleet management and OTA systems, implement observability and alerting, automate operational tasks, develop runbooks, and participate in on-call rotations to prevent and resolve incidents.
Top Skills:
AWS, Bash, C, DNS, Docker, Firewalls, Go, IAM, Kubernetes, Linux, Python, Rust, SSH, VPN
Artificial Intelligence • Information Technology • Software
Lead end-to-end platform reliability: define SLIs/SLOs, harden production architecture, ensure Kubernetes runtime and queue safety, run incident command for Sev1/Sev2, own observability/on-call/runbooks, and gate risky releases while delivering a prioritized reliability roadmap.
Top Skills:
BullMQ, Koa, Kubernetes, Node.js, PostGraphile, Postgres, React, Redis, TypeScript
Information Technology
As a Senior Site Reliability Engineer, you'll enhance system resilience, automate tasks, and improve infrastructure for the Intelligence Community. You'll need significant Linux experience and programming knowledge.
Top Skills:
Confluence, Docker, Git, Go, HP, Java, Jenkins, JIRA, Kubernetes, Linux, Nessus, Packer, Python, Rust
Aerospace • Defense • Manufacturing
As Lead Site Reliability Engineer, you'll ensure reliability and performance of AI infrastructure, manage deployments, and mentor junior engineers.
Top Skills:
Ansible, BMC, CI/CD, CUDA, iDRAC, IPMI, Kubernetes, Linux, NVIDIA GPUs, OpenShift, Terraform
Hardware • Quantum Computing
Maintain and integrate hardware and software systems for quantum controls, manage lab and test infrastructure (HIL, K8s, networking, rack servers), automate provisioning and CI/CD, implement monitoring/alerting and observability, support incident response and root-cause analysis, and define operational procedures to ensure reliability across development and production environments.
Top Skills:
Ansible, Bash, Debian, DHCP, DNS, Docker, ELK Stack, Git, GitLab CI, Go, Grafana, Hardware-in-the-Loop (HIL), Jenkins, Kubernetes, LAN, Prometheus, Python, Rack Mount Servers, Red Hat, Routers, Switches, TCP/IP, Terraform, Ubuntu, VLAN, WAN, Windows
Artificial Intelligence • Software
Own the reliability and performance of backend systems at Gamma, building automation and tooling while leading incident response and improving system stability.
Top Skills:
AWS, CloudFormation, Docker, Go, Kafka, Kubernetes, Node.js, Python, Terraform, TypeScript
Artificial Intelligence • Software
As Staff SRE Tech Lead, you'll oversee platform reliability and scalability, lead the SRE team, architect data infrastructures, and optimize systems while implementing automation and observability practices.
Top Skills:
ClickHouse, Go, Postgres, Python, TypeScript
Aerospace • Manufacturing
As a Site Reliability Engineer, you'll build and manage observability platforms for satellite communications, define SLOs/SLIs, and collaborate on incident response and deployment automation.
Top Skills:
ArgoCD, AWS, ELK, GCP, Go, Grafana, Istio, Jaeger, Kubernetes, Linkerd, Loki, OpenTelemetry, Prometheus, Python, Tempo, Terraform
Software
Ensure reliable SIP connectivity, conduct interoperability testing, troubleshoot SIP issues, automate operations tasks, and mentor junior staff.
Top Skills:
Ansible, Bash, Empirix, HEPIC, Linux, Python, RTP, SBCs, SDPs, SIP, SIPp, VoIP, Wireshark
Artificial Intelligence • Fintech • Software • Financial Services
The SRE will own reliability for a cloud-native platform, optimizing performance, availability, and observability, while mentoring engineering teams.
Top Skills:
AWS, ClickHouse, Go, Kafka, Kubernetes, Pulumi, Python, Terraform