Top Site Reliability Engineer Jobs

Cerebras Systems

Staff Site Reliability Engineer – Automation and Platform

Reposted 20 Days Ago

In-Office or Remote

3 Locations

Senior level

Artificial Intelligence

The Deployment Engineer will build and operate AI inference clusters, ensure scalable deployments, optimize allocation, and maintain infrastructure. Responsibilities include software updates, telemetry development, and collaborative improvements with teams.

Top Skills: Docker, Grafana, InfluxDB, K8s, Linux, Prometheus, Python

Sonio

Site Reliability Engineer (SRE) - Boston

Reposted 20 Days Ago

Hybrid

Boston, MA, USA

165K-190K Annually

Mid level

Artificial Intelligence • Healthtech • Information Technology • Software

As the first Site Reliability Engineer in the US, you'll ensure platform stability and oversee incident responses during PST hours, bridging infrastructure and code, while improving operability and compliance in a medical-device environment.

Top Skills: AWS, Elixir, Kubernetes, Terraform

NVIDIA

Principal Site Reliability Engineer - Observability and Telemetry Platform

Reposted 20 Days Ago

In-Office or Remote

Santa Clara, CA, USA

248K-397K Annually

Expert/Leader

Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse

Design, implement, and support a large-scale Observability & Telemetry platform. Ensure reliability, monitor system health, and automate processes while engaging in incident response and postmortems.

Top Skills: Docker, Go, Grafana, Kubernetes, Linux, OpenStack, OpenTelemetry, Perl, Prometheus, Python, Ruby

Twenty

Forward Deployed Site Reliability Engineer

Reposted 20 Days Ago

In-Office

Fort Meade, MD, USA

Senior level

Artificial Intelligence • Information Technology • Cybersecurity • Defense

The Forward Deployed Site Reliability Engineer ensures the reliability of a mission-critical platform, manages incident response, defines SLIs and SLOs, and liaises between engineering and government customers.

Top Skills: AWS, Bash, Docker, Grafana, Loki, Mimir, Prometheus, Python, Terraform

Hive

Senior Site Reliability Engineer

Reposted 3 Days Ago

In-Office

Seattle, WA, USA

160K-250K Annually

Mid level

Artificial Intelligence • Cloud • Software

The Senior Site Reliability Engineer will automate operations, optimize workflows for teams, manage secure infrastructure, and participate in on-call duties.

Top Skills: Arista, AWS, Bash, Ceph, Chef, CIFS, Cisco, DNS, Docker, ELK Stack, Fortinet, HP, HTTP, ICMP, iSCSI, Jenkins, Kubernetes, Linux/Debian Family/Ubuntu, Mesosphere, NFS, Node.js, Pivotal Greenplum, Postgres, Python, RabbitMQ, Ruby, S3, Scylla, SSH, SSL, Supermicro, TCP, TLS

Alegeus

Site Reliability Engineer I

21 Days Ago

In-Office

3 Locations

53K-90K Annually

Junior

Healthtech • Financial Services

Support and maintain production, beta, and development web applications with rotating on-call duties. Troubleshoot complex incidents, perform root cause analysis, collaborate across teams, support deployments in on-prem and cloud (AWS/Azure), and ensure SLA compliance while participating in Agile/SAFe processes.

Top Skills: AWS, Azure, C#, Git, Java, Postgres, Python, SQL

Kerrigan Robotics

Engineering Lead – Platform & SRE

21 Days Ago

Hybrid

Santa Clara, CA, USA

175K-215K Annually

Senior level

Artificial Intelligence • Robotics • Automation • Manufacturing

Lead the Platform & SRE team to design and operate a unified deployment platform spanning cloud and on-premise. Architect Pulumi/Kubernetes-based deployments, package software for industrial hardware, build GitHub Actions CI/CD (including HIL), and define observability with Prometheus, Grafana, and OIDC.

Top Skills: AWS, Azure, Certificate Management, CNIs, GCP, GitHub Actions, Go, Grafana, Helm, HIL, Kubernetes, Linux Networking, Multi-Cluster Management, OIDC, Prometheus, Pulumi, Terraform

TCN

Site Reliability Engineer

Reposted 21 Days Ago

In-Office

St. George, UT, USA

Mid level

Cloud

The Site Reliability Engineer at TCN will design, deploy, and maintain systems for performance, reliability, and security, while managing incidents and collaborating with teams.

Top Skills: Bash, Go, Google Cloud Platform, Java, Kubernetes, Linux, Node.js, Python, Ruby

Mithril

Site Reliability Engineer (SRE)

Reposted 21 Days Ago

In-Office or Remote

8 Locations

170K-230K Annually

Mid level

Artificial Intelligence • Cloud • Information Technology • Software

Contribute to the reliability and performance of Mithril's GPU orchestration platform through automation, observability, and infrastructure management. Collaborate with the team to ensure scalability across multi-cloud environments while maintaining systems stability and implementing SLOs.

Top Skills: AWS, Azure, GCP, Go, Grafana, Kubernetes, Linux, OpenTelemetry, Prometheus, Pulumi, Python, TCP/IP, Terraform

CoverMyMeds

Sr. Database Site Reliability Engineer (DB SRE)

Reposted 21 Days Ago

In-Office or Remote

2 Locations

132K-221K Annually

Senior level

Healthtech • Information Technology • Software

The Sr. Database Site Reliability Engineer manages the reliability and performance of Azure PostgreSQL platforms, applying SRE principles for automation and observability. Responsibilities include incident response, backup strategies, and ensuring compliance with security standards.

Top Skills: Argo CD, Azure PostgreSQL, CI/CD, Datadog, Git, Helm, Kubernetes, Terraform

Cathexis

Site Reliability Engineer - Top Secret

Reposted 21 Days Ago

In-Office

Tyson's Corner, VA, USA

100K-160K Annually

Junior

Fintech

The Site Reliability Engineer will monitor and manage Kubernetes clusters, optimize cloud infrastructure, and automate processes using tools like Terraform and Docker.

Top Skills: Amazon S3, AWS, Azure, C/C++, Ceph, Docker, GCP, HDFS, Helm, Java, JavaScript, Kubernetes, NFS, Postgres, Python, Ruby, Terraform

Booz Allen Hamilton

Site Reliability Engineer, Senior

Reposted 21 Days Ago

In-Office

Aurora, CO, USA

87K-198K Annually

Senior level

Information Technology

As a Senior Site Reliability Engineer, you will enhance system resilience, automate tasks, and ensure robust infrastructure for national security.

Top Skills: Confluence, Docker, Git, Go, Java, Jenkins, Jira, Kubernetes, Linux, Nessus, Packer, Python, Rust

Cloud Support Engineer (SRE Development)

Reposted 21 Days Ago

In-Office

City of Broomfield, CO, USA

125K-187K Annually

Mid level

Cloud • Information Technology • Security • Software

Provide 24/7 technical support for a SaaS AI security platform, monitor uptime, triage and resolve incidents, collaborate with customers and development teams, and drive SRE-focused automation and reliability improvements.

Top Skills: AI Inference, AWS, Grafana, HTTP, JSON, Kubernetes, Linux, Postgres, Prometheus, Python, REST APIs, SaaS, Terraform, Ticketing System

Xpert Development LLC

Senior DevOps & Site Reliability Engineer

22 Days Ago

Remote

United States

165K-190K Annually

Senior level

Artificial Intelligence • Information Technology • Software • Automation

Own US PST coverage for releases and incidents as the first SRE; bridge infrastructure and code by working with Kubernetes, Terraform, and AWS and patching Elixir when needed; lead incident response and post-mortems; define SLOs and observability; author runbooks and support HIPAA-aligned compliance for a regulated medical-device platform.

Top Skills: AWS, Elixir, Kubernetes, Terraform

Alembic

Senior Network & Site Reliability Engineer

22 Days Ago

In-Office

San Francisco, CA, USA

210K-240K Annually

Senior level

Artificial Intelligence • Marketing Tech • Software • Big Data Analytics

Design, operate, and automate global network and reliability infrastructure for large-scale ML workloads and a private supercomputer. Own device configuration management, protocols (BGP, VPNs, WAN), datacenter fabrics, monitoring/SLOs, incident response, security/compliance, and cross-team reliability improvements.

Top Skills: Airflow, Ansible, Bash, BGP, BlueField, CNI, Cumulus Linux, Datadog, ECMP, ELK, EVPN/VXLAN, Firewalls, Grafana, InfiniBand, Infoblox, Ingress, IPsec VPNs, iSCSI, Kafka, Kubernetes, Linux, Load Balancers, LustreFS, MPLS, NetBox, Network Policy, NFS, Nornir, OpenTelemetry, Prometheus, Python, QoS, Service Networking, Spark, Spectrum-X, Spine-Leaf, Switches, Terraform, VPNs, WAN Circuits

PROS

Site Reliability Engineer II

22 Days Ago

In-Office

Houston, TX, USA

Senior level

Big Data • Software

Maintain and improve service reliability and scalability by monitoring performance, troubleshooting production issues, implementing SLOs, performing capacity analysis, and developing automation and self-service tools. Participate in reliability testing, resilience evaluation, and release management while collaborating with teams to align monitoring and SLOs with user expectations.

Top Skills: AWS ALB, AWS NLB, Azure Application Gateway, BGP, DHCP, DNS, EIGRP, F5, Firewall, Flow Logs, FortiGate, IDS/IPS, IP Addressing/Subnetting, Load Balancer, Micro-Segmentation, NACLs, NAT, Network Monitoring Platforms, Nginx, OSPF, Packet Analyzer, PrivateLink, Security Groups, VNet, VPC, VPC Peering, VPN, VPN Technologies, WAF

Hive

Senior Site Reliability Engineer

Reposted 3 Days Ago

In-Office

San Francisco, CA, USA

160K-250K Annually

Mid level

Artificial Intelligence • Cloud • Software

The Senior Site Reliability Engineer will automate operations, improve workflows, manage secure infrastructure, and participate in on-call rotation for an AI-driven company.

Top Skills: Arista, AWS, Bash, Ceph, Chef, CIFS, Cisco, DNS, Docker, ELK Stack, Fortinet, HP, HTTP, ICMP, IP, iSCSI, Jenkins, Kubernetes, Linux/Debian, Mesosphere, NFS, Node.js, Pivotal Greenplum, Postgres, Python, RabbitMQ, RAID, Ruby, S3, Scylla, SSH, SSL, Supermicro, TCP, TLS, Ubuntu

Edgescale AI

Principal Core Engineer — Infra / SRE

22 Days Ago

Hybrid

Denver, CO, USA

190K-215K Annually

Expert/Leader

Artificial Intelligence • Information Technology • Software • Infrastructure as a Service (IaaS)

Own fleet-scale reliability, upgradeability, and operational excellence for an edge platform. Design and operate automated, secure lifecycle systems, observability, and incident response. Lead cross-domain, high-severity incident ownership, set production standards (SLOs/SLIs, change management), mentor engineers, and apply AI to accelerate diagnostics and operational workflows.

Top Skills: AI Systems, Audit Logging, Canary Deployments, Configuration Management, Fleet Management, Infrastructure-as-Code, Kubernetes, Linux, Observability, Secure Boot, Staged Rollouts

Elligint Health

DevOps/Site Reliability Engineer (SRE)

Reposted 22 Days Ago

In-Office or Remote

3 Locations

Senior level

Healthtech

The SRE will design and implement platform solutions, maintain cloud environments, monitor and troubleshoot production issues, and automate tasks to improve efficiency.

Top Skills: Ansible, AWS, Docker, GCP, Git, IaC, Linux, MySQL, PHP, Terraform

RunSybil

SRE/Infrastructure Engineer

Reposted 22 Days Ago

Hybrid

2 Locations

30K-120K Annually

Senior level

Information Technology • Automation

The SRE/Infrastructure Engineer will architect and manage secure, scalable systems for automated penetration testing, optimizing reliability, and enhancing infrastructure based on customer demand. Responsibilities include maintaining production environments, leading technical discussions, and promoting high coding standards.

Top Skills: AWS, Azure, CloudFormation, ELK, GCP, New Relic, OpenTelemetry, Postgres, Prometheus, Terraform

iSpot.tv

Principal Site Reliability Engineer

Reposted 22 Days Ago

In-Office

Bellevue, WA, USA

164K-213K Annually

Senior level

AdTech

Lead and manage Site Reliability Engineering operations, focusing on system reliability, developer productivity, and strategic leadership to enhance service stability and streamline software delivery processes.

Top Skills: Spark, AWS, CircleCI, Datadog, JavaScript, Kubernetes, Python, Splunk, Terraform

Morgan Stanley

Site Reliability Engineer (SRE) - AI Platform & Cloud

Reposted 22 Days Ago

In-Office

Alpharetta, GA, USA

Senior level

Fintech • Financial Services

As a Site Reliability Engineer for the AI platform, you will ensure system reliability, develop automation, manage infrastructure, and evaluate new technologies.

Top Skills: Ansible, Apache Kafka, AWS, Azure, CloudFormation, Datadog, Docker, EFK, ELK, Go, GCP, GPU Clusters, Grafana, Helm, Java, Kubernetes, Prometheus, Python, Snowflake, Spark, Terraform

Nebius

Site Reliability Engineer

Reposted 22 Days Ago

Remote

United States

100K-140K Annually

Mid level

Artificial Intelligence • Information Technology • Consulting

The Linux Systems Administrator will maintain and troubleshoot Linux systems, support network services, and work on systems integration while collaborating with infrastructure teams.

Top Skills: DHCP, DNS, Linux, NTP, Python

Reveille Technologies Inc.

Site Reliability Engineer/System Engineer

Reposted 22 Days Ago

Hybrid

Atlanta, GA, USA

Senior level

HR Tech • Information Technology • Professional Services • Software • Business Intelligence • Consulting • Automation

Seeking a Site Reliability Engineer with expertise in Unix/Linux, scripting languages, and experience in containerization, cloud platforms, and application monitoring tools.

Top Skills: Ansible, Apache Tomcat, AWS, Cassandra, Chef, Coradiant, Docker, Dynatrace, Elastic, Gomez, GCP, Jenkins, Kafka, Linux, MQ Series, Oracle, Puppet, Python, Shell Scripting, Splunk, Tealeaf, Unix, Vagrant, WebSphere

Fidelity Investments

Principal AI Site Reliability Engineer, EI Production Services

Reposted 22 Days Ago

In-Office

2 Locations

Expert/Leader

Fintech

The Principal AI Site Reliability Engineer leads initiatives to enhance operational efficiency and reliability for contact center applications, utilizing AI-driven automation and observability tools. They collaborate with teams to minimize manual intervention and proactively monitor system performance while mentoring and guiding teams towards excellence.

Top Skills: AI, Automation, AWS, Azure, Bash, GCP, IT Service Management, Monitoring Tools, Observability, PowerShell, Python