Top Site Reliability Engineer Jobs

Reposted 20 Days Ago
In-Office or Remote
3 Locations
Senior level
Artificial Intelligence
The Deployment Engineer will build and operate AI inference clusters, ensure scalable deployments, optimize allocation, and maintain infrastructure. Responsibilities include software updates, telemetry development, and collaborative improvements with teams.
Top Skills: Docker, Grafana, Influxdb, K8S, Linux, Prometheus, Python
Reposted 20 Days Ago
Hybrid
Boston, MA, USA
165K-190K Annually
Mid level
Artificial Intelligence • Healthtech • Information Technology • Software
As the first Site Reliability Engineer in the US, you'll ensure platform stability and oversee incident responses during PST hours, bridging infrastructure and code, while improving operability and compliance in a medical-device environment.
Top Skills: AWS, Elixir, Kubernetes, Terraform
Reposted 20 Days Ago
In-Office or Remote
Santa Clara, CA, USA
248K-397K Annually
Expert/Leader
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
Design, implement, and support a large-scale Observability & Telemetry platform. Ensure reliability, monitor system health, and automate processes while engaging in incident response and postmortems.
Top Skills: Docker, Go, Grafana, Kubernetes, Linux, Openstack, Opentelemetry, Perl, Prometheus, Python, Ruby
Reposted 20 Days Ago
In-Office
Fort Meade, MD, USA
Senior level
Artificial Intelligence • Information Technology • Cybersecurity • Defense
The Forward Deployed Site Reliability Engineer ensures the reliability of a mission-critical platform, manages incident response, defines SLIs and SLOs, and liaises between engineering and government customers.
Top Skills: AWS, Bash, Docker, Grafana, Loki, Mimir, Prometheus, Python, Terraform
Reposted 3 Days Ago
In-Office
Seattle, WA, USA
160K-250K Annually
Mid level
Artificial Intelligence • Cloud • Software
The Senior Site Reliability Engineer will automate operations, optimize workflows for teams, manage secure infrastructure, and participate in on-call duties.
Top Skills: Arista, AWS, Bash, Ceph, Chef, Cifs, Cisco, Dns, Docker, Elk Stack, Fortinet, Hp, HTTP, Icmp, Iscsi, Jenkins, Kubernetes, Linux/Debian Family/Ubuntu, Mesosphere, Nfs, Node.js, Pivotal Greenplum, Postgres, Python, RabbitMQ, Ruby, S3, Scylla, Ssh, Ssl, Supermicro, Tcp, Tls
21 Days Ago
In-Office
3 Locations
53K-90K Annually
Junior
Healthtech • Financial Services
Support and maintain production, beta, and development web applications with rotating on-call duties. Troubleshoot complex incidents, perform root cause analysis, collaborate across teams, support deployments in on-prem and cloud (AWS/Azure), and ensure SLA compliance while participating in Agile/SAFe processes.
Top Skills: AWS, Azure, C#, Git, Java, Postgres, Python, SQL
21 Days Ago
Hybrid
Santa Clara, CA, USA
175K-215K Annually
Senior level
Artificial Intelligence • Robotics • Automation • Manufacturing
Lead the Platform & SRE team to design and operate a unified deployment platform spanning cloud and on-premise. Architect Pulumi/Kubernetes-based deployments, package software for industrial hardware, build GitHub Actions CI/CD (including HIL), and define observability with Prometheus, Grafana, and OIDC.
Top Skills: AWS, Azure, Certificate Management, Cnis, GCP, Github Actions, Go, Grafana, Helm, Hil, Kubernetes, Linux Networking, Multi-Cluster Management, Oidc, Prometheus, Pulumi, Terraform
Reposted 21 Days Ago
In-Office
St. George, UT, USA
Mid level
Cloud
The Site Reliability Engineer at TCN will design, deploy, and maintain systems for performance, reliability, and security, while managing incidents and collaborating with teams.
Top Skills: Bash, Go, Google Cloud Platform, Java, Kubernetes, Linux, Node.js, Python, Ruby
Reposted 21 Days Ago
In-Office or Remote
8 Locations
170K-230K Annually
Mid level
Artificial Intelligence • Cloud • Information Technology • Software
Contribute to the reliability and performance of Mithril's GPU orchestration platform through automation, observability, and infrastructure management. Collaborate with the team to ensure scalability across multi-cloud environments while maintaining systems stability and implementing SLOs.
Top Skills: AWS, Azure, GCP, Go, Grafana, Kubernetes, Linux, Opentelemetry, Prometheus, Pulumi, Python, Tcp/Ip, Terraform
Reposted 21 Days Ago
In-Office or Remote
2 Locations
132K-221K Annually
Senior level
Healthtech • Information Technology • Software
The Sr. Database Site Reliability Engineer manages the reliability and performance of Azure PostgreSQL platforms, applying SRE principles for automation and observability. Responsibilities include incident response, backup strategies, and ensuring compliance with security standards.
Top Skills: Argocd, Azure Postgresql, Ci/Cd, Datadog, Git, Helm, Kubernetes, Terraform
Reposted 21 Days Ago
In-Office
Tyson's Corner, VA, USA
100K-160K Annually
Junior
Fintech
The Site Reliability Engineer will monitor and manage Kubernetes clusters, optimize Cloud Infrastructure, and automate processes using tools like Terraform and Docker.
Top Skills: Amazon S3, AWS, Azure, C/C++, Ceph, Docker, GCP, Hdfs, Helm, Java, JavaScript, Kubernetes, Nfs, Postgres, Python, Ruby, Terraform
Reposted 21 Days Ago
In-Office
Aurora, CO, USA
87K-198K Annually
Senior level
Information Technology
As a Senior Site Reliability Engineer, you will enhance system resilience, automate tasks, and ensure robust infrastructure for national security.
Top Skills: Confluence, Docker, Git, Go, Java, Jenkins, JIRA, Kubernetes, Linux, Nessus, Packer, Python, Rust
Reposted 21 Days Ago
In-Office
City of Broomfield, CO, USA
125K-187K Annually
Mid level
Cloud • Information Technology • Security • Software
Provide 24/7 technical support for a SaaS AI security platform, monitor uptime, triage and resolve incidents, collaborate with customers and development teams, and drive SRE-focused automation and reliability improvements.
Top Skills: Ai Inference, AWS, Grafana, HTTP, JSON, Kubernetes, Linux, Postgres, Prometheus, Python, Rest Apis, SaaS, Terraform, Ticketing System
22 Days Ago
Remote
United States
165K-190K Annually
Senior level
Artificial Intelligence • Information Technology • Software • Automation
Own US PST coverage for releases and incidents as the first SRE; bridge infrastructure and code by working with Kubernetes, Terraform, and AWS and patching Elixir when needed; lead incident response and post-mortems; define SLOs and observability; author runbooks and support HIPAA-aligned compliance for a regulated medical-device platform.
Top Skills: AWS, Elixir, Kubernetes, Terraform
22 Days Ago
In-Office
San Francisco, CA, USA
210K-240K Annually
Senior level
Artificial Intelligence • Marketing Tech • Software • Big Data Analytics
Design, operate, and automate global network and reliability infrastructure for large-scale ML workloads and a private supercomputer. Own device configuration management, protocols (BGP, VPNs, WAN), datacenter fabrics, monitoring/SLOs, incident response, security/compliance, and cross-team reliability improvements.
Top Skills: Airflow, Ansible, Bash, Bgp, Bluefield, Cni, Cumulus Linux, Datadog, Ecmp, Elk, Evpn/Vxlan, Firewalls, Grafana, Infiniband, Infoblox, Ingress, Ipsec Vpns, Iscsi, Kafka, Kubernetes, Linux, Load Balancers, Lustrefs, Mpls, Netbox, Network Policy, Nfs, Nornir, Opentelemetry, Prometheus, Python, Qos, Service Networking, Spark, Spectrum-X, Spine-Leaf, Switches, Terraform, Vpns, Wan Circuits
22 Days Ago
In-Office
Houston, TX, USA
Senior level
Big Data • Software
Maintain and improve service reliability and scalability by monitoring performance, troubleshooting production issues, implementing SLOs, performing capacity analysis, and developing automation and self-service tools. Participate in reliability testing, resilience evaluation, and release management while collaborating with teams to align monitoring and SLOs with user expectations.
Top Skills: Aws Alb, Aws Nlb, Azure Application Gateway, Bgp, Dhcp, Dns, Eigrp, F5, Firewall, Flow Logs, Fortigate, Ids/Ips, Ip Addressing/Subnetting, Load Balancer, Micro-Segmentation, Nacls, Nat, Network Monitoring Platforms, Nginx, Ospf, Packet Analyzer, Privatelink, Security Groups, Vnet, Vpc, Vpc Peering, Vpn, Vpn Technologies, Waf
Reposted 3 Days Ago
In-Office
San Francisco, CA, USA
160K-250K Annually
Mid level
Artificial Intelligence • Cloud • Software
The Senior Site Reliability Engineer will automate operations, improve workflows, manage secure infrastructure, and participate in on-call rotation for an AI-driven company.
Top Skills: Arista, AWS, Bash, Ceph, Chef, Cifs, Cisco, Dns, Docker, Elk Stack, Fortinet, Hp, HTTP, Icmp, Ip, Iscsi, Jenkins, Kubernetes, Linux/Debian, Mesosphere, Nfs, Node.js, Pivotal Greenplum, Postgres, Python, RabbitMQ, Raid, Ruby, S3, Scylla, Ssh, Ssl, Supermicro, Tcp, Tls, Ubuntu
22 Days Ago
Hybrid
Denver, CO, USA
190K-215K Annually
Expert/Leader
Artificial Intelligence • Information Technology • Software • Infrastructure as a Service (IaaS)
Own fleet-scale reliability, upgradeability, and operational excellence for an edge platform. Design and operate automated, secure lifecycle systems, observability, and incident response. Lead cross-domain, high-severity incident ownership, set production standards (SLOs/SLIs, change management), mentor engineers, and apply AI to accelerate diagnostics and operational workflows.
Top Skills: Ai Systems, Audit Logging, Canary Deployments, Configuration Management, Fleet Management, Infrastructure-As-Code, Kubernetes, Linux, Observability, Secure Boot, Staged Rollouts
Reposted 22 Days Ago
In-Office or Remote
3 Locations
Senior level
Healthtech
The SRE will design and implement platform solutions, maintain cloud environments, monitor and troubleshoot production issues, and automate tasks to improve efficiency.
Top Skills: Ansible, AWS, Docker, GCP, Git, Iac, Linux, MySQL, PHP, Terraform
Reposted 22 Days Ago
Hybrid
2 Locations
30K-120K Annually
Senior level
Information Technology • Automation
The SRE/Infrastructure Engineer will architect and manage secure, scalable systems for automated penetration testing, optimizing reliability, and enhancing infrastructure based on customer demand. Responsibilities include maintaining production environments, leading technical discussions, and promoting high coding standards.
Top Skills: AWS, Azure, CloudFormation, Elk, GCP, New Relic, Opentelemetry, Postgres, Prometheus, Terraform
Reposted 22 Days Ago
In-Office
Bellevue, WA, USA
164K-213K Annually
Senior level
AdTech
Lead and manage Site Reliability Engineering operations, focusing on system reliability, developer productivity, and strategic leadership to enhance service stability and streamline software delivery processes.
Top Skills: Spark, AWS, CircleCI, Datadog, JavaScript, Kubernetes, Python, Splunk, Terraform
Reposted 22 Days Ago
In-Office
Alpharetta, GA, USA
Senior level
Fintech • Financial Services
As a Site Reliability Engineer for an AI platform, you will ensure system reliability, develop automation, manage infrastructure, and evaluate new technologies.
Top Skills: Ansible, Apache Kafka, AWS, Azure, CloudFormation, Datadog, Docker, Efk, Elk, Go, GCP, Gpu Clusters, Grafana, Helm, Java, Kubernetes, Prometheus, Python, Snowflake, Spark, Terraform
Reposted 22 Days Ago
Remote
United States
100K-140K Annually
Mid level
Artificial Intelligence • Information Technology • Consulting
The Linux Systems Administrator will maintain and troubleshoot Linux systems, support network services, and work on systems integration while collaborating with infrastructure teams.
Top Skills: Dhcp, Dns, Linux, Ntp, Python
Reposted 22 Days Ago
Hybrid
Atlanta, GA, USA
Senior level
HR Tech • Information Technology • Professional Services • Software • Business Intelligence • Consulting • Automation
Seeking a Site Reliability Engineer with expertise in Unix/Linux, scripting languages, and experience in containerization, cloud platforms, and application monitoring tools.
Top Skills: Ansible, Apache Tomcat, AWS, Cassandra, Chef, Coradiant, Docker, Dynatrace, Elastic, Gomez, GCP, Jenkins, Kafka, Linux, Mq Series, Oracle, Puppet, Python, Shell Scripting, Splunk, Tealeaf, Unix, Vagrant, Websphere
Reposted 22 Days Ago
In-Office
2 Locations
Expert/Leader
Fintech
The Principal AI Site Reliability Engineer leads initiatives to enhance operational efficiency and reliability for contact center applications, utilizing AI-driven automation and observability tools. They collaborate with teams to minimize manual intervention and proactively monitor system performance while mentoring and guiding teams towards excellence.
Top Skills: AI, Automation, AWS, Azure, Bash, GCP, It Service Management, Monitoring Tools, Observability, Powershell, Python