Top Site Reliability Engineer Jobs
Reposted 20 Days Ago
Artificial Intelligence
The Deployment Engineer will build and operate AI inference clusters, ensure scalable deployments, optimize allocation, and maintain infrastructure. Responsibilities include software updates, telemetry development, and collaborative improvements with teams.
Top Skills:
Docker, Grafana, Influxdb, K8S, Linux, Prometheus, Python
Artificial Intelligence • Healthtech • Information Technology • Software
As the first Site Reliability Engineer in the US, you'll ensure platform stability and oversee incident responses during PST hours, bridging infrastructure and code, while improving operability and compliance in a medical-device environment.
Top Skills:
AWS, Elixir, Kubernetes, Terraform
Reposted 20 Days Ago
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
Design, implement, and support a large-scale Observability & Telemetry platform. Ensure reliability, monitor system health, and automate processes while engaging in incident response and postmortems.
Top Skills:
Docker, Go, Grafana, Kubernetes, Linux, Openstack, Opentelemetry, Perl, Prometheus, Python, Ruby
Artificial Intelligence • Information Technology • Cybersecurity • Defense
The Forward Deployed Site Reliability Engineer ensures the reliability of a mission-critical platform, manages incident response, defines SLIs and SLOs, and liaises between engineering and government customers.
Top Skills:
AWS, Bash, Docker, Grafana, Loki, Mimir, Prometheus, Python, Terraform
Artificial Intelligence • Cloud • Software
The Senior Site Reliability Engineer will automate operations, optimize workflows for teams, manage secure infrastructure, and participate in on-call duties.
Top Skills:
Arista, AWS, Bash, Ceph, Chef, Cifs, Cisco, Dns, Docker, Elk Stack, Fortinet, Hp, HTTP, Icmp, Iscsi, Jenkins, Kubernetes, Linux/Debian Family/Ubuntu, Mesosphere, Nfs, Node.js, Pivotal Greenplum, Postgres, Python, RabbitMQ, Ruby, S3, Scylla, Ssh, Ssl, Supermicro, Tcp, Tls
Healthtech • Financial Services
Support and maintain production, beta, and development web applications with rotating on-call duties. Troubleshoot complex incidents, perform root cause analysis, collaborate across teams, support deployments in on-prem and cloud (AWS/Azure), and ensure SLA compliance while participating in Agile/SAFe processes.
Top Skills:
AWS, Azure, C#, Git, Java, Postgres, Python, SQL
Artificial Intelligence • Robotics • Automation • Manufacturing
Lead the Platform & SRE team to design and operate a unified deployment platform spanning cloud and on-premise. Architect Pulumi/Kubernetes-based deployments, package software for industrial hardware, build GitHub Actions CI/CD (including HIL), and define observability with Prometheus, Grafana, and OIDC.
Top Skills:
AWS, Azure, Certificate Management, Cnis, GCP, Github Actions, Go, Grafana, Helm, Hil, Kubernetes, Linux Networking, Multi-Cluster Management, Oidc, Prometheus, Pulumi, Terraform
Cloud
The Site Reliability Engineer at TCN will design, deploy, and maintain systems for performance, reliability, and security, while managing incidents and collaborating with teams.
Top Skills:
Bash, Go, Google Cloud Platform, Java, Kubernetes, Linux, Node.js, Python, Ruby
Artificial Intelligence • Cloud • Information Technology • Software
Contribute to the reliability and performance of Mithril's GPU orchestration platform through automation, observability, and infrastructure management. Collaborate with the team to ensure scalability across multi-cloud environments while maintaining systems stability and implementing SLOs.
Top Skills:
AWS, Azure, GCP, Go, Grafana, Kubernetes, Linux, Opentelemetry, Prometheus, Pulumi, Python, Tcp/Ip, Terraform
Healthtech • Information Technology • Software
The Sr. Database Site Reliability Engineer manages the reliability and performance of Azure PostgreSQL platforms, applying SRE principles for automation and observability. Responsibilities include incident response, backup strategies, and ensuring compliance with security standards.
Top Skills:
Argocd, Azure Postgresql, Ci/Cd, Datadog, Git, Helm, Kubernetes, Terraform
Fintech
The Site Reliability Engineer will monitor and manage Kubernetes clusters, optimize Cloud Infrastructure, and automate processes using tools like Terraform and Docker.
Top Skills:
Amazon S3, AWS, Azure, C/C++, Ceph, Docker, GCP, Hdfs, Helm, Java, JavaScript, Kubernetes, Nfs, Postgres, Python, Ruby, Terraform
Information Technology
As a Senior Site Reliability Engineer, you will enhance system resilience, automate tasks, and ensure robust infrastructure for national security.
Top Skills:
Confluence, Docker, Git, Go, Java, Jenkins, JIRA, Kubernetes, Linux, Nessus, Packer, Python, Rust
Cloud • Information Technology • Security • Software
Provide 24/7 technical support for a SaaS AI security platform, monitor uptime, triage and resolve incidents, collaborate with customers and development teams, and drive SRE-focused automation and reliability improvements.
Top Skills:
Ai Inference, AWS, Grafana, HTTP, JSON, Kubernetes, Linux, Postgres, Prometheus, Python, Rest Apis, SaaS, Terraform, Ticketing System
Artificial Intelligence • Information Technology • Software • Automation
Own US PST coverage for releases and incidents as the first SRE; bridge infrastructure and code by working with Kubernetes, Terraform, and AWS and patching Elixir when needed; lead incident response and post-mortems; define SLOs and observability; author runbooks and support HIPAA-aligned compliance for a regulated medical-device platform.
Top Skills:
AWS, Elixir, Kubernetes, Terraform
Artificial Intelligence • Marketing Tech • Software • Big Data Analytics
Design, operate, and automate global network and reliability infrastructure for large-scale ML workloads and a private supercomputer. Own device configuration management, protocols (BGP, VPNs, WAN), datacenter fabrics, monitoring/SLOs, incident response, security/compliance, and cross-team reliability improvements.
Top Skills:
Airflow, Ansible, Bash, Bgp, Bluefield, Cni, Cumulus Linux, Datadog, Ecmp, Elk, Evpn/Vxlan, Firewalls, Grafana, Infiniband, Infoblox, Ingress, Ipsec Vpns, Iscsi, Kafka, Kubernetes, Linux, Load Balancers, Lustrefs, Mpls, Netbox, Network Policy, Nfs, Nornir, Opentelemetry, Prometheus, Python, Qos, Service Networking, Spark, Spectrum-X, Spine-Leaf, Switches, Terraform, Vpns, Wan Circuits
Big Data • Software
Maintain and improve service reliability and scalability by monitoring performance, troubleshooting production issues, implementing SLOs, performing capacity analysis, and developing automation and self-service tools. Participate in reliability testing, resilience evaluation, and release management while collaborating with teams to align monitoring and SLOs with user expectations.
Top Skills:
Aws Alb, Aws Nlb, Azure Application Gateway, Bgp, Dhcp, Dns, Eigrp, F5, Firewall, Flow Logs, Fortigate, Ids/Ips, Ip Addressing/Subnetting, Load Balancer, Micro-Segmentation, Nacls, Nat, Network Monitoring Platforms, Nginx, Ospf, Packet Analyzer, Privatelink, Security Groups, Vnet, Vpc, Vpc Peering, Vpn, Vpn Technologies, Waf
Artificial Intelligence • Cloud • Software
The Senior Site Reliability Engineer will automate operations, improve workflows, manage secure infrastructure, and participate in on-call rotation for an AI-driven company.
Top Skills:
Arista, AWS, Bash, Ceph, Chef, Cifs, Cisco, Dns, Docker, Elk Stack, Fortinet, Hp, HTTP, Icmp, Ip, Iscsi, Jenkins, Kubernetes, Linux/Debian, Mesosphere, Nfs, Node.js, Pivotal Greenplum, Postgres, Python, RabbitMQ, Raid, Ruby, S3, Scylla, Ssh, Ssl, Supermicro, Tcp, Tls, Ubuntu
Artificial Intelligence • Information Technology • Software • Infrastructure as a Service (IaaS)
Own fleet-scale reliability, upgradeability, and operational excellence for an edge platform. Design and operate automated, secure lifecycle systems, observability, and incident response. Lead cross-domain, high-severity incident ownership, set production standards (SLOs/SLIs, change management), mentor engineers, and apply AI to accelerate diagnostics and operational workflows.
Top Skills:
Ai Systems, Audit Logging, Canary Deployments, Configuration Management, Fleet Management, Infrastructure-As-Code, Kubernetes, Linux, Observability, Secure Boot, Staged Rollouts
Healthtech
The SRE will design and implement platform solutions, maintain cloud environments, monitor and troubleshoot production issues, and automate tasks to improve efficiency.
Top Skills:
Ansible, AWS, Docker, GCP, Git, Iac, Linux, MySQL, PHP, Terraform
Information Technology • Automation
The SRE/Infrastructure Engineer will architect and manage secure, scalable systems for automated penetration testing, optimizing reliability, and enhancing infrastructure based on customer demand. Responsibilities include maintaining production environments, leading technical discussions, and promoting high coding standards.
Top Skills:
AWS, Azure, CloudFormation, Elk, GCP, New Relic, Opentelemetry, Postgres, Prometheus, Terraform
AdTech
Lead and manage Site Reliability Engineering operations, focusing on system reliability, developer productivity, and strategic leadership to enhance service stability and streamline software delivery processes.
Top Skills:
Spark, AWS, CircleCI, Datadog, JavaScript, Kubernetes, Python, Splunk, Terraform
Fintech • Financial Services
As a Site Reliability Engineer for an AI platform, you will ensure system reliability, develop automation, manage infrastructure, and evaluate new technologies.
Top Skills:
Ansible, Apache Kafka, AWS, Azure, CloudFormation, Datadog, Docker, Efk, Elk, Go, GCP, Gpu Clusters, Grafana, Helm, Java, Kubernetes, Prometheus, Python, Snowflake, Spark, Terraform
Artificial Intelligence • Information Technology • Consulting
The Linux Systems Administrator will maintain and troubleshoot Linux systems, support network services, and work on systems integration while collaborating with infrastructure teams.
Top Skills:
Dhcp, Dns, Linux, Ntp, Python
HR Tech • Information Technology • Professional Services • Software • Business Intelligence • Consulting • Automation
Seeking a Site Reliability Engineer with expertise in Unix/Linux, scripting languages, and experience in containerization, cloud platforms, and application monitoring tools.
Top Skills:
Ansible, Apache Tomcat, AWS, Cassandra, Chef, Coradiant, Docker, Dynatrace, Elastic, Gomez, GCP, Jenkins, Kafka, Linux, Mq Series, Oracle, Puppet, Python, Shell Scripting, Splunk, Tealeaf, Unix, Vagrant, Websphere
Reposted 22 Days Ago
Fintech
The Principal AI Site Reliability Engineer leads initiatives to enhance operational efficiency and reliability for contact center applications, utilizing AI-driven automation and observability tools. They collaborate with teams to minimize manual intervention and proactively monitor system performance while mentoring and guiding teams towards excellence.
Top Skills:
AI, Automation, AWS, Azure, Bash, GCP, It Service Management, Monitoring Tools, Observability, Powershell, Python