Top Site Reliability Engineer Jobs

Reposted 16 Days Ago
Hybrid
Denver, CO, USA
110K-145K Annually
Senior level
Information Technology • Insurance • Software
Responsible for the reliability and performance of production services, managing SLIs and SLOs, and leading incident responses while collaborating with various teams.
Top Skills: .NET, AWS, C#, CI/CD, Java, Kubernetes, Linux, Python, React, Windows
Reposted 16 Days Ago
Remote or Hybrid
CO, USA
110K-145K Annually
Senior level
Information Technology • Insurance • Software
The Sr. Site Reliability Engineer at Vertafore will own the reliability and performance of production services, design incident response protocols, and enhance system observability while applying software engineering practices.
Top Skills: .NET, AWS, C#, CI/CD, Java, Kubernetes, Linux, Python, React, Windows
Reposted 17 Days Ago
Hybrid
San Francisco, CA, USA
160K-250K Annually
Senior level
Artificial Intelligence • Fintech • Payments • Business Intelligence • Financial Services • Generative AI
Lead design and delivery of scalable cloud infrastructure for the Spend product. Embed with development teams to drive reliability, performance, observability, incident response, and automation. Own SLOs, runbooks, DevOps metrics, and collaborate with central DevOps and security teams to ensure compliance and resilience. Lead infrastructure projects including new service launches, data centre migrations, and modernising data pipelines.
Top Skills: Analytics Pipelines, AWS, Data Streaming, DevOps, GCP, Incident Response, Kubernetes, Observability, SLOs, SRE
Reposted 11 Days Ago
In-Office or Remote
San Mateo, CA, USA
Junior
Cloud • Information Technology
The Site Reliability Engineer II role involves ensuring the stability and reliability of services, automating operational tasks, and collaborating with teams for system design while promoting reliability practices.
Top Skills: Ansible, AWS, Azure, Bash, Catchpoint, Docker, ELK, GCP, Go, Grafana, Jenkins, Kubernetes, Prometheus, Python, Terraform
Reposted 11 Days Ago
In-Office
Atlanta, GA, USA
Mid level
Healthtech • Software
The Site Reliability Engineer (SRE) will enhance platform reliability and scalability through AI-driven automation, collaborate with product engineers, and manage incidents, monitoring, and documentation processes.
Top Skills: AWS, CI/CD, Terraform
Reposted 12 Days Ago
Remote
USA
Mid level
Other
As a Site Reliability Engineer, you will design cloud platforms, automate operations, maintain infrastructure, and support engineering teams in delivering reliable services.
Top Skills: Ansible, AWS, Azure, Bash, CircleCI, CloudFormation, Datadog, DNS, Docker, GitLab CI, Go, GCP, Grafana, HTTP, HTTPS, Jenkins, Kubernetes, KVM, Linux, Perl, Prometheus, Python, Ruby, TCP/IP, Terraform, Unix, VMware
Reposted 12 Days Ago
In-Office
Los Angeles, CA, USA
130K-145K Annually
Mid level
Events
The Site Reliability Engineer II designs and maintains scalable systems, focusing on automation, monitoring, incident response, and collaboration with developers to enhance operational practices and efficiency.
Top Skills: Bash, Cloud Service Operations, Containers, Continuous Delivery, Continuous Integration, Go, Infrastructure as Code, Orchestration Platforms, Python
Reposted 12 Days Ago
Hybrid
Denver, CO, USA
95K-136K Annually
Senior level
Artificial Intelligence • Cloud • Events • Productivity • Software • Business Intelligence • Conversational AI
Maintain and improve uptime, availability, and performance of services via observability, redundancy, failover, and load‑balancing. Integrate monitoring into SDLC, lead incident response/on‑call, assess capacity and risks, and work with teams to extend observability and automate self‑healing.
Top Skills: Alertmanager, Ansible, ArgoCD, AWS, Azure, Bash, ELK, GCP, GitLab, GitLab CI, Go, Grafana, Java, JavaScript, Jenkins, Kafka, Kubernetes, Linux, MongoDB, MySQL, Nginx, Postgres, Prometheus, Python, Terraform, VictoriaMetrics, Zabbix
Reposted 12 Days Ago
Hybrid
New York, NY, USA
Senior level
Artificial Intelligence
Seeking an experienced Site Reliability Engineer to enhance platform reliability, scalability, and performance by balancing operations with long-term software engineering improvements.
Top Skills: AI, Bash, Datadog, Docker, ELK Stack, Flux, Go, Grafana, Kubernetes, Prometheus, Python, Terraform
Reposted 12 Days Ago
Remote or Hybrid
2 Locations
165K-330K Annually
Mid level
Software
As an AI Support Engineer, you'll manage support requests, resolve user issues, optimize ML models, and contribute to product development.
Top Skills: TensorRT
Reposted 12 Days Ago
In-Office or Remote
2 Locations
165K-225K Annually
Senior level
Artificial Intelligence • Cloud • Information Technology • Software
Build and operate production-grade AI infrastructure using Kubernetes, ensuring high availability, reliability, and performance. Develop custom operators and implement automation for efficient operations and monitoring.
Top Skills: Ansible, Bash, ELK Stack, Enterprise Storage Systems, Grafana, High-Performance Networking, Kubernetes, Linux, NVIDIA GPU Technologies, Prometheus, Python, Terraform
Reposted 12 Days Ago
Remote
United States
120K-160K Annually
Senior level
Healthtech • Other • Software
As a Senior Database Site Reliability Engineer, you'll design, implement, and maintain PostgreSQL systems, ensure reliability, automate maintenance tasks, and participate in incident response.
Top Skills: Ansible, Bash, Datadog, Grafana, New Relic, Postgres, PowerShell, Prometheus, Python, Terraform
Reposted 12 Days Ago
In-Office or Remote
Basel, KS, USA
160K-185K Annually
Senior level
Software
Technical leader responsible for reliability, scalability, performance, and operational excellence of a cloud SaaS platform. Drive platform modernization to containers/Kubernetes on Azure, define SLOs/SLAs, lead observability, incident response/RCA, automation/tooling, and mentor engineers while ensuring compliance with public-sector standards.
Top Skills: Ansible, Argo CD, Bash, Claude Code, Distributed Tracing, FedRAMP, Flux, Git, GitHub Copilot, HIPAA, Kubernetes, Linux, Logging, Azure, Monitoring, Observability Platforms, OpenTelemetry, PCI-DSS, PowerShell, Python, SOC 2, StateRAMP, Terraform, VM-Based Architectures, Windows
Reposted 12 Days Ago
Remote
USA
114K-148K Annually
Senior level
Software • Financial Services
Ensure platform reliability, performance, and availability by implementing observability, automating infrastructure, participating in on-call rotations and post-mortems, partnering with Product and Engineering, designing scalable architectures, mentoring teammates, and integrating Dynatrace with Azure DevOps and Jira while supporting compliance (SOC/FedRAMP).
Top Skills: .NET, AKS, Alpine, Ansible, AppInsights, ARM Templates, AWS, Azure DevOps, Bash, Bicep, C#, Chef, CloudFormation, Datadog, Debian, Dynatrace, EKS, GCP, Git, GKS, Grafana, Helm, Jira, Kubernetes, Log Analytics, Azure, New Relic, OneStream Software, OpenShift, PowerShell, PowerShell DSC, Prometheus, Puppet, Python, REST APIs, SQL, Terraform, Ubuntu
Reposted 12 Days Ago
Remote
USA
Senior level
Fintech • Information Technology
As a Site Reliability Engineer at Alpaca, you will ensure system reliability and performance, troubleshoot issues, and collaborate with teams to design scalable features.
Top Skills: Go, GORM, Linux, pgx, Postgres, Prometheus, sqlc
Reposted 12 Days Ago
Remote
USA
Senior level
Gaming • Software
The Site Reliability Engineer will manage infrastructure stability and scalability, lead cloud migrations, and optimize performance across systems while mentoring team members.
Top Skills: Ansible, AWS, Azure, Bash, Chef, CloudFormation, Datadog, Docker, ELK Stack, GCP, Go, Grafana, Kubernetes, Prometheus, Puppet, Python, Terraform, Unix/Linux
Reposted 12 Days Ago
In-Office
Omaha, NE, USA
Mid level
Healthtech • Insurance
Owner of enterprise observability and SRE practices: define SLOs/SLA measurement, drive MTTR reduction, lead incident response, maintain service dependency maps and reliability dashboards, and leverage AI/AIOps to automate triage, root cause analysis, and self-healing remediation across vendor and internal platforms.
Top Skills: AI/AIOps, Bash, Chaos Engineering, CI/CD, CMDB, Dashboarding, Data Modeling, Distributed Tracing, Infrastructure-as-Code, ITSM/Ticketing Systems, Log Aggregation, Monitoring Platforms, Observability Platforms, PowerShell, Python, SIEM, Telemetry
Reposted 12 Days Ago
In-Office
3 Locations
Mid level
Hardware • Other • Software • Appliances • Industrial • Manufacturing
Develop and maintain UIs and APIs using Next.js and .NET. Implement AWS services, apply SRE principles, and contribute to CI/CD pipelines.
Top Skills: .NET, AWS, AWS CloudFormation, C#, Docker, EC2, Entity Framework, Grafana, Kubernetes, Lambda, Next.js, Prometheus, RDS, React, S3, Terraform
12 Days Ago
In-Office
Pleasant Grove, UT, USA
Expert/Leader
Hardware • Internet of Things
Lead architecture and implementation of enterprise-scale infrastructure and automation for web, mobile, backend, and data teams. Define reliability standards, incident response and DR strategies, optimize performance with advanced observability, and mentor engineering teams while driving SRE best practices across the organization.
Top Skills: AWS, GCP, Go, IAM, Kubernetes, Node.js, Observability, Python, Terraform
12 Days Ago
Remote
United States
150K-210K Annually
Senior level
Artificial Intelligence • Cloud • Information Technology • Software • Big Data Analytics
Founding Staff SRE for Volcano: define SLOs/error budgets, architect multi-region Kubernetes infrastructure, build GitOps/CI-CD with ArgoCD/Helm/Terraform, scale managed Postgres/Redis/object storage, implement observability with Datadog/Prometheus/Grafana, lead incident response and SRE culture, and mentor cross-functional teams.
Top Skills: ArgoCD, Canary Deployments, CI/CD, CNI, Datadog, GitOps, Grafana, Helm, Ingress, Kubernetes, Object Storage, Postgres, Prometheus, Redis, Service Mesh, Terraform, Terragrunt
12 Days Ago
In-Office or Remote
San Francisco, CA, USA
Expert/Leader
Artificial Intelligence • Information Technology • Software • Automation
Lead technical vision as a principal engineer, either managing teams or driving cross-team initiatives. Design and architect cloud infrastructure, networking, and security; define authentication/authorization patterns; architect and operate Kubernetes deployments; and implement infrastructure-as-code using tools like Terraform, CloudFormation, Ansible, or Puppet.
Top Skills: Ansible, AWS, CloudFormation, GCP, IAM, Kubernetes, Puppet, RBAC, Security Groups, Terraform
Reposted 12 Days Ago
In-Office
Alpharetta, GA, USA
Senior level
Fintech • Financial Services
Lead Site Reliability Engineer responsible for production support, automating deployments, monitoring availability and performance, troubleshooting infrastructure and applications, driving reliability improvements, collaborating with development and infrastructure teams, and participating in 24/7 on-call rotation.
Top Skills: Autosys, AWS, Azure, C#, CI/CD, Containers, Db2, Generative AI Tools, IP Soft, Java, Jenkins, Linux, MQ, Oracle, Perl, Python, Ruby, Shell, Sockeye, Splunk, Sybase, Train, Unix, Virtual Machines, Web Services, Windows
Reposted 12 Days Ago
In-Office
2 Locations
Senior level
Fintech • Analytics
As a Site Reliability Engineer, you will ensure the reliability and performance of an FX trading platform, develop automation, improve system health, and manage SLOs while collaborating with development teams.
Top Skills: AWS, Azure, Bash, C#, Java, Kubernetes, Python, SQL
Reposted 12 Days Ago
In-Office
San Francisco, CA, USA
255K-490K Annually
Mid level
Artificial Intelligence • Machine Learning • Generative AI
As a Site Reliability Engineer, you will manage Kubernetes clusters, automate infrastructure, improve operational metrics, and enhance reliability across data centers.
Top Skills: CloudFormation, Go, GPU, Kubernetes, Linux, Python, Terraform
Reposted 12 Days Ago
In-Office
Omaha, NE, USA
Mid level
Software
As a Site Reliability Engineer, you'll optimize monitoring and alerting systems, enhance user experience, and support teams with actionable insights and automation.
Top Skills: Ansible, AWS, Azure, Bash, Datadog, ELK Stack, GCP, Git, Grafana, Jenkins, Nagios, New Relic, PowerShell, Prometheus, Python, Terraform