Top Site Reliability Engineer Jobs

Reposted 12 Days Ago
In-Office
New York, NY, USA
140K-225K Annually
Senior level
Fintech
Lead adoption of SRE practices to improve reliability, observability, automation, and incident response. Implement and maintain observability tooling, instrumentation, CI/CD, and infrastructure-as-code. Partner with developers, participate in on-call rotations, drive postmortems, and reduce operational overhead through automation.
Top Skills: Anthropic, AWS, AWS ECS, AWS EKS, Azure, C#, Docker, GitLab CI, Grafana, Linux, OpenAI, Prometheus, Puppet, Python, Splunk, Terraform, TypeScript, Windows
Reposted 12 Days Ago
In-Office
Los Angeles, CA, USA
164K-270K Annually
Mid level
Aerospace • Hardware • Software • Defense • Manufacturing
As a Site Reliability Engineer, you'll ensure robotics system reliability, build telemetry integration, and develop tools for diagnostics and automation, collaborating with engineering teams for enhanced production reliability.
Top Skills: C++, Datadog, Go, Kubernetes, OpenTelemetry, Prometheus, Python, ROS 2, Telegraf, TypeScript
Reposted 12 Days Ago
Remote
2 Locations
175K-275K Annually
Mid level
Software
As a Site Reliability Engineer, you'll enhance system reliability, collaborate on production readiness, define SLIs/SLOs, and improve incident response.
Top Skills: AWS, Datadog, Grafana, Kubernetes, OpenTelemetry, Prometheus, TypeScript
Reposted 13 Days Ago
Hybrid
2 Locations
Mid level
Fintech • Financial Services
The Site Reliability Engineer will support cloud infrastructure, automate deployments, and ensure operational efficiency and governance across public cloud platforms.
Top Skills: Ansible, AWS, Azure, Azure CLI, Azure Functions, Azure Kubernetes Service, Cosmos DB, GCP, Git, Jenkins, Kubernetes, Linux, PowerShell, Terraform, Windows
Reposted 13 Days Ago
Remote or Hybrid
4 Locations
148K-249K Annually
Senior level
Transportation
Design and develop Waabi's observability stack, optimize performance, build automation tooling, and support application requirements while leading projects and mentoring teams.
Top Skills: AWS, C/C++, Docker, Go, Grafana, Java, Kubernetes, OpenTelemetry, Python, Rust
Reposted 13 Days Ago
In-Office
2 Locations
112K-137K Annually
Senior level
Fintech
The Site Reliability Engineer will manage AWS infrastructures, oversee application deployments, and ensure system reliability and security while collaborating with teams.
Top Skills: AWS, Bash, CodeBuild, CodeDeploy, CodePipeline, EC2, IAM, Python, RDS, Route 53, S3, Terraform, VPC
Reposted 13 Days Ago
In-Office
San Francisco, CA, USA
Senior level
Artificial Intelligence • Healthtech
The Site Reliability Engineer will enhance system reliability, define observability standards, respond to incidents, and collaborate with engineering teams on performance and compliance improvements.
Top Skills: AWS, Containerized Services, Distributed Workflows, Observability Tooling, Postgres, Serverless Compute
Reposted 19 Days Ago
In-Office or Remote
La Crosse, WI, USA
92K-164K Annually
Mid level
Artificial Intelligence • Big Data • Healthtech • Information Technology • Machine Learning • Software • Analytics
The Senior Observability Engineer maintains monitoring systems, designs log aggregation solutions, automates tasks with scripts, and ensures platform performance.
Top Skills: Ansible, Bash, Dynatrace, Elasticsearch, ELK, Filebeat, Fluent Bit, Fluentd, Grafana, Linux, Logstash, OTel, PowerShell, Prometheus, Python, Terraform
Reposted 14 Days Ago
In-Office or Remote
Washington, DC, USA
Entry level
Fintech • Information Technology • Professional Services • Software
The Site Reliability Engineer serves as a consultant for Taxwell, focusing on ensuring the reliability and performance of their tax preparation software.
Reposted 14 Days Ago
In-Office
San Francisco, CA, USA
175K-250K Annually
Mid level
Artificial Intelligence • Cloud • Software • Infrastructure as a Service (IaaS)
The Site Reliability Engineer will ensure the reliability and performance of AI infrastructure, build core systems, handle incident response, and develop automation tools.
Top Skills: AWS, Datadog, ELK, GCP, GitHub Actions, GitLab CI, Go, Grafana, Jenkins, Kubernetes, Linux, Prometheus, Pulumi, Python, Rust, Terraform
Reposted 14 Days Ago
In-Office
Hawthorne, CA, USA
125K-175K Annually
Mid level
Aerospace • Other
The Site Reliability Engineer, GNC at SpaceX oversees mission-critical GNC products, operates servers, maintains HPC clusters, and enhances services and infrastructure to support space operations.
Top Skills: Ansible, Bazel, Docker, Gradle, Kubernetes, Linux, Make, npm, pip, Puppet, Python, Terraform, Vagrant
Reposted 14 Days Ago
In-Office
San Jose, CA, USA
Mid level
Artificial Intelligence • Hardware • Machine Learning • Natural Language Processing • Software • Generative AI
As a Cloud Site Reliability Engineer, you will ensure the reliability, performance, and scalability of AI inferencing services, participate in on-call rotations, manage cloud infrastructure, and automate CI/CD processes while collaborating on incident management and capacity planning.
Top Skills: Argo CD, CloudFormation, Datadog, Docker, ELK Stack, GitHub Actions, Go, Grafana, Java, Jenkins, Kubernetes, Prometheus, Python, Terraform
Reposted 15 Days Ago
In-Office
Westminster, CO, USA
106K-145K Annually
Mid level
Hardware • Information Technology • Other • Software • Analytics
Architect and operate ML/agent pipelines and infrastructure, deploy and monitor models at scale, pioneer MLOps/Agent Ops best practices, collaborate with domain experts, and test/optimize ML systems for production reliability and cost efficiency.
Top Skills: Bash Scripting, Containerization (e.g., Docker), Git/GitHub, Linux, Model Versioning, Monitoring, NumPy, pandas, Python, PyTorch, scikit-learn
15 Days Ago
In-Office or Remote
17 Locations
Mid level
Information Technology • Software • Web3 • Infrastructure as a Service (IaaS)
Operate and improve the Pod platform: respond to incidents, investigate root causes, build automation and observability, design monitoring/alerting, reduce alert fatigue, and drive reliability improvements across production systems.
Top Skills: Bash, CI/CD, Cloud, Docker, Grafana, Linux, PagerDuty, Prometheus, Python, Rust
15 Days Ago
In-Office
Seattle, WA, USA
180K-240K Annually
Senior level
Artificial Intelligence • Software • Generative AI
Lead reliability, scalability, and operational health of a production platform. Evolve Kubernetes, CI/CD, IaC, and observability. Build tooling and automation, improve monitoring/incident response, partner with engineering to identify and mitigate scaling risks, and influence platform direction across reliability, security, performance, and cost.
Top Skills: CI/CD, Cloud-Native Architecture, Container Orchestration, GitOps, GPU Provisioning, Incident Response, Infrastructure as Code, Kubernetes, Logging, Metrics, Multi-Cloud, Observability, Python, Tracing, TypeScript
Reposted 15 Days Ago
In-Office
New York, NY, USA
147K-310K Annually
Expert/Leader
Fintech • Financial Services
The Director of Splunk Platform Engineering & SRE owns the enterprise Splunk platform, drives incident resolution, optimizes systems, and mentors engineers, focusing on automation and performance.
Top Skills: Ansible, Git, Go, Java, Kubernetes, Linux/Unix, Moog, Prometheus, Python, Splunk
Reposted 15 Days Ago
Remote
United States
100K-110K Annually
Mid level
Healthtech • Software
The SRE Technical Project Manager will lead project delivery, incident management, automation processes, and uptime communication, partnering with SRE and development teams to ensure system stability and scalability.
Top Skills: AI Bots, Datadog, Jira, Jira Service Management, MS Teams, Opsgenie, PagerDuty
Reposted 15 Days Ago
Hybrid
Camden, NJ, USA
130K-150K Annually
Mid level
Information Technology • Logistics • Transportation • Analytics • Business Intelligence • 3PL: Third Party Logistics • Industrial
As a Site Reliability Engineer, you'll enhance reliability for Phenix WMS and automation systems, focusing on incident reduction and system health through observability and automation. Responsibilities include defining SLIs and SLOs, participating in incident response, and testing disaster recovery plans.
Top Skills: Ansible, Azure, Bash, CI/CD, Kubernetes, PowerShell, Python, Terraform
Reposted 15 Days Ago
In-Office
Wacker, IL, USA
94K-157K Annually
Junior
Financial Services
As a Site Reliability Engineer II, you will build, operate, and scale systems for CME Group's Clearing portfolio. Responsibilities include collaborating with teams, monitoring services, scripting for efficiency, and improving system performance, particularly during the migration to Google Cloud Platform.
Top Skills: Bash, Google Cloud Platform, Grafana, Kubernetes, Linux, OpenTelemetry, Prometheus, Python, Splunk
Reposted 15 Days Ago
In-Office
2 Locations
164K-222K Annually
Expert/Leader
Security
The Director of DevSecOps and SRE will lead teams in SRE, Cloud Infrastructure, and DevOps practices, focusing on automation, infrastructure reliability, and security policies while mentoring engineers and managing software projects.
Top Skills: AWS Cloud Technologies, GitLab, Grafana, Java, Kubernetes, Loki, Material UI, Postgres, Prometheus, RabbitMQ, React, Redux, Sentry, Spring, Tailwind, Terraform
Reposted 15 Days Ago
In-Office
San Jose, CA, USA
175K-250K Annually
Senior level
Artificial Intelligence • Robotics • Automation • Manufacturing
Responsible for managing and setting up internal systems infrastructure, migrating SaaS to self-hosted solutions, implementing monitoring systems, and ensuring security compliance.
Top Skills: Ansible, AWS, Azure, CloudFormation, Datadog, DNS, GCP, Grafana, HTTP, Linux/Unix, Prometheus, TCP/IP, Terraform
Reposted 15 Days Ago
In-Office
Santa Clara, CA, USA
168K-334K Annually
Expert/Leader
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
Responsible for developing incident management guidelines, supporting production systems, defining reliability metrics, and driving automation for high service availability.
Top Skills: Go, Grafana, Perl, Prometheus, Python, Ruby
Reposted 15 Days Ago
In-Office
Herndon, VA, USA
87K-198K Annually
Senior level
Information Technology
As a Site Reliability Engineer, you'll develop resilient infrastructure, automate tasks, handle incident response, and support classified environments for the Intelligence Community.
Top Skills: Argo CD, Bitbucket, Elasticsearch, GitLab, Java Spring Boot, Kafka, Kubernetes, MongoDB, NiFi
Reposted 15 Days Ago
In-Office
2 Locations
62K-141K Annually
Mid level
Information Technology
The Site Reliability Engineer will enhance infrastructure resilience, automate processes, and implement monitoring tools to support the Intelligence Community.
Top Skills: AWS, Confluence, Docker, Git, Jenkins, Jira, Kubernetes, Linux, Nessus, Packer
Reposted 16 Days Ago
Remote
USA
110K-140K Annually
Senior level
Real Estate • Financial Services • PropTech
Support and optimize products migrated to AWS, implement cloud best practices, maintain operational coverage, enhance automation, observability, CI/CD/GitOps, and security. Collaborate with development and platform teams to scale, troubleshoot, and ensure reliable SaaS operations.
Top Skills: AMIs, Argo CD, AWS, AWS Elastic Beanstalk, AWS Transfer Family, Azure DevOps, Bash, CloudWatch, curl, Docker, EC2, EKS, Flux CD, Git, GitOps, HTTP, Istio, Kubernetes, Linkerd, Load Balancer, PowerShell, Python, RDS, SQL, Terraform, wget