Top Site Reliability Engineer Jobs

8 Days Ago
Remote or Hybrid
US
180K-240K Annually
Senior level
Aerospace • Defense
Lead design, implementation, and operation of scalable, secure hybrid-cloud infrastructure for satellite ground systems. Improve developer experience, automate CI/CD and IaC, own observability, troubleshoot reliability issues, and collaborate with developers and satellite operators to advance SatDevOps practices.
Top Skills: C/C++, CI/CD, GCP, Go, Grafana, Infrastructure as Code (IaC), Java, Kubernetes, Loki, Prometheus, Python, Rust, Software Defined Networking (SDN)
8 Days Ago
Remote
United States
160K-180K Annually
Senior level
Software
Own and improve platform performance, reliability, and deployment automation. Manage cloud infrastructure, implement IaC, monitor systems with observability tools, provide operational support for distributed applications, and integrate production learnings into development workflows.
Top Skills: AIOps Tooling, AWS Elastic Containers, AWS RDS, AWS S3, Claude Code, Claude Cowork, Datadog, Harness Engineering, Infrastructure as Code, Kubernetes, LLMs, Prompt Engineering, Rigor, Splunk
Reposted 8 Days Ago
In-Office
Reston, VA, USA
133K-238K Annually
Senior level
Cloud • Fintech • HR Tech
The Senior Site Reliability Engineer will ensure platform health, automate operations, maintain security, and support development teams, optimizing CI/CD processes and collaborating across time zones.
Top Skills: Amazon Web Services, Argo CD, C#, Go, Kubernetes, Python, Ruby, Rust, Terraform
Reposted 8 Days Ago
Remote
United States
Senior level
Real Estate • Software
As a Senior Site Reliability Engineer, you'll enhance system performance and reliability, optimize databases, and implement AI-assisted solutions for operational efficiency.
Top Skills: Ansible, Datadog, ELK, Grafana, Kubernetes, Linux, MariaDB, MySQL, Postgres, Prometheus, Puppet, Python, Ruby on Rails, Ruby, Terraform, Terragrunt
9 Days Ago
In-Office
3 Locations
147K-278K Annually
Senior level
Cloud • Information Technology • Internet of Things • Professional Services • Software
Operate and scale ThousandEyes Federal region infrastructure in a FedRAMP-compliant AWS environment. Design, deploy, and automate cloud-native services, implement IaC, monitor and audit systems, collaborate with security teams to remediate vulnerabilities, participate in 24x7 incident response and capacity planning, and ensure platform reliability, performance, and compliance.
Top Skills: AWS, FedRAMP, Go, Kubernetes, Linux, Puppet, Python, Terraform, Unix, US GovCloud
Reposted 9 Days Ago
In-Office
Washington, DC, USA
185K-230K Annually
Senior level
Information Technology • Consulting
As a Senior Site Reliability Engineer, you'll design and maintain critical applications, develop CI/CD pipelines, and ensure high availability while leading incident response and providing innovative solutions to meet customer needs.
Top Skills: Ansible, Bash, Desired State Configuration, GitLab CI/CD, Kubernetes, VMware
Reposted 9 Days Ago
In-Office
Charlotte, NC, USA
160K-180K Annually
Senior level
Fintech • Financial Services
The Senior Site Reliability Engineer will enhance system reliability, automate operations, ensure compliance, and collaborate with engineering teams to improve production systems at AssetMark.
Top Skills: Alerting Tools, AWS, Azure, C#, CI/CD, Docker, GCP, Infrastructure-as-Code, Java, Kubernetes, Logging Tools, Monitoring Tools, Python, Tracing Tools
Reposted 9 Days Ago
In-Office
Hawthorne, CA, USA
160K-220K Annually
Senior level
Aerospace • Other
The Sr. Site Reliability Engineer at SpaceX is responsible for enhancing distributed systems, managing large data clusters, and ensuring software reliability on the Starlink project, focusing on customer experience and operational efficiency.
Top Skills: Apache Kafka, C#, Flink, Go, HBase, HDFS, Istio, Java, Kubernetes, Linux, Python, Scala, Spark
Reposted 9 Days Ago
In-Office
Boston, MA, USA
134K-215K Annually
Senior level
Artificial Intelligence • Cloud • Social Impact • Software • Wearables
The Senior Site Reliability Engineer ensures the reliability and performance of cloud-native Kubernetes platforms by building tools, facilitating self-service for engineers, and promoting best practices.
Top Skills: ArgoCD, AWS, Azure, C#, CI/CD, Git, Go, Java, Kubernetes, Pulumi, Python, Terraform
Reposted 9 Days Ago
In-Office
Atlanta, GA, USA
Senior level
Artificial Intelligence • Cloud • Social Impact • Software • Wearables
Design and build cloud infrastructure, automate platforms, mentor engineers, and enhance reliability and performance for Axon's products.
Top Skills: APM, AWS, Azure, CI/CD, CloudFormation, Go, Kubernetes, Python, Terraform
Reposted 9 Days Ago
In-Office
Boston, MA, USA
150K-180K Annually
Senior level
Artificial Intelligence • Cloud • Social Impact • Software • Wearables
As a Senior Site Reliability Engineer, you will design cloud infrastructure, develop automation tools, write production code, and mentor engineers while managing multi-cloud environments and improving reliability.
Top Skills: APM, AWS, Azure, CDK, CI/CD, CloudFormation, Go, Kubernetes, Python, Terraform
Reposted 9 Days Ago
In-Office
Seattle, WA, USA
150K-180K Annually
Senior level
Artificial Intelligence • Cloud • Social Impact • Software • Wearables
As a Senior Site Reliability Engineer, you'll design cloud infrastructure, lead automation initiatives, and enhance operational efficiency while mentoring others and handling incident responses.
Top Skills: AWS, Azure, CI/CD, CloudFormation, Go, Kubernetes, Python, Terraform
Reposted 9 Days Ago
Remote
United States
141K-208K Annually
Senior level
Database • Analytics
This role involves ensuring the reliability and performance of ClickHouse's cloud infrastructure, collaborating with engineering teams, incident management, and driving continuous improvement in service availability.
Top Skills: Ansible, AWS, Azure, ClickHouse, Docker Swarm, Go, Google Cloud Platform, Kubernetes, Puppet, Python, Terraform
10 Days Ago
In-Office
New York City, NY, USA
156K-262K Annually
Senior level
Artificial Intelligence • Information Technology • Consulting
Own and operate production infrastructure: manage Kubernetes across regions, maintain IaC and GitOps CI/CD workflows, optimize real-time data pipelines, build observability and alerting, debug incidents, and lead cloud cost and capacity planning for a small engineering team.
Top Skills: CI/CD, GitOps, Kubernetes, Observability (Logging, Metrics, Alerting), Terraform
10 Days Ago
In-Office
San Jose, CA, USA
Senior level
Software
Lead architecture, design, and evolution of a global multi-region cloud SRE platform for GPU/AI compute. Author and maintain platform architecture, enforce design invariants, review framework changes, run plugin framework, decide tier placements, coordinate with cloud teams and security, produce pre-flight designs, and shepherd implementations through engineering squads.
Top Skills: BMC, DCGM, DDN, GitOps, GPU Operator, InfiniBand, IPMI, KubeRay, Kubernetes, Kueue, Lustre, MIG, NCCL, NetApp, NVLink, NVMe-oF, NVSwitch, Pure, Ray, Redfish, RoCE, Slurm, Subnet Manager, VAST, vGPU, Volcano, Xid, ZTP
10 Days Ago
In-Office
San Jose, CA, USA
Senior level
Software
Lead the design and implementation of a global public cloud SRE platform for AI and compute workloads. Own architecture and production engineering for observability, cluster health, remediation, lifecycle, secrets, CI/CD, backup/DR, and automation. Collaborate with cross-functional teams to build scalable, reliable multi-region services and run them in production (on-call).
Top Skills: Argo, AWS KMS, BMC, Cosign, CRDT, Datadog, DCGM, DDN, Elasticsearch, Flux, GCP KMS, Go, HashiCorp Vault, Helm, InfiniBand, IPMI, Jaeger, Java, KubeRay, Kubernetes, Kubernetes Operator (CRD/Controller), Kueue, Kustomize, Loki, Lustre, Mimir, mTLS, NCCL, NetApp, NVMe-oF, OpenTelemetry, Paxos, Prometheus, Prometheus Query, Pure, Python, Raft, Ray, Redfish, RoCE, Rust, Slurm, SQL, Tempo, Thanos, VAST, VictoriaMetrics, Volcano
10 Days Ago
Remote
United States
Senior level
Information Technology • Security • Cybersecurity
Operate and harden regulated cloud platforms (FedRAMP/DoD IL) by owning production reliability, designing resilient infrastructure, leading incident response and postmortems, automating compliance (NIST 800-53/STIG), supporting ATO and continuous monitoring, building secure IaC and CI/CD pipelines, and improving observability and operational tooling.
Top Skills: AWS GovCloud, Bash, CI/CD, Container Hardening, DoD IL4, DoD IL5, FedRAMP High, GitOps, Go, Grafana, Image Security, Kubernetes, Linux/Unix, NIST 800-53, Prometheus, Python, STIG, Terraform
10 Days Ago
In-Office
Palo Alto, CA, USA
Senior level
Fintech • Payments • Software • Financial Services
Lead Site Reliability Engineer responsible for ensuring platform scalability and uptime on AWS. Own CI/CD and GitHub repository practices, run deployment pipelines, manage incidents and post-mortems, implement observability and logging, and coordinate technical alignment across US and international teams with bilingual communication.
Top Skills: Alerting, AWS, CI/CD, Deployment Pipelines, Git, GitHub Actions, Log Management, Monitoring Tools, Observability, Scripting
10 Days Ago
In-Office
San Francisco, CA, USA
Senior level
Fintech • Payments • Software • Financial Services
Senior SRE responsible for ensuring platform scalability, reliability, and runtime efficiency on AWS. Own CI/CD and GitHub repo workflows, lead incident response and post-mortems, implement observability/monitoring and logging, and collaborate cross-border using bilingual Mandarin and English.
Top Skills: Alerting, AWS, CI/CD, Deployment Pipelines, Git, GitHub Actions, Logging, Monitoring, Observability, Scripting
Reposted 10 Days Ago
In-Office
55445, Minneapolis, MN, USA
98K-176K Annually
Senior level
eCommerce • Other • Retail
As a Senior Site Reliability Engineer, you will build and support platforms for reliable digital experiences, improve system reliability, and guide technical decisions within the team.
Top Skills: AWS, Azure, Bash, Docker, Fastly, GCP, Git, GitHub Actions, Go, Kubernetes, Next.js, Node.js, React
Reposted 10 Days Ago
In-Office
El Segundo, CA, USA
183K-235K Annually
Senior level
Artificial Intelligence • Machine Learning • Security • Software
The Senior Staff Site Reliability Engineer will be responsible for ensuring system reliability, debugging issues, mentoring the engineering team, and maintaining infrastructure and CI/CD pipelines.
Top Skills: AWS, Datadog, Docker, GitHub Actions, Grafana, Helm, Kotlin, Kubernetes, Postgres, Prometheus, Python, Rust, Terraform, Terragrunt, TypeScript
11 Days Ago
Remote
United States
135K-170K Annually
Senior level
Big Data • Analytics
Own production reliability for customer-facing radar and weather data services across Azure, colocation, and edge Kubernetes. Refactor C#/.NET services for multi-replica safety, design multi-cluster HA, operate self-managed Kubernetes, improve observability and automation, lead incident response and postmortems, and drive operational excellence and capacity planning.
Top Skills: .NET, Ansible, C#, Datadog, GPU-Enabled Workloads, Grafana, Helm, Istio, Kubernetes, Loki, Longhorn, Azure, NATS, Octopus Deploy, OpenTelemetry, PostGIS, Postgres, Prometheus, RabbitMQ, Rancher, RKE2, Terraform
Reposted 11 Days Ago
Remote
US
Senior level
Artificial Intelligence
Own operational excellence for cloud infrastructure: run incident management, improve reliability through automation, own a platform domain (e.g., Kubernetes, Temporal, observability), manage vendor and cost relationships, and deliver measurable reductions in incidents and costs within 12 months.
Top Skills: AWS, Kubernetes, LLM APIs, MongoDB, Observability, Python, Temporal
Reposted 11 Days Ago
In-Office
San Francisco, CA, USA
117K-209K Annually
Senior level
Big Data • Cloud • Digital Media • Machine Learning • Mobile • Software • Industrial
Lead reliability for Autodesk GovCloud services by deploying, operating, and automating production systems. Define SLOs/SLIs, build observability and automation, run incident response and on-call rotation, ensure compliance (FedRAMP), perform resilience testing and toil reduction, and collaborate across engineering, security, and platform teams to improve service reliability and operability.
Top Skills: APIs, AWS, AWS GovCloud, Azure, Bash, Caching Technologies, CI/CD, CloudWatch, Containers, Databases, Datadog, DNS, Dynatrace, FedRAMP, Go, IL4, IL5, Infrastructure as Code, Java, Kubernetes, Load Balancing, Messaging Systems, Networking, PowerShell, Python, Splunk, Storage Platforms
Reposted 11 Days Ago
Remote
Idaho, USA
117K-209K Annually
Senior level
Big Data • Cloud • Digital Media • Machine Learning • Mobile • Software • Industrial
Lead reliability for production services in Autodesk GovCloud: deploy, operate, and automate cloud services; define SLOs/SLIs and observability; drive incident response, resilience testing, and toil reduction; ensure compliance (FedRAMP) and participate in 24x7 on-call rotation.
Top Skills: APIs, AWS, AWS GovCloud, Azure, Bash, CI/CD, CloudWatch, Containers, Datadog, DNS, Dynatrace, Go, Infrastructure as Code, Java, Kubernetes, Load Balancing, Networking, PowerShell, Python, Splunk