Senior Site Reliability Engineer

Reposted Yesterday
2 Locations
In-Office or Remote
Senior level
Artificial Intelligence • Computer Vision • Machine Learning
The Role
As a Senior Site Reliability Engineer, you will own AWS infrastructure, ensure reliability, and enhance deployment processes, primarily working with Kubernetes and Terraform.
Summary Generated by Built In
About the Role

Before an autonomous vehicle navigates a busy intersection, before a robot learns to pick and place in a warehouse, before any Physical AI system is trusted in the real world, it has to prove itself in ours. Parallel Domain builds the platform that validates the next generation of autonomous systems in high-fidelity virtual environments, and the infrastructure underneath that platform is what makes simulation at scale possible.

We're hiring a Senior Site Reliability Engineer to help build and operate that infrastructure. This role sits at the core of how we run large-scale, distributed simulation workloads for autonomous-systems testing and validation. You'll work across multi-region AWS infrastructure, operate Kubernetes at scale, and contribute directly to reliability, security, and deployment systems that the rest of the engineering org depends on.

This is a hands-on role with the broad ownership typical of a startup. You'll partner closely with platform, simulation, and ML teams to keep the system running smoothly and evolving. We're growing the team, with two of these roles open, and the work is substantive: multi-region GPU scheduling, Windows workloads on Kubernetes, large-scale batch simulation, and an enterprise product direction that will require rethinking parts of how we deploy and operate.

Responsibilities

  • Infrastructure ownership and cloud operations. Design, build, and maintain multi-region AWS infrastructure using Terraform. Operate and scale EKS clusters across production regions: autoscaling, node lifecycle, workload health. Manage networking across environments: VPC design, DNS, load balancing, and cross-region connectivity. Support infrastructure changes, migrations, and expansions into new regions. Contribute to and improve GitOps-based deployment workflows using GitHub Actions, Helm, and Kustomize.

  • Reliability engineering and incident response. Help build and run incident management processes: severity definitions, escalation paths, on-call practices. Lead incident response, debugging, and root-cause analysis. Write postmortems and drive systemic reliability improvements from what they surface. Improve observability across metrics, logging, tracing, and dashboards. Support GPU and batch workloads running on Kubernetes.

  • Security and access management. Provide security-conscious feedback on platform architecture decisions. Own cloud IAM governance: roles, policies, and access boundaries across accounts and services. Lead compliance-adjacent work including audit-readiness, partner certification requirements, and supporting responses to customer security questionnaires.

  • Platform tooling and developer experience. Improve CI/CD pipelines and infrastructure validation. Support engineers with infrastructure debugging, environment setup, and performance issues. Contribute to tooling and automation in Python and Bash. Take on adjacent responsibilities as needed in a startup environment.

Required Qualifications

  • Experience. 5+ years in SRE, DevOps, or infrastructure engineering roles, with a track record of operating production systems across multiple regions.

  • Terraform. Modules, state management, and multi-environment patterns.

  • AWS depth. Solid experience across VPC, IAM, EKS, S3, and CloudWatch.

  • Kubernetes expertise. Cluster operations, autoscaling, RBAC, and Helm.

  • CI/CD and GitOps. Experience with GitHub Actions, ArgoCD, or similar workflows.

  • Networking fundamentals. CIDR, DNS, load balancing, VPN, and cross-region connectivity.

  • Observability. Experience with tooling such as Prometheus and Grafana.

  • Scripting. Comfort with Python and Bash for tooling and automation.

  • Cross-platform familiarity. Working knowledge of both Linux and Windows environments. Operational experience supporting Windows-based workloads is a meaningful advantage.

  • Pragmatism and ownership. Comfortable in a fast-moving startup with evolving priorities. You take ownership of systems while collaborating closely with other teams, and you're pragmatic about tradeoffs between speed, reliability, and complexity.

Preferred Qualifications

  • Windows on Kubernetes. Experience with Windows node pools, Windows AMIs, and GPU-adjacent components on K8s.

  • GPU scheduling. Familiarity with GPU scheduling on Kubernetes, including NVIDIA device plugin configuration.

  • Domain workloads. Experience supporting simulation, ML, or rendering workloads in cloud infrastructure.

  • AWS extras. Exposure to AWS Storage Gateway, Active Directory integrations, or AWS Transfer Family.

  • Service mesh. Familiarity with service proxy or service mesh patterns.

  • Container OS. Experience with container-optimized OS images (e.g., Bottlerocket) and image-building tools such as Packer.

  • Cost optimization. Cloud cost optimization at scale.

Core Tools

    Terraform · AWS · Kubernetes · Helm · Kustomize · ArgoCD · GitHub Actions · Prometheus · Grafana · Docker · Python · Bash

What Makes a Great Candidate

    You think in failure modes and proactively surface issues. You hold a principled view on security and push back constructively when designs introduce unnecessary risk. You communicate clearly across engineering, product, and customer-facing teams, flagging issues with urgency proportional to customer impact. You take end-to-end ownership of complex efforts and know when to push for the clean solution versus the pragmatic one.

Skills Required

  • 5+ years in SRE, DevOps, or infrastructure engineering roles
  • Experience with Terraform modules and state management
  • Solid experience across AWS services such as VPC, IAM, and EKS
  • Expertise in Kubernetes operations, including autoscaling
  • Experience with CI/CD tools such as GitHub Actions
  • Understanding of networking fundamentals such as CIDR and DNS
  • Familiarity with observability tools such as Prometheus and Grafana
  • Proficiency in scripting with Python and Bash
  • Operational experience supporting Windows workloads
  • Experience with GPU scheduling on Kubernetes
  • Familiarity with service mesh patterns

The Company
HQ: Palo Alto, CA
67 Employees
Year Founded: 2017

What We Do

Training and testing autonomous systems in the real world is a slow, expensive and cumbersome process. Parallel Domain is the smartest way to prepare both your machines and human operators for the real world, while minimizing the time and miles spent there. Connect to the Parallel Domain API and tap into the power of synthetic data to accelerate your autonomous system development. Parallel Domain works with perception, machine learning, data operations, and simulation teams at autonomous systems companies, from autonomous vehicles to delivery drones. Our platform generates synthetic labeled data sets, simulation worlds, and controllable sensor feeds so they can develop, train, and test their algorithms safely before putting these systems into the real world. #syntheticdata #autonomy #AI #computervision #AV #ADAS #machinelearning

