Site Reliability Engineer - US Government

2 Locations
In-Office
180K-440K Annually
Senior level
Information Technology
The Role
Design and operate secure infrastructure for government projects. Optimize performance, manage storage with IaC tools, and ensure system reliability in high-security environments.
About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

ABOUT THE ROLE:

We are seeking a highly skilled Senior Infrastructure Engineer to join our US Government Team, focused on designing, building, and operating secure, scalable infrastructure for critical government projects. In this role, you will develop and manage training and inference clusters, as well as highly reliable applications, across bare metal, classified cloud, and hybrid cloud architectures. You will leverage your expertise in Kubernetes and GPU hardware to deliver robust, secure systems that support large-scale AI workloads while meeting stringent federal compliance requirements. This role demands a passion for automation, observability, and ensuring system integrity in a fast-paced, high-security environment.

RESPONSIBILITIES:
  • Develop and optimize software to provision and manage xAI’s infrastructure across on-premises, virtual machine, and classified cloud environments, enabling efficient scaling for US government initiatives.
  • Enhance the reliability, performance, and cost-effectiveness of infrastructure to support large-scale AI and application workloads in secure, classified settings.
  • Collaborate with xAI engineers to understand workload requirements and design tailored solutions that meet government-specific needs and compliance standards.
  • Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems, adhering to federal protocols.
  • Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible, with a focus on secure data handling.
  • Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs, while maintaining security and compliance.
  • This is an in-person role based in Palo Alto, CA or Washington, DC, with up to 50% travel required.
BASIC QUALIFICATIONS:
  • Active Top Secret (TS) security clearance.
  • 5+ years of experience as an Infrastructure Engineer, Site Reliability Engineer, or similar role, with a focus on building and maintaining reliable, scalable systems, preferably in secure or government environments.
  • Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible.
  • Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components.
  • Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs.
  • Excellent communication and documentation skills, with the ability to convey information concisely and accurately while handling sensitive material appropriately.
PREFERRED SKILLS AND EXPERIENCE:
  • Deep familiarity with installing and using GPU hardware, including setting up drivers, debugging issues, and ensuring reliability.
  • Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments in classified or federal settings.
  • Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience in government projects.
  • Proficiency with tools such as Kyverno or ArgoCD, or with Go programming, for infrastructure automation.
  • Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges in secure environments.
  • Passion for problem-solving and a proactive drive to deliver impactful results while adhering to security protocols.
  • Certifications in security-related fields (e.g., CISSP) or experience in secure federal environments.
COMPENSATION AND BENEFITS:

$180,000 - $440,000 USD

Base salary is just one part of our total rewards package at xAI, which also includes equity; comprehensive medical, vision, and dental coverage; access to a 401(k) retirement plan; short- and long-term disability insurance; life insurance; and various other discounts and perks.

xAI is an equal opportunity employer. For details on data processing, view our Recruitment Privacy Notice.

Top Skills

Ansible
ArgoCD
Go
GPU Hardware
Kubernetes
Kyverno
Pulumi
Terraform

The Company
HQ: San Francisco, CA
96 Employees

What We Do

Understand the Universe
