Site Reliability Engineer - Automation

Sorry, this job was removed at 06:17 p.m. (CST) on Wednesday, Feb 04, 2026
Memphis, TN, USA
In-Office
Information Technology
The Role
About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

As a Site Reliability Engineer in Automation, you will focus on automating firmware upgrades, scripting solutions for hardware from key vendors like NVIDIA, Dell, Supermicro, and HP, and proactively identifying issues to implement automated fixes. Leveraging skills in Python, Bash, Linux, and Kubernetes, you will enhance datacenter efficiency, reduce manual interventions, and support scalable AI infrastructure at xAI.

Responsibilities
  • Develop and maintain scripts in Python and Bash for handling firmware packages, performing upgrades, and automating the entire process across Linux and Kubernetes environments.
  • Work with hardware from vendors such as NVIDIA, Dell, Supermicro, and HP to ensure seamless firmware integration, testing, and deployment in the datacenter.
  • Identify operational problems in real time, design automated fixes or workflows to resolve them, and implement scalable solutions to prevent recurrence.
  • Collaborate with Datacenter Operations Technicians to deploy automation tools, troubleshoot firmware-related issues, and optimize processes for high-availability systems.
  • Integrate automation scripts into CI/CD pipelines or orchestration tools like Kubernetes for efficient scaling and management.
  • Monitor and refine automated processes, ensuring they align with datacenter reliability goals and minimize downtime.
  • Document automation scripts, firmware upgrade procedures, and problem-solving approaches to build a reusable knowledge base for the team.
  • Participate in on-call rotations and incident response, applying automation to accelerate resolutions in the Memphis datacenter.
Required Qualifications
  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • 5+ years of experience in site reliability engineering or automation roles, preferably in datacenter or cloud environments.
  • Proficiency in Python, Bash, Linux, and Kubernetes for scripting, automation, and orchestration.
  • Hands-on experience with firmware packages, including writing scripts for upgrades and automating deployment processes.
  • Familiarity with hardware from vendors like NVIDIA, Dell, Supermicro, and HP, including integration and troubleshooting in production settings.
  • Strong problem-solving skills with a proven ability to identify issues and automate fixes to improve system efficiency.
  • Experience in high-performance computing or AI infrastructure environments.
  • Excellent collaboration skills for working with cross-functional teams in fast-paced settings.
Preferred Qualifications
  • Experience automating firmware management in large-scale datacenters or supercomputing clusters.
  • Knowledge of additional tools such as Ansible, Terraform, and ArgoCD, or of other containerization tools, for enhanced automation.
  • Prior work in a startup or tech company like xAI, with contributions to scalable automation systems.

xAI is an equal opportunity employer. For details on data processing, view our Recruitment Privacy Notice.

The Company
HQ: San Francisco, CA
96 Employees

What We Do

Understand the Universe
