Site Reliability Engineer - Automation

Posted 6 Days Ago
Memphis, TN
In-Office
Senior level
Information Technology
The Role
As an SRE - Automation Specialist, you will automate firmware upgrades, develop scripts, identify issues, and enhance datacenter efficiency while collaborating with cross-functional teams.
About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

As a Site Reliability Engineer in Automation, you will focus on automating firmware upgrades, scripting solutions for hardware from key vendors like NVIDIA, Dell, Supermicro, and HP, and proactively identifying issues to implement automated fixes. Leveraging skills in Python, Bash, Linux, and Kubernetes, you will enhance datacenter efficiency, reduce manual interventions, and support scalable AI infrastructure at xAI.

Responsibilities
  • Develop and maintain scripts in Python and Bash for handling firmware packages, performing upgrades, and automating the entire process across Linux and Kubernetes environments.
  • Work with hardware from vendors such as NVIDIA, Dell, Supermicro, and HP to ensure seamless firmware integration, testing, and deployment in the datacenter.
  • Identify operational problems in real time, design automated fixes or workflows to resolve them, and implement scalable solutions to prevent recurrence.
  • Collaborate with Datacenter Operations Technicians to deploy automation tools, troubleshoot firmware-related issues, and optimize processes for high-availability systems.
  • Integrate automation scripts into CI/CD pipelines or orchestration tools like Kubernetes for efficient scaling and management.
  • Monitor and refine automated processes, ensuring they align with datacenter reliability goals and minimize downtime.
  • Document automation scripts, firmware upgrade procedures, and problem-solving approaches to build a reusable knowledge base for the team.
  • Participate in on-call rotations and incident response, applying automation to accelerate resolutions in the Memphis datacenter.
Required Qualifications
  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • 5+ years of experience in site reliability engineering or automation roles, preferably in datacenter or cloud environments.
  • Proficiency in Python, Bash, Linux, and Kubernetes for scripting, automation, and orchestration.
  • Hands-on experience with firmware packages, including writing scripts for upgrades and automating deployment processes.
  • Familiarity with hardware from vendors like NVIDIA, Dell, Supermicro, and HP, including integration and troubleshooting in production settings.
  • Strong problem-solving skills with a proven ability to identify issues and automate fixes to improve system efficiency.
  • Experience in high-performance computing or AI infrastructure environments.
  • Excellent collaboration skills for working with cross-functional teams in fast-paced settings.
Preferred Qualifications
  • Experience automating firmware management in large-scale datacenters or supercomputing clusters.
  • Knowledge of additional tools like Ansible, Terraform, and ArgoCD, or other containerization tools, for enhanced automation.
  • Prior work in a startup or tech company like xAI, with contributions to scalable automation systems.

xAI is an equal opportunity employer.

Top Skills

Ansible
ArgoCD
Bash
Kubernetes
Linux
Python
Terraform

The Company
HQ: San Francisco, CA
96 Employees

What We Do

Understand the Universe
