Site Reliability Engineer - Hardware

Memphis, TN
In-Office
Senior level
Information Technology
The Role
As a Hardware Specialist SRE, you'll ensure hardware reliability for xAI, analyze firmware, manage vendor relations, and evaluate emerging technologies in datacenter operations.
Summary Generated by Built In
About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

As a Site Reliability Engineer focused on Hardware, you will serve as an expert in firmware, hardware specifications, vendor relations, and failure analysis. You will proactively identify and resolve hardware issues, manage RMA processes, and stay ahead of emerging hardware technologies to support xAI's datacenter operations. This role demands deep technical expertise in hardware diagnostics, vendor negotiations, and forward-looking hardware evaluation.

Responsibilities
  • Analyze firmware packages and hardware specifications for upcoming releases to ensure compatibility, performance, and reliability in xAI's datacenter environment.
  • Investigate and diagnose hardware failures, including "grey failures" (ambiguous or intermittent issues), and prove through rigorous testing and data analysis that they are true hardware defects.
  • Manage vendor relationships, including initiating RMA (Return Merchandise Authorization) claims, negotiating beyond standard processes when necessary, and holding vendors accountable for resolutions.
  • Collaborate with Datacenter Operations Technicians to troubleshoot, repair, and optimize hardware systems in real time.
  • Research and evaluate unreleased next-generation hardware technologies, providing insights and recommendations to inform xAI's infrastructure roadmap.
  • Develop and implement monitoring tools, scripts, and processes to detect hardware anomalies early and minimize downtime.
  • Document failure modes, RMA outcomes, and hardware evaluations to build a knowledge base for the team.
  • Participate in on-call rotations and incident response for hardware-related issues in the Memphis datacenter.
Required Qualifications
  • Bachelor's degree in Systems Engineering, Electrical Engineering, Computer Science, or a related field (or equivalent experience).
  • 5+ years of experience in hardware reliability engineering, preferably in high-performance computing or datacenter environments.
  • Proven expertise in firmware analysis, hardware specifications review, and release validation.
  • Strong experience with RMA processes, including filing claims, vendor negotiations, and pushing for resolutions outside standard protocols.
  • Demonstrated ability to diagnose and prove complex hardware failures, including grey or intermittent issues, using test equipment such as logic analyzers or diagnostic software.
  • Familiarity with datacenter hardware components (e.g., servers, GPUs, networking equipment) and emerging technologies.
  • Proficiency in scripting languages (e.g., Python, Bash) for automation and analysis.
  • Excellent problem-solving skills with a data-driven approach to reliability engineering.
  • Ability to work collaboratively with cross-functional teams, including operations technicians.
Preferred Qualifications
  • Experience in AI/ML infrastructure or supercomputing environments.
  • Knowledge of vendor ecosystems (e.g., NVIDIA, Dell, HP, Supermicro) and supply chain management.
  • Certifications in hardware engineering or reliability (e.g., CRE, CompTIA Server+).
  • Prior work in a fast-paced startup or tech company like xAI.

xAI is an equal opportunity employer. For details on data processing, view our Recruitment Privacy Notice.

Top Skills

Bash
Firmware Analysis
Python

The Company
HQ: San Francisco, CA
96 Employees

What We Do

Understand the Universe
