Site Reliability Engineer - High Performance Computing / AI-ML

Sorry, this job was removed at 08:13 p.m. (CST) on Monday, Aug 18, 2025
5 Locations
In-Office
120K-297K Annually
Social Media • Software
We serve the public conversation. We believe real change starts with conversation.
The Role
Role: Site Reliability Engineer - HPC / AI-ML (All Levels)
Location: Palo Alto, New York, Seattle or Austin
Base Salary Range: $120,000 to $297,000 + Equity

Who We Are:

At X, we’re pioneering the frontier of technology with our innovative Everything App. Our mission is to revolutionize how people connect, share ideas, and engage in meaningful conversations. We champion freedom of speech and strive to create a platform that embraces diverse perspectives. Our commitment is to foster open dialogue and empower individuals to express themselves freely.

What You’ll Do:

As a Site Reliability Engineer (SRE) supporting HPC (High Performance Computing) + AI/ML initiatives at X, you will play a crucial role in maintaining and enhancing the reliability, availability, and performance of our large-scale systems. Your responsibilities will include:

  • Managing and troubleshooting large-scale clusters to ensure the stability and efficiency of our platform (primarily Linux + Kubernetes)

  • Collaborating with cross-functional teams, including hardware engineers and software developers, to support and improve our infrastructure

  • Automating the provisioning and deployment of systems to enhance long-term health and scalability

  • Ensuring the robustness of our HPC environments and storage clusters

  • Writing and maintaining scripts and tools for automation and monitoring

  • Addressing system failures and performance issues, identifying root causes, and implementing preventive measures

  • Working closely with end users to understand changing needs as our environment evolves

Who You Are:

We're looking for exceptional engineers who are passionate about our mission and have a strong desire to make a meaningful impact. The ideal candidate will have:

  • 2+ years of professional software development experience 

  • Extensive experience with Kubernetes and container orchestration

  • Proficiency in one or more object-oriented programming languages (e.g., Python, Java, C++, Scala)

  • Proficiency in scripting languages (Python, Bash, etc.)

  • Strong experience in configuration management (e.g., Puppet, Ansible, Chef)

  • Familiarity with Ethernet networking at scale and distributed systems

  • Strong troubleshooting skills and experience with HPC environments

  • Experience managing large-scale systems, ideally supporting thousands of machines

  • Working understanding of the storage systems required to support such environments

  • Experience with various GPU / accelerator architectures and ability to optimize performance on such platforms

  • Ability to think outside the box and come up with innovative solutions to complicated problems

  • Extremely committed and willing to work in a fast-paced environment

  • Excellent communication and interpersonal skills

At X, our small but fast-paced team values innovation, creativity, and a strong commitment to our mission. As a Site Reliability Engineer, you'll have the opportunity to make a significant impact on the future of X and our aspiration to build the Everything App.

The Company
HQ: San Francisco, CA
1,500 Employees
Year Founded: 2006

What We Do

We serve the public conversation. We believe real change starts with conversation. That’s why it matters to us that people have a free and safe space to talk. We put people first. Be you, really. That’s how we build trust. Together we’re creating a culture that’s supportive and respectful, with a pretty cool vibe. Sure, we’re not perfect, we’re people. But we’re open and honest about who we are and what we do.

Why Work With Us

Life’s not about a job, it’s about purpose. We believe real change starts with conversation. Here, your voice matters. Come as you are and together we’ll do what’s right (not what’s easy) to serve the public conversation.
