Boson AI

Site Reliability Engineer, AI/ML Infrastructure

Posted 4 Days Ago

Santa Clara, CA

In-Office

150K-250K Annually

Senior level

Artificial Intelligence • Machine Learning

The Role

As a Senior Site Reliability Engineer, you will manage HPC cluster operations, deploy infrastructure-as-code solutions, support research teams, and develop automation tools.

Summary Generated by Built In

About The Role

We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter, packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers.

You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building, testing, deploying, and keeping everything running smoothly. That means troubleshooting issues as they arise, monitoring performance, developing automation to make our lives easier, and working closely with engineering and science teams to ensure they have what they need. You'll also help us plan for future capacity and evaluate new technologies as we continue to scale.

Responsibilities

Manage and optimize HPC cluster operations.
Deploy and maintain infrastructure-as-code solutions.
Support ML/research teams with cluster usage optimization.
Operate, troubleshoot, and optimize Ceph storage clusters.
Develop automation and tooling.

Minimum Qualifications

5+ years of experience in SRE or HPC operations.
Proficiency in Linux systems administration (Ubuntu/Debian).
Experience with Kubernetes and container orchestration.
Experience deploying and maintaining Ceph clusters larger than 1PB.
Knowledge of security best practices in multi-tenant environments.
Understanding of L2/L3 networking fundamentals.
Skilled in Python and Bash scripting.

Preferred Qualifications

Experience with infrastructure-as-code tools (Ansible/Terraform).
Experience with GitOps (Helm, ArgoCD).
Strong grasp of RDMA, InfiniBand, and GPUDirect technologies.
Familiarity with deep learning frameworks such as PyTorch and TensorFlow.
Familiarity with at least one cloud platform: AWS, Azure, or GCP.

If you're a natural problem-solver with a passion for continuous learning, we'd love to hear from you.

Top Skills

Ansible

AWS

Azure

Bash

Ceph

GCP

GitOps

GPUDirect

InfiniBand

Kubernetes

Linux

Python

RDMA

Terraform

The Company

Santa Clara, CA

21 Employees

Year Founded: 2023

What We Do

We are transforming how stories are told, knowledge is learned, and insights are gathered.
