Senior Distributed Systems Engineer

Sorry, this job was removed at 10:06 p.m. (CST) on Monday, Feb 03, 2025
Palo Alto, CA
In-Office
Information Technology
The Role

We are seeking highly skilled engineers with expertise in machine learning, distributed systems, and high-performance computing to join our Research team. In this role, you will collaborate closely with researchers to build and optimize platforms that train next-generation foundation models on massive GPU clusters. Your work will play a critical role in advancing the efficiency and scalability of cutting-edge generative AI technologies.

Key Responsibilities

  • Scale and optimize systems for training large-scale models across multi-thousand GPU clusters.
  • Profile and enhance the performance of training codebases to achieve best-in-class hardware efficiency.
  • Develop systems to distribute workloads efficiently across massive GPU clusters.
  • Design and implement robust solutions to enable model training in the presence of hardware failures.
  • Build tools to diagnose issues, visualize processes, and evaluate datasets at scale.
  • Optimize and deploy inference workloads for throughput and latency across the entire stack, including data processing, model inference, and parallel processing.
  • Implement and improve high-performance CUDA, Triton, and PyTorch code to address efficiency bottlenecks in memory, speed, and utilization.
  • Collaborate with researchers to ensure systems are designed with optimal efficiency from the ground up.
  • Prototype cutting-edge applications using multimodal generative AI.

Qualifications

  • Experience:
    • 3+ years of professional experience in ML pipelines, distributed systems, or high-performance computing.
    • Hands-on experience training large models using Python and PyTorch, with familiarity with the full pipeline: data processing, loading, training, and inference.
    • Proven expertise in optimizing and deploying inference workloads, with experience in profiling GPU/CPU code (e.g., NVIDIA Nsight).
    • Deep understanding of distributed systems and distributed training techniques, such as DDP, FSDP, and tensor parallelism.
    • Strong experience writing high-performance parallel C++ and custom PyTorch kernels, with knowledge of CUDA and Triton optimization techniques.
    • Bonus: Experience with generative models (e.g., Transformers, Diffusion Models, GANs) and prototype development (e.g., Gradio, Docker).
  • Technical Skills:
    • Proficiency in Python, with significant experience using PyTorch.
    • Advanced skills in CUDA/Triton programming, including custom kernel development and tensor core optimization.
    • Strong generalist software engineering skills and familiarity with distributed and parallel computing systems.

Note: This position is not intended for recent graduates.

Compensation

The salary range for this role in California is $175,000–$250,000 per year. Actual compensation will depend on job-related knowledge, skills, experience, and candidate location. We also offer competitive equity packages in the form of stock options and a comprehensive benefits plan.

The Company
29 Employees
Year Founded: 2023

What We Do

An idea-to-video platform that brings your creativity to motion
