Machine Learning - Infrastructure

San Francisco, CA, USA
In-Office
Mid level
Artificial Intelligence • Machine Learning • Software • Analytics
The Role
Design, deploy, and maintain distributed ML training clusters. Develop scalable pipelines for managing datasets and training, and optimize GPU performance.

Our mission is general causal intelligence: AI that is capable of (1) predicting the future and (2) identifying the optimal actions to change that future.

To achieve this breakthrough, we are building a Large Physics foundation Model (LPM), because domains governed by physics have inherent cause-and-effect relationships, unlike visual or textual data.

Weather is the ideal training ground for an LPM. It is the most well-observed physical system, offering rapid, objective ground truth feedback from sensory observations and data at a scale that dwarfs what is used to train today’s LLMs.

Causal Labs is a team of researchers and engineers from self-driving, drug discovery, and robotics (including Google DeepMind, Cruise, Waymo, Insitro, and Nabla Bio) who believe general causal intelligence will be the most important technical breakthrough for civilization.

We look for infrastructure engineers who are excited to tackle unsolved problems.

Our training and inference challenges demand deep expertise in setting up distributed training clusters and optimizing performance for large models. If you have experience building large-scale ML infrastructure in related fields such as language and vision models, robotics, or biology, join us on this mission.

Responsibilities

  • Design, deploy, and maintain large distributed ML training and inference clusters

  • Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire ML lifecycle

  • Research and test various training approaches including parallelization techniques and numerical precision trade-offs across different model scales

  • Analyze, profile, and debug low-level GPU operations to optimize performance

  • Stay up to date on research and bring new ideas into our work

What we’re looking for

We value a relentless approach to problem-solving, rapid execution, and the ability to quickly learn in unfamiliar domains.

  • Strong grasp of state-of-the-art techniques for optimizing training and inference workloads

  • Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) to train large foundation models

  • Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML/AI service offerings

  • Familiarity with containerization and orchestration frameworks (e.g., Kubernetes, Docker)

  • Background working on distributed task management systems and scalable model serving & deployment architectures

  • Understanding of monitoring, logging, observability, and version control best practices for ML systems

You don’t have to meet every single requirement above.

Skills Required

  • Proficiency with distributed training frameworks
  • Knowledge of cloud platforms and their ML/AI services
  • Familiarity with containerization and orchestration frameworks
  • Experience in distributed task management systems
  • Understanding of monitoring and logging best practices

The Company
0 Employees
Year Founded: 2024

What We Do

Causal Labs is building a Large Physics foundation Model (LPM) to achieve general causal intelligence, enabling AI to predict the future and identify optimal actions by learning causality through physics and weather.
