Machine Learning Engineer - ML Training Platform

Posted 3 Days Ago
2 Locations
In-Office
Senior level
Artificial Intelligence • Information Technology • Software
The Role
Design and implement robust, large-scale distributed ML training systems optimized for low-bandwidth, high-latency environments. Build model-parallel training strategies, checkpointing and recovery, GPU and memory optimizations, P2P networking, NAT traversal, and monitoring to ensure resilient, efficient multi-participant training.
Summary Generated by Built In
Overview

Pluralis Research carries out foundational research on Protocol Learning: multi-participant training of foundation models where no single participant has, or can ever obtain, a full copy of the model. The purpose of Protocol Learning is to facilitate the creation of community-trained and community-owned frontier models with self-sustaining economics.

We're looking for Senior/Staff engineers with 5+ years of experience in distributed systems and large-scale ML training. You'll be implementing a novel substrate for distributed ML model training that works over consumer-grade internet connections.

Responsibilities

Distributed Training Architecture & Optimization
  • Design and implement large-scale distributed training systems optimized for heterogeneous hardware operating under low-bandwidth, high-latency conditions.

  • Develop and optimize parallel training strategies (data, tensor, and pipeline parallelism) with custom sharding techniques that minimize communication overhead.

  • Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.

  • Implement robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs.

  • Build monitoring and metrics systems to track training progress, model quality, and system bottlenecks.

Decentralized Networking & Resilience
  • Architect resilient training systems where nodes can fail, networks can partition, and participants can dynamically join or leave.

  • Design and optimize peer-to-peer topologies for decentralized coordination across non-co-located nodes.

  • Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.

  • Profile and optimize communication patterns to reduce latency and bandwidth overhead in multi-participant environments.

What You’ll Bring
  • Strong experience building and operating distributed systems in production.

  • Hands-on expertise with distributed training frameworks (FSDP, DeepSpeed, Megatron, or similar).

  • Deep understanding of parallelism strategies (data, tensor, and pipeline parallelism).

  • Expert-level Python with production experience (concurrency, error handling, retry logic, clean architecture).

  • Strong networking fundamentals: P2P systems, gRPC, routing, NAT traversal, distributed coordination.

  • Experience optimizing GPU workloads, memory management, and large-scale compute efficiency.

What We Offer
  • Equity-heavy compensation with meaningful ownership in a mission-driven company

  • Competitive base salary for senior engineering roles in Australia

  • Visa sponsorship available for exceptional candidates

  • Remote-first with optional access to our Melbourne hub

  • World-class team: teammates were previously at Google, Amazon, Microsoft, and leading startups

Backed by Union Square Ventures and other tier-1 investors, we're a world-class, deeply technical team of ML researchers and engineers. Pluralis is unapologetically ideological. We believe the world will be a better place if we succeed in what we are attempting, and we see Protocol Learning as the only plausible approach to preventing a handful of massive corporations from monopolising model development, access and release, and from achieving massive economic capture. If this resonates, please apply.

Top Skills

Python, FSDP, DeepSpeed, Megatron, gRPC, P2P, NAT Traversal

The Company
15 Employees

What We Do

Pluralis is developing a protocol that facilitates collaborative training and ownership of foundation models.
