At Mind Robotics, we’re building generalized physical AI—robotic systems capable of dexterous, adaptive, and reasoning-intensive work in real-world industrial environments. Our ability to iterate quickly on large-scale models depends on world-class ML infrastructure.
We’re looking for a Machine Learning Infrastructure Engineer to build the core systems that enable fast, reliable, and scalable model training—powering everything from experimentation to production deployment.
Responsibilities
Design and implement scalable systems for training large ML models
Enable efficient workflows for data ingestion, training, and iteration
Develop and optimize distributed training systems across hundreds of GPUs
Implement strategies for parallelization, sharding, and efficient compute utilization
Improve training efficiency through techniques such as attention optimizations, kernel fusion, and memory management
Partner closely with modeling teams to accelerate iteration speed and reduce training costs
Build internal tools for experiment tracking, monitoring, and debugging
Implement systems for tracking training performance, failures, and resource utilization
Debug and resolve bottlenecks across the training stack
Provide lightweight infrastructure support for deploying and running models in production environments
Optimize inference performance and reliability where needed
Support core cloud infrastructure needs for training workloads (without heavy DevOps overhead)
Manage compute resources efficiently across training jobs
Qualifications
Strong experience building infrastructure for large-scale ML training
Deep understanding of how modern LLM/VLM systems are trained and scaled
Proven experience setting up and scaling distributed training across hundreds of GPUs
Strong understanding of parallelization strategies (data, model, pipeline parallelism)
Strong proficiency in Python programming
Expert-level proficiency in PyTorch and/or JAX
Strong understanding of techniques like attention optimization, kernel fusion, and efficient memory usage
Experience supporting inference systems in production
Familiarity with robotics or embodied AI workloads
Experience building tools for experiment management and researcher productivity
Skills Required
- Experience with PyTorch or JAX
- Knowledge of distributed training and core ML infrastructure
- Ability to work with hundreds of GPUs
What We Do
Mind Robotics builds intelligent, AI-driven robotic systems for industrial deployment, focusing on creating collaborative platforms for manufacturing environments.