Senior Software Engineer, AI Training & Infrastructure

San Mateo, CA
In-Office
200K-300K Annually
Senior level
Artificial Intelligence • Robotics • Business Intelligence
The Role
The Senior Software Engineer will build and scale training infrastructure, optimize ML performance, and develop tools for robotics applications.
Company Overview

At Skild AI, we are building the world's first general-purpose robotic intelligence, one that is robust and adapts to unseen scenarios without failing. We believe that data-driven machine learning at massive scale is the key to unlocking these capabilities for the widespread deployment of robots within society. Our team consists of individuals with varying levels of experience and backgrounds, from new graduates to domain experts. Relevant industry experience is important, but ultimately less so than your demonstrated abilities and attitude. We are looking for passionate individuals who are eager to explore uncharted waters and contribute to our innovative projects.

Position Overview

Skild AI, Inc. seeks a Senior Software Engineer, AI Training & Infrastructure in San Mateo, CA, responsible for building and scaling the training infrastructure and tools that support the full ML lifecycle (data preparation, training orchestration, evaluation, and deployment) for real-world robotics applications. The role covers performance, reliability, observability, and developer productivity across distributed training systems, as well as data processing for multimodal datasets, performance tuning of training jobs, and media processing/compression (e.g., ffmpeg).

Responsibilities
  • Architecting, building, and maintaining distributed training pipelines and frameworks spanning data ingest/preprocessing, large-scale training, and evaluation.
  • Optimizing training performance and resource utilization by identifying bottlenecks and implementing improvements in data loading, I/O, caching, sharding, and prefetching.
  • Integrating state-of-the-art ML techniques into production training systems in collaboration with research/ML teams.
  • Implementing monitoring, logging, alerting, automated testing, and CI/CD for reliable training operations.
  • Developing developer tooling and documentation, including dashboards and utilities, to streamline experimentation at scale and improve engineer productivity.
Minimum Requirements
  • Must have a master’s degree (or foreign equivalent) in Computer Science, Robotics, Engineering, or a related field and two (2) years of experience in machine learning infrastructure. Experience can be concurrent.
  • Must also have two (2) years of experience designing and operating distributed training pipelines at scale, including data preprocessing, orchestration, and evaluation. Experience can be concurrent.
  • Must have any experience with each of the following: (i) Python or C++ and at least one deep learning library (e.g., PyTorch, TensorFlow, or JAX); and (ii) CI/CD and automated testing for ML/infra services. Experience can be concurrent.
  • Must have knowledge of: (i) optimizing data loading and I/O for deep learning workloads (e.g., PyTorch DataLoader, sharding, prefetching, or caching); (ii) processing multimodal datasets and formats (e.g., HDF5, TFRecord, Parquet, or equivalent) and image processing/compression (e.g., OpenCV or ffmpeg); (iii) cloud-based training in AWS, Google Cloud, or Azure; (iv) implementing monitoring, logging, and alerting for training systems; (v) Linux OS fundamentals and operation at large scale; (vi) distributed systems and ML training techniques/models; and (vii) core software engineering principles, including algorithms, data structures, and system design. Experience can be concurrent.

Apply online at skild.ai/career.


Base Salary Range
$200,000-$300,000 USD

Skills Required

  • Master's degree in Computer Science, Robotics, Engineering, or a related field
  • Two years of experience in machine learning infrastructure
  • Two years of experience designing and operating distributed training pipelines
  • Experience with Python or C++ and at least one deep learning library
  • Knowledge of data loading optimization and I/O for deep learning
  • Knowledge of processing multimodal datasets and formats
  • Cloud-based training knowledge (AWS, Google Cloud, or Azure)
  • Knowledge of Linux OS fundamentals and operation at large scale
  • Knowledge of distributed systems and ML training techniques/models
The Company
Pittsburgh, Pennsylvania
24 Employees
Year Founded: 2023

What We Do

Building general-purpose robotic intelligence
