Senior ML Infrastructure Engineer

Palo Alto, CA
In-Office
Artificial Intelligence • Healthtech
About Us

Hippocratic AI is developing the first safety-focused Large Language Model (LLM) for healthcare. Our mission is to dramatically improve healthcare accessibility and outcomes by bringing deep healthcare expertise to every person. No other technology has the potential for this level of global impact on health.

Why Join Our Team
  • Innovative mission: We are creating a safe, healthcare-focused LLM that can transform health outcomes on a global scale.

  • Visionary leadership: Hippocratic AI was co-founded by CEO Munjal Shah alongside physicians, hospital administrators, healthcare professionals, and AI researchers from top institutions including El Camino Health, Johns Hopkins, Washington University in St. Louis, Stanford, Google, Meta, Microsoft, and NVIDIA.

  • Strategic investors: We have raised a total of $278 million in funding, backed by top investors such as Andreessen Horowitz, General Catalyst, Kleiner Perkins, NVIDIA’s NVentures, Premji Invest, SV Angel, and six health systems.

  • Team and expertise: We are working with top experts in healthcare and artificial intelligence to ensure the safety and efficacy of our technology.

For more information, visit www.HippocraticAI.com.

We value in-person teamwork and believe the best ideas happen together. Our team is expected to be in the office five days a week in Palo Alto, CA, unless explicitly noted otherwise in the job description.

The Role

We are seeking a Machine Learning Infrastructure Engineer to design, build, and manage the next-generation training and inference platform for LLMs. You will be at the heart of building scalable, efficient infrastructure that supports our researchers and engineers in training, serving, and experimenting with large models at scale. Your work will directly impact our ability to innovate with new architectures and training techniques in production environments.

Key Responsibilities
  • LLM Training Infrastructure: Design and operate large-scale training clusters using Kubernetes and/or Slurm for LLM experimentation, fine-tuning, and RLHF workflows.

  • Cluster & GPU Management: Own scheduling, autoscaling, resource allocation, and monitoring across high-performance GPU clusters (NVIDIA, AMD).

  • Distributed Systems: Build and optimize distributed data pipelines using frameworks like Ray, enabling parallel training and inference jobs.

  • Inference Optimization: Benchmark and optimize model serving performance with technologies like vLLM, and support autoscaling of inference workloads in production environments.

  • Platform Reliability: Collaborate with infra and platform engineers to ensure system robustness, observability, and maintainability of ML workloads.

  • Research Enablement: Partner closely with ML researchers to enable rapid experimentation through flexible and efficient infrastructure tooling.

Preferred Qualifications
  • 5+ years of experience in infrastructure, MLOps, or systems engineering, ideally with time spent in architect or staff-level roles.

  • Proven experience managing large-scale Kubernetes or Slurm clusters for training or serving ML workloads.

  • Strong proficiency in Python; familiarity with Go or Rust is a plus.

  • Hands-on experience with Ray, vLLM, Hugging Face Transformers, and/or custom LLM training stacks.

  • Deep understanding of GPU scheduling, container orchestration, and workload optimization across heterogeneous hardware.

  • Experience with inference workloads, benchmarking, latency optimization, and cost-performance tradeoffs.

  • Familiarity with Reinforcement Learning, particularly RLHF frameworks, is a strong plus.

  • Contributions to internal platforms that enabled others to train or fine-tune LLMs efficiently.

Bonus Skills
  • Exposure to multiple hardware platforms (e.g., H100s, A100s, MI300X).

  • Experience managing storage, IOPS performance, and object-store integration for ML data.

  • Familiarity with building observability into ML pipelines (e.g., Prometheus, Grafana, Datadog).

  • Ability to present infra systems/platforms to technical stakeholders.

Be aware of recruitment scams impersonating Hippocratic AI. All recruiting communication will come from @hippocraticai.com email addresses. We will never request payment or sensitive personal information during the hiring process. If anything appears suspicious, stop engaging immediately and report the incident.
