MLOps Engineer

Ho Chi Minh City, Ho Chi Minh, VNM
Hybrid
Expert/Leader
Artificial Intelligence • Robotics
The Role
The MLOps Engineer will manage PyTorch training and inference pipelines, optimize workloads, troubleshoot issues, and integrate monitoring into AI services.

Job Title:
MLOps Engineer (PyTorch, Systems & Training Pipeline)
About the Role
As an MLOps Engineer, you will own and evolve the infrastructure behind our PyTorch-based training and inference workloads. You will work at the intersection of deep learning, systems programming, and infrastructure engineering, building pipelines that are robust, reproducible, and made to last. This role spans training infrastructure, inference serving, and platform reliability, and is ideal for someone who cares not just about getting models trained, but about doing it right.
Key Responsibilities

  • Build and maintain training and inference pipelines using PyTorch, including support for DDP, mixed precision, checkpointing, experiment versioning, and reproducible evaluation workflows.
  • Own and evolve inference serving infrastructure using vLLM and SGLang, including debugging inference stack components such as tool-call parsers and reasoning parsers, and optimizing for throughput and latency.
  • Write and maintain robust tooling in Python and C++ to support the full training lifecycle, from data ingestion to model release.
  • Optimize compute workloads for bare-metal environments, covering CPU/GPU utilization, memory bandwidth, and I/O throughput.
  • Troubleshoot low-level networking issues, distributed training errors, and hardware bottlenecks across NCCL, MPI, and high-speed interconnects such as InfiniBand and RoCE.
  • Set up and manage ML environments including containers, package management, GPU drivers, and runtime configurations.
  • Establish CI/CD patterns for AI workloads covering training, evaluation, quantization, and model release workflows.
  • Integrate monitoring, alerting, anomaly detection, and incident response for both training jobs and inference services.
  • Contribute to shared platform capabilities across reliability, observability, and cost management.
  • Build and maintain scalable runtime infrastructure for model-backed services and APIs, including support for LLM-backed APIs, MCP (Model Context Protocol) servers, and agentic systems.

You Should Have

  • Deep expertise in PyTorch internals, including DDP, FSDP, mixed precision training, TorchScript, and torch.compile.
  • Strong programming skills in Python and C++, with the ability to read and safely modify unfamiliar codebases.
  • Solid computer science fundamentals covering data structures, concurrency, operating systems, and memory management.
  • Hands-on experience with vLLM and SGLang for production inference serving, including serving models quantized to formats such as FP8, INT8, and NVFP4.
  • Experience with RLHF and PPO training pipelines, including frameworks such as veRL and TRL, and reward model integration.
  • Strong understanding of distributed training setups, networking, and interconnects including NCCL, MPI, InfiniBand, and RoCE.
  • Experience debugging and tuning bare-metal Linux servers, including kernel parameters, NUMA topology, and GPU driver configuration.
  • Familiarity with job schedulers such as Airflow and experience operating production-grade distributed infrastructure.
  • Strong grasp of containerized and cloud-native environments including Docker and Kubernetes.

Nice to Have

  • Experience with ML compiler stacks such as LLVM, MLIR, TensorRT, or XLA.
  • Familiarity with model quantization techniques and deployment optimization, including GPTQ, AWQ, and bitsandbytes.
  • Contributions to open source ML projects, including PyTorch, vLLM, SGLang, or related inference and training tooling.
  • Experience with infrastructure-as-code tools such as Ansible, Terraform, or Nix for reproducible cluster setup.
  • Experience with custom or on-premise deployments, local clusters, or edge inference.
  • Familiarity with observability stacks such as Prometheus, Grafana, or OpenTelemetry applied to training and inference workloads.
  • Experience building infrastructure for agentic systems including secure tool access, orchestration, and isolation boundaries.
  • Passion for clean, well-documented code and detail-oriented engineering.

Top Skills

Airflow
C++
DDP
Docker
FSDP
Kubernetes
Python
PyTorch
SGLang
TorchScript
vLLM
The Company
64 Employees
Year Founded: 2023

What We Do

Menlo Research is an open AI & Robots lab. We build the brains for robots. It’s time to tell robots what to do!
