Principal Engineer, System Software Platform Engineering

2 Locations
In-Office
Expert/Leader
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
The Role
The role involves leading the architecture and operation of AI inference platforms, focusing on GPU operations, security, and reliability in cloud environments. Responsibilities include managing large-scale distributed systems, improving observability, and mentoring engineers.
Summary Generated by Built In

NVIDIA Vietnam R&D Center is an integral part of NVIDIA's global network of world-class engineers and researchers. To help push the boundaries of accelerated computing, we’re seeking a hands-on technical leader to architect, build, and operate a platform for AI inference and agentic applications. You’ll focus on heterogeneous compute (with a strong GPU emphasis), reliability, security, and developer experience across cloud and hybrid environments.

What you will do:

  • Build and operate the platform for AI: multi-tenant services, identity/policy, configuration, quotas, cost controls, and paved paths for teams.

  • Lead inference platforms at scale, including model-serving routing, autoscaling, rollout safety (canary/A-B), reliability, and end-to-end observability.

  • Operate GPUs in Kubernetes: own the NVIDIA device plugin, GPU Feature Discovery, time-slicing, MPS, and MIG partitioning; implement topology-aware scheduling and bin-packing.

  • Lead the GPU lifecycle: driver, firmware, and runtime (CUDA, cuDNN, NCCL) updates via the NVIDIA GPU Operator; ensure kernel/RHEL/Ubuntu compatibility and safe rollouts.

  • Enable virtualization strategies: vGPU (e.g., on vSphere/KVM), PCIe passthrough, mediated devices, and pool-based GPU sharing; define placement, isolation, and preemption policies.

  • Build secure traffic and networking: API gateways, service mesh, rate limiting, authN/authZ, multi-region routing, and DR/failover.

  • Improve observability and operations: metrics, tracing, and logging (including DCGM GPU telemetry), runbooks, incident response, performance tuning, and cost optimization.

  • Establish platform blueprints: reusable templates, SDKs/CLIs, golden CI/CD pipelines, and infrastructure-as-code standards.

  • Lead through influence: write design docs, conduct reviews, mentor engineers, and shape platform roadmaps aligned to AI product needs.

What we need to see:

  • 15+ years building/operating large-scale distributed systems or platform infrastructure; strong record of shipping production services.

  • Proficiency in one or more of Python/Go/Java/C++; deep understanding of concurrency, networking, and systems design.

  • Containers/orchestration/Kubernetes expertise, cloud networking/storage/IAM, and infrastructure-as-code.

  • Practical GPU platform experience: Kubernetes GPU operations (device plugin, GPU Operator, feature discovery), scheduling/bin-packing, isolation, preemption, utilization tuning.

  • Virtualization background: deploying and operating vGPU, PCIe passthrough, and/or mediated devices in production.

  • SRE or equivalent experience: SLOs/error budgets, incident management, performance tuning, resource management, and financial oversight.

  • Security-first mentality: TLS/mTLS, RBAC, secrets, policy-as-code, and secure multi-tenant architectures.

Ways to stand out from the crowd:

  • Deep GPU ops: MIG partitioning, MPS sharing, NUMA/topology awareness, DCGM telemetry, GPUDirect RDMA/Storage.

  • Inference platform exposure: serving runtimes, caching/batching, autoscaling patterns, continuous delivery (agnostic to specific stacks).

  • Agentic platform exposure: workflow engines, tool orchestration, policy/guardrails for tool access and data boundaries.

  • Traffic/data plane: gRPC/HTTP/Protobuf performance, service mesh, API gateways, CDN/caching, global traffic management.

  • Tooling: Terraform/Helm/GitOps, Prometheus/Grafana/OpenTelemetry, policy engines; bare-metal provisioning experience is a plus.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Top Skills

C++
CUDA
cuDNN
Docker
Go
Grafana
Helm
Java
Kubernetes
NCCL
OpenTelemetry
PCIe
Prometheus
Python
Terraform
vGPU
The Company
HQ: Santa Clara, CA
21,960 Employees
Year Founded: 1993

What We Do

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, NVIDIA is increasingly known as “the AI computing company.”
