GPU Cloud Platform Engineer

Remote
Senior level
Artificial Intelligence • Information Technology • Software
The Role
The GPU Cloud Platform Engineer designs and operates multi-cluster GPU infrastructure for AI workloads, ensuring performance and efficiency across cloud environments.

Location: Remote (Global)

Type: Full-time

Company: Yotta Labs

Apply: [email protected]

🧠 About Yotta Labs

Yotta Labs is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at a planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for AI training and inference on a wide spectrum of hardware—from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating elastic and efficient AI development.

🛠️ Role Overview

We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will be responsible for ensuring high availability, performance, and efficiency of containerized AI workloads—ranging from LLMs to generative models—deployed in Kubernetes-based GPU clusters. If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.

🎯 Responsibilities

  • Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot production issues.

  • Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.

  • Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.

  • Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems. Define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization.

  • Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and failover. Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alerting.

  • Coordinate with IDC providers for planning and deploying large-scale GPU clusters, networks, and storage infrastructure to support internal cloud platforms and external customer needs.

Qualifications

  • Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or related fields; 3+ years of experience in system engineering or DevOps.

  • 5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.

  • Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl and Helm, and expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.

  • Proficient in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.

  • Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.

  • Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.

  • Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.

  • Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.

  • Understanding of high-performance interconnects and communication protocols such as InfiniBand (IB), RoCE, NVLink, and PCIe.

  • Strong communication skills, self-motivation, and team collaboration.

🌟 Preferred Experience

  • Experience in developing and operating MaaS platforms or large-scale model inference clusters. Proven track record of leading multi-cluster system development or performance optimization projects.

  • Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs such as the NVIDIA H100.

  • Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.

  • Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.

  • Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks like Triton, vLLM, and SGLang; ability to perform secondary development and optimization based on open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.

🌐 Why Join Yotta Labs?

  • Be part of a visionary team aiming to redefine AI infrastructure.

  • Work on cutting-edge technologies that bridge AI and decentralized computing.

  • Collaborate with experts from leading institutions and tech companies.

  • Enjoy a flexible, remote work environment that values innovation and autonomy.

📩 How to Apply

Interested candidates should apply directly or send their resume and a brief cover letter to [email protected]. Please include links to any relevant projects or contributions.

Top Skills

AWS
Azure
CUDA
Docker
GCP
Go
Grafana
Kubernetes
Prometheus
Python

The Company
HQ: Seattle, WA
16 Employees
Year Founded: 2024

What We Do

Yotta Labs is building a protocol that serves as the Decentralized OS for AI workload orchestration at planetary scale. The Decentralized Operating System (DeOS) from Yotta is designed to maximize the utilization of available resources by optimizing LLM training and inference flows and efficiently scheduling AI workloads across decentralized networks of geo-distributed GPUs worldwide, pushing the aggregated processing limit toward yottascale. (Yottascale is 1,000,000 times exascale, which is the current limit of the fastest supercomputers in the world.)

Founded by a team of industry and academic experts in AI and high-performance computing (HPC), the Yotta Labs team has a proven track record of delivering exceptional work. Through approaches invented by the team to optimize resource orchestration and intra-/inter-node communication, we strive to unlock the maximum potential of decentralized AI.

For more information about Yotta Labs, please refer to our whitepaper: https://yottalabs.ai/whitepaper
