Kumo

Software Engineer Lead - Cloud Engineering

Mountain View, CA

Hybrid

175K-250K Annually

Expert/Leader

Artificial Intelligence • Cloud • Machine Learning • Software • Database

The Role

Architect and operate scalable Kubernetes infrastructure for AI workloads, manage multi-cloud deployments, automate processes, and enhance system reliability.

About Kumo.ai

Kumo is building a next-generation AI platform that empowers organizations to make predictive decisions faster—without the overhead of traditional ML pipelines. Backed by Sequoia and led by ex-Airbnb, Pinterest, and LinkedIn leaders, we’re scaling rapidly and looking for a multi-cloud infrastructure leader to architect and run the backbone of our AI platform.

This is one of our most critical hires: your work will directly power the models and applications our customers rely on every day. If you’re passionate about multi-cloud infrastructure, Kubernetes at scale, and building the infrastructure that powers the next generation of AI applications, we’d love to talk.

Why Kumo.ai?

Work alongside world-class engineers & scientists (ex-Airbnb, Pinterest, LinkedIn, Stanford).
Be a foundational voice in designing a platform powering enterprise-scale AI.
Competitive Series B compensation package (salary + meaningful equity).

The Opportunity

The Cloud Infrastructure team is responsible for managing and scaling our Kubernetes-based, multi-cloud AI platform across AWS, Azure, and GCP.

You will own the architecture, scalability, security, and operational excellence of this platform, building the foundation that supports massive multi-tenant clusters running Big Data and AI/ML workloads.
Lead our multi-cloud expansion beyond AWS into Azure and GCP.
Drive the design and implementation of Kubernetes controllers, operators, and automation for scaling and reliability.
Implement Infrastructure as Code (Terraform, Pulumi, Crossplane) and GitOps practices to deliver commit-to-production automation at scale.
Partner closely with ML scientists, product engineers, and leadership to deliver self-service tooling and optimize infrastructure for machine learning workloads.
You will be joining early enough to shape the architecture, culture, and processes that define our platform reliability and engineering velocity.

What You’ll Do

Architect & operate multi-cloud infrastructure (AWS, Azure, GCP) to support large-scale AI workloads.
Design, build and scale Kubernetes clusters (EKS, AKS, GKE, Open Source) for high availability, performance, and cost efficiency.
Build and maintain Kubernetes controllers, operators, and automation for cluster lifecycle management, scaling, and workload scheduling.
Implement observability at scale — metrics, logging, tracing — using tools like Prometheus, Grafana, and OpenTelemetry.
Lead IaC and GitOps automation, ensuring consistent, repeatable provisioning and deployment workflows.
Drive security and compliance policies (RBAC, tenant isolation, SOC2/GDPR readiness) into platform design.
Partner with internal teams to enable self-service cloud resources and smooth commit-to-production pipelines.

What You Bring

8+ years building and operating cloud-native infrastructure in production.
Proven multi-cloud experience — designing and running workloads across AWS, Azure, and GCP.
Kubernetes expertise — 5+ years managing production clusters, with strong understanding of internals (schedulers, controllers, operators, CNI networking, security).
Infrastructure-as-Code mastery — Terraform, Pulumi, Crossplane, or similar.
GitOps and workflow automation experience (ArgoCD, Flux, Argo Workflows, or similar).
Strong skills in monitoring and performance tuning for distributed systems.
Proficiency in Go, Python, or Rust for automation tooling.

Nice to Have

Experience optimizing, scaling, and maintaining multi-tenant AI/ML clusters across multiple cloud environments while ensuring high availability and performance.
Familiarity with compliance standards (SOC2, ISO27001, GDPR).
Contributions to open-source cloud-native projects.
Experience building customer-facing APIs or developer tooling.

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Top Skills

Ansible

Argo

AWS

Azure

Bash

Calico

CloudFormation

Docker

Envoy

Flux

GCP

Grafana

Istio

Jenkins

Kubernetes

Make

Prometheus

Python

Rust

Terraform

Tigera

Traefik

The Company

HQ: Mountain View, CA

38 Employees

Year Founded: 2021

What We Do

Democratizing AI on the Modern Data Stack!

The team behind PyG (PyG.org) is working on a turn-key solution for AI over large-scale data warehouses. We believe the future of ML is a seamless integration between modern cloud data warehouses and AI algorithms. Our ML infrastructure massively simplifies the training and deployment of ML models on complex data.

With over 40,000 monthly downloads and nearly 13,000 GitHub stars, PyG is the ultimate platform for training and development of Graph Neural Network (GNN) architectures. GNNs, one of the hottest areas of machine learning today, are a class of deep learning models that generalize Transformer and CNN architectures and enable us to apply the power of deep learning to complex data. GNNs are unique in the sense that they can be applied to data of different shapes and modalities.