Software Engineer - Cloud Infrastructure

Reposted 6 Days Ago
Mountain View, CA
Hybrid
145K-215K Annually
Mid level
Artificial Intelligence • Cloud • Machine Learning • Software • Database
The Role
As a Cloud Infrastructure Engineer, you will manage and optimize Kubernetes clusters across multiple cloud platforms while enhancing automation and reliability for AI applications.
About Kumo.ai

Kumo is building the infrastructure layer for the next generation of enterprise AI — a platform that lets organizations turn their data into predictive intelligence instantly, without the heavy lifting of traditional ML pipelines. We have also built our own Relational Foundation Model that can provide predictions in seconds – no training, straight to business value!

Join a dynamic, rapidly expanding team of innovators from top-tier companies like Airbnb, LinkedIn, Pinterest, and Stanford, supported by the renowned Sequoia Capital. We're on the front lines of AI, solving some of its most challenging and impactful problems, and we've already delivered over $500M in tangible value to industry giants like Reddit, DoorDash, and Databricks. If you thrive in a fast-paced environment, are driven by ambitious goals, and crave an opportunity for massive impact, this is your chance to shape the future of AI.

The Opportunity

Kumo’s platform runs thousands of predictive workloads across multi-tenant Kubernetes clusters that form the backbone of our AI stack. As a Cloud Infrastructure Engineer, you’ll own, scale, and optimize that platform, from real-time inference to large-scale training, with real production impact. You’ll make high-leverage architectural decisions, ship quickly, and collaborate across ML, product, and engineering teams to expand our multi-cloud capabilities. Expect to move fast, iterate often, and see your changes land in production within days, not quarters.

What You’ll Do

  • Design, build, and evolve Kumo’s multi-tenant infrastructure to support massive AI and data workloads across AWS, Azure, and GCP.
  • Implement and maintain infrastructure-as-code to automate training and deployment pipelines across many environments.
  • Operate and scale Kubernetes clusters with a focus on reliability, performance, availability, tenant isolation, and cost efficiency.
  • Build observability and alerting into distributed systems using Prometheus, Grafana, OpenTelemetry, and related tooling.
  • Partner closely with ML researchers and product teams to deliver production-grade infrastructure for advanced AI workloads.
  • Drive security and operational best practices (RBAC, tenant isolation, cloud identity, etc.) across our platform.

What You Bring

  • 3–5 years building or operating cloud-native infrastructure in production.
  • Hands-on experience with at least one major cloud (AWS / Azure / GCP); multi-cloud exposure is a plus.
  • Operational experience with Kubernetes and production-grade clusters.
  • Proficiency with Infrastructure-as-Code (Terraform, Pulumi, etc.) and familiarity with GitOps tooling (ArgoCD, Flux, Argo Workflows).
  • Strong debugging, systems-thinking, and communication skills — you can drive technical decisions and explain trade-offs to multiple stakeholders.

Nice to Have

  • Experience operating multi-tenant Kubernetes for data / AI workloads.
  • Experience with (managed) Spark or large-scale data processing systems.
  • Familiarity with Kubernetes operators, controllers, and custom resources.
  • Deep experience with monitoring/tracing/logging stacks (Prometheus, OpenTelemetry, etc.).

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Top Skills

Argo CD
AWS
Azure
Bash
Crossplane
Flux
GCP
Go
Grafana
Kubernetes
OpenTelemetry
Prometheus
Pulumi
Python
Terraform

The Company
HQ: Mountain View, CA
38 Employees
Year Founded: 2021

What We Do

Democratizing AI on the Modern Data Stack!

The team behind PyG (PyG.org) is working on a turn-key solution for AI over large scale data warehouses. We believe the future of ML is a seamless integration between modern cloud data warehouses and AI algorithms. Our ML infrastructure massively simplifies the training and deployment of ML models on complex data.

With over 40,000 monthly downloads and nearly 13,000 GitHub stars, PyG is the ultimate platform for training and development of Graph Neural Network (GNN) architectures. GNNs -- one of the hottest areas of machine learning now -- are a class of deep learning models that generalize Transformer and CNN architectures and enable us to apply the power of deep learning to complex data. GNNs are unique in the sense that they can be applied to data of different shapes and modalities.
