Software Engineer - Cloud Infrastructure

Reposted 24 Days Ago
Mountain View, CA
Hybrid
145K-215K Annually
Mid level
Artificial Intelligence • Cloud • Machine Learning • Software • Database
The Role
As a Cloud Infrastructure Engineer, you will manage and optimize Kubernetes clusters across multiple cloud platforms while enhancing automation and reliability for AI applications.
About Kumo.ai

Kumo is building the infrastructure layer for the next generation of enterprise AI — a platform that lets organizations turn their data into predictive intelligence instantly, without the heavy lifting of traditional ML pipelines. We have also built our own Relational Foundation Model that can provide predictions in seconds – no training, straight to business value!

Join a dynamic, rapidly expanding team of innovators from top-tier companies like Airbnb, LinkedIn, Pinterest, and Stanford, supported by the renowned Sequoia Capital. We're on the front lines of AI, solving some of its most challenging and impactful problems, and we've already delivered over $500M in tangible value to industry giants like Reddit, DoorDash, and Databricks. If you thrive in a fast-paced environment, are driven by ambitious goals, and crave an opportunity for massive impact, this is your chance to shape the future of AI.

The Opportunity

Kumo’s platform runs thousands of predictive workloads across multi-tenant Kubernetes clusters that form the backbone of our AI stack. As a Cloud Infrastructure Engineer, you’ll own, scale, and optimize that platform, from real-time inference to large-scale training, with real production impact. You’ll make high-leverage architectural decisions, ship quickly, and collaborate across ML, product, and engineering teams to expand our multi-cloud capabilities. Expect to move fast, iterate often, and see your changes land in production within days, not quarters.

What You’ll Do

  • Design, build, and evolve Kumo’s multi-tenant infrastructure to support massive AI and data workloads across AWS, Azure, and GCP.
  • Implement and maintain infrastructure-as-code to automate training and deployment pipelines across many environments.
  • Operate and scale Kubernetes clusters with a focus on reliability, performance, availability, tenant isolation, and cost efficiency.
  • Build observability and alerting into distributed systems using Prometheus, Grafana, OpenTelemetry, and related tooling.
  • Partner closely with ML researchers and product teams to deliver production-grade infrastructure for advanced AI workloads.
  • Drive security and operational best practices (RBAC, tenant isolation, cloud identity, etc.) across our platform.

What You Bring

  • 3–5 years building or operating cloud-native infrastructure in production.
  • Hands-on experience with at least one major cloud (AWS / Azure / GCP); multi-cloud exposure is a plus.
  • Operational experience with Kubernetes and production-grade clusters.
  • Proficiency with Infrastructure-as-Code (Terraform, Pulumi, etc.) and familiarity with GitOps tooling (ArgoCD, Flux, Argo Workflows).
  • Strong debugging, systems-thinking, and communication skills — you can drive technical decisions and explain trade-offs to multiple stakeholders.

Nice to Have

  • Experience operating multi-tenant Kubernetes for data / AI workloads.
  • Experience with (managed) Spark or large-scale data processing systems.
  • Familiarity with Kubernetes operators, controllers, and custom resources.
  • Deep experience with monitoring/tracing/logging stacks (Prometheus, OpenTelemetry, etc.).

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Top Skills

Argo CD
AWS
Azure
Bash
Crossplane
Flux
GCP
Go
Grafana
Kubernetes
OpenTelemetry
Prometheus
Pulumi
Python
Terraform
The Company
HQ: Mountain View, CA
38 Employees
Year Founded: 2021

What We Do

Democratizing AI on the Modern Data Stack!

The team behind PyG (PyG.org) is working on a turn-key solution for AI over large-scale data warehouses. We believe the future of ML is a seamless integration between modern cloud data warehouses and AI algorithms. Our ML infrastructure massively simplifies the training and deployment of ML models on complex data.

With over 40,000 monthly downloads and nearly 13,000 GitHub stars, PyG is the ultimate platform for training and development of Graph Neural Network (GNN) architectures. GNNs -- one of the hottest areas of machine learning now -- are a class of deep learning models that generalize Transformer and CNN architectures and enable us to apply the power of deep learning to complex data. GNNs are unique in the sense that they can be applied to data of different shapes and modalities.
