Software Engineer - Cloud Infrastructure

Sorry, this job was removed at 06:42 p.m. (CST) on Thursday, Mar 26, 2026
Mountain View, CA, USA
Hybrid
145K-215K Annually
Artificial Intelligence • Cloud • Machine Learning • Software • Database
The Role
About Kumo.ai

Kumo is building the infrastructure layer for the next generation of enterprise AI — a platform that lets organizations turn their data into predictive intelligence instantly, without the heavy lifting of traditional ML pipelines. We have also built our own Relational Foundation Model that can provide predictions in seconds – no training, straight to business value!

Join a dynamic, rapidly expanding team of innovators from top-tier organizations such as Airbnb, LinkedIn, Pinterest, and Stanford, backed by Sequoia Capital. We're on the front lines of AI, solving some of its most challenging and impactful problems, and we've already delivered more than $500M in tangible value to industry giants like Reddit, DoorDash, and Databricks. If you thrive in a fast-paced environment, are driven by ambitious goals, and crave an opportunity for massive impact, this is your chance to shape the future of AI.

The Opportunity

Kumo’s platform runs thousands of predictive workloads across multi-tenant Kubernetes clusters that form the backbone of our AI stack. As a Cloud Infrastructure Engineer, you’ll own, scale, and optimize that platform, from real-time inference to large-scale training, with real production impact. You’ll make high-leverage architectural decisions, ship quickly, and collaborate across ML, product, and engineering teams to expand our multi-cloud capabilities. Expect to move fast, iterate often, and see your changes land in production within days, not quarters.

What You’ll Do

  • Design, build, and evolve Kumo’s multi-tenant infrastructure to support massive AI and data workloads across AWS, Azure, and GCP.
  • Implement and maintain infrastructure-as-code to automate training and deployment pipelines across many environments.
  • Operate and scale Kubernetes clusters with a focus on reliability, performance, availability, tenant isolation, and cost efficiency.
  • Build observability and alerting into distributed systems using Prometheus, Grafana, OpenTelemetry, and related tooling.
  • Partner closely with ML researchers and product teams to deliver production-grade infrastructure for advanced AI workloads.
  • Drive security and operational best-practices (RBAC, tenant isolation, cloud identity, etc.) across our platform.

What You Bring

  • 3–5 years building or operating cloud-native infrastructure in production.
  • Hands-on experience with at least one major cloud (AWS / Azure / GCP); multi-cloud exposure is a plus.
  • Operational experience with Kubernetes and production-grade clusters.
  • Proficiency with Infrastructure-as-Code (Terraform, Pulumi, etc.) and familiarity with GitOps tooling (ArgoCD, Flux, Argo Workflows).
  • Strong debugging, systems-thinking, and communication skills — you can drive technical decisions and explain trade-offs to multiple stakeholders.

Nice to Have

  • Experience operating multi-tenant Kubernetes for data / AI workloads.
  • Experience with (managed) Spark or large-scale data processing systems.
  • Familiarity with Kubernetes operators, controllers, and custom resources.
  • Deep experience with monitoring/tracing/logging stacks (Prometheus, OpenTelemetry, etc.).

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

The Company
HQ: Mountain View, CA
38 Employees
Year Founded: 2021

What We Do

Democratizing AI on the Modern Data Stack! The team behind PyG (PyG.org) is working on a turn-key solution for AI over large-scale data warehouses. We believe the future of ML is a seamless integration between modern cloud data warehouses and AI algorithms. Our ML infrastructure massively simplifies the training and deployment of ML models on complex data. With over 40,000 monthly downloads and nearly 13,000 GitHub stars, PyG is the ultimate platform for training and development of Graph Neural Network (GNN) architectures. GNNs, currently one of the hottest areas of machine learning, are a class of deep learning models that generalize Transformer and CNN architectures and enable us to apply the power of deep learning to complex data. GNNs are unique in the sense that they can be applied to data of different shapes and modalities.
