What You’ll Own
- Set the technical vision and roadmap for Kumo’s multi-tenant infrastructure across AWS, Azure, and GCP, balancing scalability, reliability, cost, and security.
- Lead architecture and design for critical systems: Kubernetes-based multi-tenancy, real-time inference clusters, training pipelines, and CI/CD for large ML workloads.
- Hands-on implementation: build and evolve IaC, GitOps flows, cluster autoscaling, and automation that reduce toil and accelerate developer productivity.
- Define and drive SLOs, SLIs, and capacity planning; lead incident response, postmortems, and systemic remediation.
- Own cost optimization at scale — from resource scheduling to spot/commit strategies and cross-cloud lifecycle management.
- Mentor and grow engineers: set standards for architecture reviews, design docs, code quality, and operational excellence.
- Hire and help scale the team — participate in recruiting, interviewing, and onboarding top-tier infrastructure talent.
What You Bring
- 5-8+ years building and operating production cloud-native infrastructure; proven track record leading infrastructure initiatives end-to-end.
- Deep, practical experience with Kubernetes at scale (multi-tenant environments, cluster federation, or large fleet operations).
- Strong multi-cloud operational experience (designing and running services across AWS/Azure/GCP) and cloud cost management.
- Demonstrated systems design skills for distributed systems: able to make architectural trade-offs, comfortable shipping code in a high-velocity environment (Python, Go, or similar), and experienced in reviewing complex PRs.
- Proficiency in Go, Python, Rust, or similar languages for automation tooling.
- Excellent communicator: able to influence across engineering, ML science, product, and leadership — and to write clear design docs and trade-off analyses.
Nice to Have
- Experience building infrastructure for ML/AI platforms or relational foundation models.
- Background with Spark or large-scale data processing platforms (managed or self-hosted).
- Familiarity with Kubernetes operators, controllers, CRDs, or service mesh patterns.
- Expertise with Infrastructure-as-Code (Terraform/Pulumi) and GitOps (ArgoCD, Flux, Argo Workflows) in production.
- Experience with tenant isolation, zero-trust identity models, and cloud security/compliance frameworks.
- Prior experience building and scaling an infrastructure team (e.g., hiring, mentoring, org design).
What We Do
Democratizing AI on the Modern Data Stack!
The team behind PyG (PyG.org) is working on a turn-key solution for AI over large-scale data warehouses. We believe the future of ML is a seamless integration between modern cloud data warehouses and AI algorithms. Our ML infrastructure massively simplifies the training and deployment of ML models on complex data.
With over 40,000 monthly downloads and nearly 13,000 GitHub stars, PyG is the ultimate platform for training and development of Graph Neural Network (GNN) architectures. GNNs -- one of the hottest areas of machine learning now -- are a class of deep learning models that generalize Transformer and CNN architectures and enable us to apply the power of deep learning to complex data. GNNs are unique in the sense that they can be applied to data of different shapes and modalities.








