What You’ll Do
- Design, build, and evolve Kumo’s multi-tenant infrastructure to support massive AI and data workloads across AWS, Azure, and GCP.
- Implement and maintain infrastructure-as-code to automate training and deployment pipelines across many environments.
- Operate and scale Kubernetes clusters with a focus on reliability, performance, availability, tenant isolation, and cost efficiency.
- Build observability and alerting into distributed systems using Prometheus, Grafana, OpenTelemetry, and related tooling.
- Partner closely with ML researchers and product teams to deliver production-grade infrastructure for advanced AI workloads.
- Drive security and operational best practices (RBAC, tenant isolation, cloud identity, etc.) across our platform.
What You Bring
- 3–5 years building or operating cloud-native infrastructure in production.
- Hands-on experience with at least one major cloud (AWS / Azure / GCP); multi-cloud exposure is a plus.
- Operational experience with Kubernetes and production-grade clusters.
- Proficiency with Infrastructure-as-Code (Terraform, Pulumi, etc.) and familiarity with GitOps tooling (ArgoCD, Flux, Argo Workflows).
- Strong debugging, systems-thinking, and communication skills: you can drive technical decisions and explain trade-offs to multiple stakeholders.
Nice to Have
- Experience operating multi-tenant Kubernetes for data / AI workloads.
- Experience with (managed) Spark or large-scale data processing systems.
- Familiarity with Kubernetes operators, controllers, and custom resources.
- Deep experience with monitoring/tracing/logging stacks (Prometheus, OpenTelemetry, etc.).
What We Do
Democratizing AI on the Modern Data Stack!
The team behind PyG (PyG.org) is working on a turn-key solution for AI over large-scale data warehouses. We believe the future of ML is a seamless integration between modern cloud data warehouses and AI algorithms. Our ML infrastructure massively simplifies the training and deployment of ML models on complex data.
With over 40,000 monthly downloads and nearly 13,000 GitHub stars, PyG is the ultimate platform for training and developing Graph Neural Network (GNN) architectures. GNNs, one of the hottest areas of machine learning today, are a class of deep learning models that generalize Transformer and CNN architectures and enable us to apply the power of deep learning to complex data. GNNs are unique in the sense that they can be applied to data of different shapes and modalities.









