Description
We are looking for a DevOps Tech Lead to take ownership of our Cloud Infrastructure and Platform Engineering strategy, enabling high-scale, cutting-edge GenAI products running across 40+ Kubernetes clusters on GCP and AWS.
This role is a blend of hands-on engineering and technical leadership, requiring deep expertise in cloud-native technologies, Kubernetes at scale, and modern DevOps principles. You will work closely with engineering teams to design and implement scalable infrastructure solutions, optimize developer workflows, and ensure reliability and efficiency across our platform.
Role and Responsibilities
- Cloud & Kubernetes Expertise: Design and implement highly scalable multi-cluster Kubernetes environments across GCP & AWS.
- Developer Experience & Enablement: Lead the development of self-service tools and automation that improve efficiency for R&D teams.
- Incident & Reliability Engineering: Work with engineering teams to optimize cost, performance, and reliability of production infrastructure through monitoring, capacity planning, and scaling strategies.
- Security & Governance: Contribute to best practices for RBAC, IAM, cloud security, and compliance while ensuring infrastructure reliability.
- Automation & Infrastructure as Code: Drive adoption of GitOps workflows and Infrastructure as Code (Terraform, Helm, Crossplane) to enhance automation and consistency.
- Mentorship & Team Growth: Provide technical mentorship within the platform engineering team and contribute to knowledge-sharing across R&D.
- Cross-Team Collaboration: Work closely with engineering teams to align cloud infrastructure goals with business needs and reliability requirements.
- Technology Assessment: Assess and advocate for new technologies that improve reliability, efficiency, and scalability within the platform.
Requirements
Technical Expertise:
- 8+ years of DevOps, SRE, or Platform Engineering experience.
- 6+ years working with public cloud platforms (AWS/GCP) at scale.
- Deep Kubernetes expertise, including managing large-scale, multi-cluster enterprise-grade Kubernetes environments.
- Experience designing and managing Custom Resource Definitions (CRDs) and custom controllers.
- Strong background in Infrastructure as Code (Terraform, Helm) and GitOps principles (ArgoCD, Crossplane, FluxCD, etc.).
- Hands-on experience in observability & monitoring (Prometheus, Grafana, Datadog, OpenTelemetry, etc.).
- Proficiency in scripting & automation (Python, Go, Bash) for infrastructure automation.
- Expertise in cloud networking (VPC, load balancers, service meshes) and security best practices (RBAC, IAM, security groups, network policies, etc.).
- Experience building and optimizing CI/CD pipelines for performance, security, and developer velocity.
Leadership & Execution:
- Ability to design and implement platform solutions, working closely with engineering teams.
- Experience mentoring engineers through code reviews, technical talks, documentation, and hands-on collaboration, while sharing knowledge across teams.
- Strong incident management skills, including on-call experience, root cause analysis, and postmortems.
- Passion for automation, self-service, and building internal tools to streamline workflows.
- Ability to influence engineering teams by driving adoption of DevOps best practices and fostering a culture of automation, collaboration, and continuous improvement.
Nice-to-Have:
- Experience with self-hosted on-prem deployments and managed private VPC deployments (Bring Your Own Cloud models).
- Advanced expertise in Helm and Crossplane for Kubernetes resource management.
- Experience in GenAI or large-scale SaaS platforms.
- Familiarity with SQL/NoSQL databases and distributed systems.
- DevSecOps experience, with a strong understanding of security automation and compliance frameworks.
About Us
AI21 Labs is pioneering the development of Foundation Models and AI Systems for enterprises, accelerating the adoption of Generative AI in production.
Established in 2017 by AI visionaries Prof. Amnon Shashua, Prof. Yoav Shoham, and Ori Goshen, AI21 Labs is on a mission to equip businesses with cutting-edge LLMs and AI capabilities. We are backed by leading investors, including Pitango, Google, Nvidia, Intel Capital, and Comcast Ventures.
Join us on this exciting journey and advance your career with AI21 Labs!
What We Do
AI21 is pioneering the development of enterprise AI systems and foundation models. Our mission is to transform cutting-edge deep tech research into enterprise-ready AI systems. We offer privately deployed models with unmatched security, privacy, and reliability, along with tailored solutions for every organization. Founded in 2017, AI21 has raised $336 million from leading investors including NVIDIA, Google, and Intel.