Site Reliability Engineer - Platform

Posted 9 Days Ago
Redwood City, CA
129K-169K Annually
Mid level
Artificial Intelligence • Big Data • Machine Learning • Software
The Role
The Site Reliability Engineer will manage and optimize Kubernetes clusters and cloud infrastructure, ensuring reliability and scalability. Responsibilities include monitoring clusters, automating infrastructure processes, collaborating with teams, and implementing security best practices.
Summary Generated by Built In

C3.ai, Inc. (NYSE:AI) is a leading Enterprise AI software provider for accelerating digital transformation. The proven C3 AI Platform provides comprehensive services to build enterprise-scale AI applications more efficiently and cost-effectively than alternative approaches. The C3 AI Platform supports the value chain in any industry with prebuilt, configurable, high-value AI applications for reliability, fraud detection, sensor network health, supply network optimization, energy management, anti-money laundering, and customer engagement. Learn more at: C3 AI

We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team to manage, monitor, and optimize our C3 clusters on Kubernetes. The ideal candidate will have a deep understanding of Kubernetes, Cloud Infrastructure, and Infrastructure as Code (IaC) practices. You will be responsible for ensuring the reliability and scalability of our Kubernetes clusters and Cloud Infrastructure.

Responsibilities:

  • Monitor and Manage Kubernetes Clusters: Ensure the stability, health, and scalability of Kubernetes Clusters, deploying applications and services on Kubernetes.
  • Kubernetes Management: Deploy, monitor, and scale applications on Kubernetes clusters. Maintain Helm charts, manage services, and ensure resource allocation for optimal cluster performance.
  • Cloud Infrastructure Management: Work with leading Cloud Platforms (AWS, GCP, Azure) to set up, configure, and manage infrastructure resources using Infrastructure as Code (Terraform, CloudFormation, etc.).
  • Monitoring & Incident Response: Set up monitoring solutions, define alerts, and manage the incident response process for any issues related to Jenkins, C3, or Kubernetes clusters.
  • Automate Infrastructure Processes: Build automation tools for scaling, monitoring, and maintaining infrastructure using modern tools like Terraform, Ansible, or equivalent.
  • Collaborate Across Teams: Work closely with development, services, and operations teams to ensure seamless integration between application development and infrastructure.
  • Security & Compliance: Ensure all systems follow security best practices and comply with relevant regulations. This includes role-based access, encryption, and automated vulnerability scanning.

Qualifications:

  • 3+ years of experience as an SRE, DevOps Engineer, or related role.
  • Hands-on experience with Kubernetes in production environments (managing clusters, deployments, services, and pods).
  • Proficiency in cloud platforms like AWS, GCP, or Azure, including managing infrastructure via IaC tools like Terraform, CloudFormation, or equivalent.
  • Familiarity with monitoring tools like Prometheus, Grafana, or equivalent.
  • Experience with Helm and managing Kubernetes applications via Helm charts.
  • Strong scripting and automation skills in languages like Bash, Python, or Groovy.
  • Experience with CI/CD tools, GitOps, and best practices for continuous integration and delivery pipelines.
  • Understanding of networking concepts and security best practices in a cloud-native environment.
  • Incident management experience, including setting up on-call rotations, managing runbooks, and post-incident reviews.

C3 AI provides excellent benefits, a competitive compensation package, and a generous equity plan.

California Pay Range

$129,000 - $169,000 USD

C3 AI is proud to be an Equal Opportunity and Affirmative Action Employer. We do not discriminate on the basis of any legally protected characteristics, including disabled and veteran status. 

Top Skills

Bash
Groovy
Kubernetes
Python
The Company
Redwood City, CA
923 Employees
Hybrid Workplace
Year Founded: 2009

What We Do

C3 AI is the leading Enterprise AI software provider for accelerating digital transformation.

Digital transformation is about leveraging big data and the internet of things to improve performance of assets and predict shortfalls before they happen — all through artificial intelligence and machine learning. Get ahead of supply chain delays before they affect your delivery deadlines. Predict maintenance needs to increase asset uptime, replacing or repairing parts before failure. Reduce energy costs and track sustainability goals in real time, improving building operations and reducing greenhouse gas emissions. Connect disparate health record systems to optimize patient visits and decrease waitlist time.

At the core of all C3 AI products is a proprietary, model-driven AI architecture that dramatically enhances data science and application development. The C3 AI Platform allows customers to develop, deploy, and operate large-scale AI, predictive analytics, and IoT applications. And a broad portfolio of turnkey AI applications allows for even faster development and deployment. From reliability and readiness to supply chain optimization and energy management, C3 AI has deep industry expertise to get your enterprise started on its digital transformation.
