We’re at the forefront of a once-in-a-generation change in the broadband industry. Join us as we innovate, help our customers reach their potential, and connect underserved communities with unrivaled digital experiences.
Role Overview
We are seeking a Staff Site Reliability Engineer (SRE) to lead our global platform reliability and drive our next-generation observability strategy on Google Cloud Platform (GCP). In this role, you will leverage Grafana Labs' complete telemetry stack and AIOps methodologies to build intelligent, self-healing infrastructure. You will bring deep expertise in scaling enterprise-grade Google Kubernetes Engine (GKE) topologies, managing high-throughput Kafka event streams, and maintaining high-performance PostgreSQL, AlloyDB, and BigQuery ecosystems at massive scale. Crucially, you will provide deep technical leadership across the entire networking stack, diagnosing complex issues from physical-layer transport up to application-layer protocols.
This position is 100% remote. You can work from anywhere in the United States or Canada with a reliable internet connection, collaborating with a distributed engineering organization across multiple time zones.
Key Responsibilities:
Full-Stack Network Architecture: Architect, optimize, and troubleshoot complex networking infrastructure spanning Layer 1 through Layer 7, ensuring low-latency data transport, secure edge routing, and seamless service mesh integration.
Grafana Stack Architecture: Design, scale, and optimize our unified observability platform using the Grafana Labs suite (Grafana, Mimir, Loki, Tempo, and Beyla).
AIOps & Intelligent Alerting: Deploy machine learning models and automated anomaly detection to cut through telemetry noise, reduce alert fatigue, and predict network or data pipeline bottlenecks.
GKE Platform Engineering: Drive the architecture, scaling, security, and networking of production Google Kubernetes Engine (GKE) clusters.
Data & Event Streaming Reliability: Tune and maintain high-throughput Apache Kafka clusters to guarantee low-latency event delivery and high availability.
Large-Scale Database Management: Ensure the performance, scalability, and disaster recovery readiness of our transactional and analytical data tiers across PostgreSQL, AlloyDB, and BigQuery.
Automated Incident Response: Integrate AIOps insights with Grafana workflows to automate triage, accelerate root-cause analysis, and trigger auto-remediation scripts.
Technical Leadership: Champion the long-term technical roadmap for distributed infrastructure engineering and GCP cloud-native observability standards.
Mentorship: Coach senior and junior engineers on advanced debugging techniques, distributed systems thinking, and intelligent operations across a distributed workforce.
Required Qualifications
Location/Work Style: Proven track record of high autonomy and successful delivery in a 100% remote engineering environment.
Experience: 8+ years in SRE, Production Engineering, or Distributed Systems infrastructure roles.
Networking Expertise (L1-L7): Deep technical knowledge and debugging mastery across all OSI layers, including:
L1-L3: Physical/fiber infrastructure awareness, switching, and advanced routing protocols (BGP, OSPF).
L4: Transport layer tuning (TCP congestion control algorithms, UDP, QUIC).
L5-L7: Session management, TLS termination, DNS architecture, and advanced application protocols (HTTP/3, gRPC).
Orchestration & Containerization: Expert-level mastery of Google Kubernetes Engine (GKE) internals, custom controllers, multi-cluster networking, and GitOps workflows.
Data Infrastructure: Proven track record managing high-throughput Apache Kafka pipelines and large-scale data environments across PostgreSQL, AlloyDB, and BigQuery.
Grafana Ecosystem: Deep, hands-on experience deploying and managing Grafana Enterprise/Cloud, Prometheus/Mimir, Loki, and Tempo at scale.
AIOps Implementation: Track record applying AI/ML techniques for time-series anomaly detection, log clustering, and correlation (e.g., Grafana Adaptive Metrics, BigPanda).
Infrastructure as Code: Advanced, production-scale expertise utilizing HashiCorp Terraform exclusively to provision and manage multi-region GCP cloud architectures.
Programming: High proficiency in Go and Python for building custom infrastructure tooling, Kubernetes operators, and data integration scripts.
Preferred Attributes
Remote Communicator: Exceptional written and verbal communication skills, with an emphasis on creating clear documentation for asynchronous alignment.
GCP Expert: Deep knowledge of Google Cloud architectural best practices, Cloud SDN, Cloud Armor, Interconnect, Identity and Access Management (IAM), and cost optimization.
Systems Thinker: Deep understanding of Linux internals, eBPF-based monitoring, kernel-level networking, and packet analysis tools (Wireshark, tcpdump).
The base pay range for this position varies by geographic location. More information about the pay range specific to the candidate's location and other factors will be shared during the recruitment process. Individual pay is determined by location of residence and multiple other factors, including job-related knowledge, skills, and experience.
San Francisco Bay Area:
156,400 - 265,700 USD Annual
All Other US Locations:
As a part of the total compensation package, this role may be eligible for a bonus. For information on our benefits click here.
Skills Required
- Proven track record delivering in a 100% remote engineering environment (US or Canada)
- 8+ years in SRE, Production Engineering, or Distributed Systems infrastructure roles
- Deep technical knowledge and debugging mastery across OSI layers L1-L7 (including physical/fiber, switching, routing, BGP, OSPF)
- Transport layer tuning expertise (TCP congestion control, UDP, QUIC) and session/application layer protocols (TLS termination, DNS, HTTP/3, gRPC)
- Expert-level mastery of GKE internals, custom controllers, multi-cluster networking, and GitOps workflows
- Proven track record managing high-throughput Apache Kafka pipelines for low-latency event delivery and high availability
- Experience ensuring performance, scalability, and DR readiness for PostgreSQL, AlloyDB, and BigQuery at scale
- Deep, hands-on experience deploying and managing Grafana Enterprise/Cloud, Prometheus/Mimir, Loki, and Tempo at scale
- Track record applying AI/ML for time-series anomaly detection, log clustering, and correlation (e.g., Grafana Adaptive Metrics, BigPanda)
- Advanced, production-scale expertise using HashiCorp Terraform exclusively to provision and manage multi-region GCP architectures
- High proficiency in Go and Python for building custom infrastructure tooling, Kubernetes operators, and data integration scripts
- Exceptional written and verbal communication skills for asynchronous remote collaboration and clear documentation
- Deep GCP knowledge: Cloud SDN, Cloud Armor, Interconnect, IAM, and cost optimization
- Deep understanding of Linux internals, eBPF-based monitoring, kernel-level networking, and packet analysis tools (Wireshark, tcpdump)
- Experience mentoring and coaching senior and junior engineers in distributed systems and observability practices
Calix Compensation & Benefits Highlights
The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Calix and has not been reviewed or approved by Calix.
- Flexible Benefits: Remote‑first policies include home‑internet reimbursement, home‑office furniture support, and work‑from‑anywhere flexibility. Feedback suggests these options make the package adaptable across locations and work styles.
- Healthcare Strength: Coverage spans medical, dental, and vision for employees and dependents alongside EAP and virtual therapy/coaching. Wellbeing elements like lifestyle allowances, recharge days, and no‑internal‑meeting days further bolster health support.
- Leave & Time Off Breadth: Paid vacation, wellness days, holidays, bereavement and jury‑duty leave offer broad time‑off access. Parental/bonding and caregiver leave, plus adoption assistance and medical‑travel coverage, extend support through major life events.
Calix Insights
What We Do
Innovative communications service providers rely on Calix platforms to help them master and monetize the complex infrastructure between their subscribers and the cloud. Calix is the leading global provider of the cloud and software platforms, systems, and services required to deliver the unified access network and smart premises of tomorrow. Our platforms and services help our customers build next generation networks by embracing a DevOps operating model, optimize the subscriber experience by leveraging big data analytics, and turn the complexity of the smart home and business into new revenue streams.







