Staff Site Reliability Engineer

Posted 7 Days Ago
Marina del Rey, CA
Hybrid
190K-210K Annually
Expert/Leader
Digital Media
Zefr is helping power the age of responsible marketing by enabling transparent, content-level targeting and measurement.
The Role
In this role, you'll design and manage cloud infrastructure, focus on CI/CD, and support observability for scalable applications while collaborating with engineering teams.
Summary Generated by Built In
What we do:

Zefr is the leading global technology company enabling responsible marketing in walled garden social environments. Zefr’s solutions empower brands to manage their content adjacency on scaled platforms such as YouTube, Meta, TikTok, and Snap, in accordance with industry-standard frameworks. Through its patented AI technology, Zefr offers brands and agencies more accurate and transparent solutions for social walled gardens. The company is headquartered in Los Angeles, California, with additional locations across the globe.

What you’ll do:

As a Site Reliability Engineer at Zefr, you’ll apply your expertise in cloud infrastructure, CI/CD, observability, and core SRE concepts to deliver high-quality, reliable, and scalable solutions. A significant aspect of this role involves working closely with the rest of Zefr’s Engineering and Data Science teams to ensure the specialized infrastructure required for our services is robust, efficient, and scalable.

We’re looking for someone who combines technical expertise with strong leadership and a passion for continuous improvement and innovation. Zefr wants a candidate who champions reliability as a product feature and can translate complex technical concepts into strategy. This is a role where you’ll shape how we build and operate systems at scale.

  • Support and build systems and tools that enable other engineers to generate, deploy, and manage product features and models both quickly and safely.

  • Deploy and support a multi-cloud, microservice architecture, including infrastructure tailored for ML workloads, deployed via GitHub Actions, Argo CD, and Kubernetes.

  • Collaborate with other engineers, particularly the Machine Learning team, to architect secure, resilient, scalable, and cost-efficient applications and ML systems/pipelines in AWS and GCP.

  • Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams.

  • Proactively maintain the health of production environments, including monitoring application performance and resource utilization.

  • Participate in a 24/7 on-call rotation and respond to system performance issues and outages.

  • Debug code at the application and infrastructure level.

  • Mature our CI/CD workflows and release process.

  • Maintain a forward-thinking approach, actively researching and proposing new solutions.

  • Propose and review Engineering Requests for Comments (RFCs) to drive Engineering architecture and practices.

Technology Stack at Zefr:

Core Infrastructure & Cloud Platforms:

  • Cloud Providers: Google Cloud Platform (primary), Amazon Web Services

  • Infrastructure as Code (IaC): Terraform, Terragrunt

  • Containerization & Orchestration: Docker, Kubernetes (experience with GKE and/or EKS expected), Helm, Kustomize

  • Service Mesh: Istio

CI/CD & Automation:

  • CI/CD Pipelines: GitHub Actions

  • GitOps / Continuous Delivery: Argo CD

  • Primary Scripting/Automation Language: Python

Observability & Monitoring:

  • Monitoring & Alerting: Prometheus, Chronosphere, PagerDuty

  • Telemetry Standards: OpenTelemetry

Application & Data Ecosystem (Supporting):

  • Application Languages/Frameworks: Python, FastAPI, Flask, Node.js, React

  • Data Streaming: Apache Kafka

  • Data Processing/Transformation: Pandas, dbt

  • Workflow Orchestration: Apache Airflow, Ray

Data Stores & Databases:

  • Relational Databases: PostgreSQL (including managed versions like AWS Aurora, GCP Cloud SQL)

  • NoSQL Databases: DynamoDB

  • Search Databases: OpenSearch

  • Vector Databases: Qdrant

  • Caching: Redis

  • Data Warehousing: Snowflake

What we’re looking for:
  • 7+ years of experience designing, managing, deploying, and supporting cloud infrastructure in a production environment using major public cloud providers (GCP experience is a huge bonus)

  • Knowledge of GitOps, including an understanding of modern CI/CD pipelines, techniques, and technologies (GitHub Actions, GitLab, CircleCI, Argo CD, Flux)

  • Proficiency with IaC and configuration management tools (Terraform, Terragrunt, OpenTofu, Crossplane, Pulumi)

  • Production experience architecting, managing, deploying, and supporting container-based workloads on Kubernetes clusters

  • Strong problem-solving experience, focusing on automation

  • Proven track record of building and scaling reliability practices, including SLO/SLI frameworks, incident management, and capacity planning

  • Extensive production experience with observability platforms and practices (Prometheus, Grafana, Chronosphere, Datadog, OpenTelemetry), and the ability to design monitoring strategies for complex distributed systems

  • Knowledge of cloud networking (service mesh, NAT, load balancers, API gateways, proxies, etc.), cloud security, and cost optimization strategies

  • Strong written and verbal communication, organization, and documentation skills

Benefits (for US-based employees):
  • Flexible PTO

  • Medical, dental, and vision insurance with FSA options

  • Company-paid life insurance

  • Paid parental leave

  • 401(k) with company match

  • Professional development opportunities

  • 10+ paid holidays

  • Summer Fridays (we leave early)

  • In-office, hybrid, and fully-remote work options available

  • In-office lunches and lots of free food

  • Optional in-person and virtual events (we like to celebrate!)

Compensation (for US-based employees):

The anticipated salary for this position is between $190,000 and $210,000. Within the range, individual pay is determined by factors such as job-related skills, experience, and relevant education or training. If your compensation expectations fall outside of this range, it may still be worth having a conversation.

Zefr is an equal opportunity employer that embraces diversity and inclusion in the workplace. We are committed to building a team that represents a variety of backgrounds, skills, and perspectives because we know this only makes us better. We strongly encourage women, persons of color, LGBTQIA+ individuals, persons with disabilities, members of ethnic minorities, foreign-born residents, and veterans to apply even if you do not meet 100% of the qualifications.

Top Skills

Amazon Web Services
Apache Airflow
Apache Kafka
Argo CD
Chronosphere
dbt
Docker
DynamoDB
FastAPI
Flask
GitHub Actions
Google Cloud Platform
Helm
Istio
Kubernetes
Kustomize
Node.js
OpenSearch
OpenTelemetry
PagerDuty
Pandas
PostgreSQL
Prometheus
Python
Qdrant
Ray
React
Redis
Snowflake
Terraform
Terragrunt

The Company
HQ: Marina del Rey, CA
200 Employees
Year Founded: 2009

What We Do

Zefr is the leading data and technology company that enables responsible marketing for brands, agencies and platforms. The company leverages a patented AI and machine learning engine, called Cognition AI, to offer brands and agencies more accurate and transparent activation and measurement solutions on scaled video platforms.

Why Work With Us

As employees work to bring responsible marketing to the digital advertising ecosystem, we work to improve the lives of our employees both at work and at home. We are committed to building a team that represents a variety of different backgrounds, skills, and perspectives because we know this only makes us better.
