Senior Platform & Reliability Engineer (SRE)

Reposted 4 Days Ago
San Francisco, CA, USA
Hybrid
Senior level
Artificial Intelligence • Information Technology • Software
The Role
Lead end-to-end platform reliability: define SLIs/SLOs, harden production architecture, ensure Kubernetes runtime and queue safety, run incident command for Sev1/Sev2, own observability/on-call/runbooks, and gate risky releases while delivering a prioritized reliability roadmap.
Summary Generated by Built In

Agency Notice: We are not currently working with recruiting agencies for this role. Please do not contact Vizcom employees regarding this position. Any resumes submitted without a prior agreement will be considered unsolicited.

About Vizcom

Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa + PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure.

We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale.

Role Mission

Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.
This is a hands-on technical leadership role with authority to set reliability standards and enforce production guardrails.

Compensation

$200,000 – $250,000 base salary + meaningful equity


What You’ll Own

  • Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.

  • Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.

  • Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.

  • Queue + job safety (BullMQ/Redis): Own poison pill containment and workload isolation.

  • Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).

  • Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.

  • Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk.

Traits We’re Looking For

  • Calm, structured incident commander under pressure.

  • Thinks in failure modes and blast radius by default.

  • Pragmatic: can stabilize quickly, then implement durable fixes.

  • High ownership and strong written communication.

First 90 Days

  • Establish baseline reliability metrics and identify top platform risks.

  • Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).

  • Deliver high-impact hardening fixes across probes/startup paths/queue safety.

  • Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.

If possible, please describe one incident you personally led and send it to [email protected], covering:

1) what failed,

2) how you contained it,

3) what permanent fixes you shipped and how you measured them.

Skills Required

  • Set and enforce SLIs, SLOs, and error budgets for critical user flows
  • Design failure isolation across API, workers, queues, and dependencies
  • Kubernetes runtime reliability: probes, rollout/rollback, graceful shutdown, scaling policies
  • Queue and job safety with BullMQ/Redis, including poison pill containment and workload isolation
  • Lead Sev1/Sev2 incident response: containment, communications, technical direction, RCA, and corrective action execution
  • Own observability quality, on-call effectiveness, runbooks, and postmortem discipline
  • Authority to gate risky deploys and enforce release reliability guardrails
  • Calm, structured incident command under pressure with strong written communication and high ownership

The Company
HQ: San Francisco, California
56 Employees
Year Founded: 2021

What We Do

Building tools that shorten the distance between having ideas and bringing them to life. https://linktr.ee/vizcom_
