Vizcom

Senior Platform & Reliability Engineer (SRE)

San Francisco, CA, USA

Hybrid

Senior level

Artificial Intelligence • Information Technology • Software

The Role

Lead end-to-end platform reliability: define SLIs/SLOs, harden production architecture, ensure Kubernetes runtime and queue safety, run incident command for Sev1/Sev2, own observability/on-call/runbooks, and gate risky releases while delivering a prioritized reliability roadmap.

Agency Notice: We are not currently working with recruiting agencies for this role. Please do not contact Vizcom employees regarding this position. Any resumes submitted without a prior agreement will be considered unsolicited.

About Vizcom

Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa + PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure.

We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale.

Role Mission

Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.
This is a hands-on technical leadership role with authority to set reliability standards and enforce production guardrails.

Compensation

$200,000 – $250,000 base salary + meaningful equity

What You’ll Own

Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.

Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.

Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.

Queue + job safety (BullMQ/Redis): Own poison pill containment and workload isolation.

Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).

Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.

Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk.

Traits We’re Looking For

Calm, structured incident commander under pressure.
Thinks in failure modes and blast radius by default.
Pragmatic: can stabilize quickly, then implement durable fixes.
High ownership and strong written communication.

First 90 Days

Establish baseline reliability metrics and identify top platform risks.

Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).

Deliver high-impact hardening fixes across probes/startup paths/queue safety.

Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.

If possible, please write up one incident you personally led and send it to [email protected], covering:

1) what failed,

2) how you contained it,

3) what permanent fixes you shipped and measured.

Skills Required

Set and enforce SLIs, SLOs, and error budgets for critical user flows
Design failure isolation across API, workers, queues, and dependencies
Kubernetes runtime reliability: probes, rollout/rollback, graceful shutdown, scaling policies
Queue and job safety with BullMQ/Redis, including poison pill containment and workload isolation
Lead Sev1/Sev2 incident response: containment, communications, technical direction, RCA, and corrective action execution
Own observability quality, on-call effectiveness, runbooks, and postmortem discipline
Authority to gate risky deploys and enforce release reliability guardrails
Calm, structured incident command under pressure with strong written communication and high ownership
