Senior Platform & Reliability Engineer

San Francisco, CA, USA
Hybrid
Senior level
Artificial Intelligence • Machine Learning • Software • Generative AI
The Role
Help design, scale, and improve platform reliability: define SLOs/SLIs, run on-call and incident response, build observability, improve resilience to external dependencies, enhance CI/CD and deploy safety, optimize cost and capacity, and influence infrastructure architecture.

🧑🏼‍💻 Senior Platform & Reliability Engineer

🎨 About OpenArt

OpenArt is an AI Storytelling and Visual Creation Platform used by millions worldwide. We’re building the next generation of creative tools powered by cutting-edge AI, enabling anyone to create videos, visuals, characters, and stories with unprecedented speed and imagination.

We believe the future of creativity is AI-native, and we're shaping that future.

🚀 Why Join OpenArt

  • Small team, massive surface area: senior engineers own real systems, not slices.

  • Ship at real scale: your work reaches millions of users, fast.

  • Founder-led engineering culture: both founders are technical and deeply involved in product and architecture.

  • AI-native product: you’ll design how cutting-edge AI models are exposed as real user experiences.

  • High ownership, low process: we value judgment, clarity, and speed over bureaucracy.

  • 7-10X revenue growth over the past 2 years. You’ll play a critical role in helping the company scale to the next stage.

🎯 About the Role

We’re looking for a Senior Platform & Reliability Engineer to help design, scale, and improve the reliability of our infrastructure, from architectural decisions to hands-on implementation, observability, and cost optimization.

This is not a traditional ops or DevOps role. You’ll work across cloud infrastructure, distributed systems, backend services, and developer tooling, making pragmatic decisions that balance product velocity, system reliability, and cost efficiency—in a fast-moving, AI-native environment.

You’ll partner closely with product engineers to evolve the platform that powers OpenArt, contributing to key decisions around infrastructure architecture, improving multi-provider AI reliability, and helping us scale systems to millions of users—while raising the overall engineering bar.

🛠 What You’ll Do

  • Define and operationalize SLOs/SLIs across critical user journeys (generation, editing, payments/credits, uploads), and use them to guide prioritization and tradeoffs.

  • Participate in an on-call rotation and improve incident response (alert quality, runbooks, escalation paths), including leading blameless postmortems and driving follow-through on action items.

  • Improve system resilience at external boundaries (AI providers, storage, etc.), including timeouts, retries, circuit breakers, and fallback strategies.

  • Build and maintain end-to-end observability (logs, metrics, traces, dashboards) so engineers can quickly understand “what broke” and “why.”

  • Strengthen deploy safety through CI/CD improvements, automated rollbacks, canary releases, and feature flag patterns.

  • Contribute to the evolution of our infrastructure architecture, helping evaluate when to extend serverless patterns vs. adopt containerized or more managed approaches as we scale.

  • Improve cost visibility and efficiency, including per-request cost attribution, caching strategies, and capacity planning.

  • Act as a strong technical contributor, helping improve engineering practices, tooling, and system design decisions across the team.

🧑‍💻 What We’re Looking For

Core Requirements

  • 5+ years building and operating production systems where reliability and scaling are important.

  • Strong software engineering skills — you can build and ship production code, not just configure infrastructure.

  • Experience with cloud-native systems (AWS or GCP), including serverless/event-driven architectures and at least one container-based approach (e.g., ECS/Fargate, Cloud Run, Kubernetes).

  • Solid understanding of observability and reliability practices: metrics, alerting, tracing, and incident response.

  • Experience designing resilient systems with external dependencies (timeouts, retries/backoff, idempotency, circuit breakers).

  • Ability to communicate technical tradeoffs clearly to engineers across different domains.

  • Comfortable operating in ambiguous, fast-moving environments and taking ownership of problems.

Nice to Have

  • Experience building internal platform abstractions (e.g., job orchestration, API layers, workflow systems) that improve team velocity.

  • Track record of improving reliability metrics (e.g., MTTR, SLO attainment, latency) or reducing infrastructure cost.

  • Experience working in a startup or high-growth environment, with broad ownership across systems.

⚙ Tech Stack You’ll Work With

GCP, Cloud Run, Modal, Upstash, Sentry, Amplitude, Firebase, Redis, React/Next.js, Node.js, TypeScript, Python, etc.

💰 Compensation

  • Competitive base salary and bonus program

  • Equity - meaningful ownership in what you build

  • High autonomy, high growth environment

🌍 Work Setup

  • Bay Area preferred (hybrid allowed)

  • Visa sponsorship available

  • We’ll consider remote


The Company
20 Employees
Year Founded: 2022

What We Do

OpenArt is a San Francisco-based AI creativity platform that enables users to generate, edit, and explore images, videos, characters, and audio using advanced AI models. Founded by ex-Googlers, the company's mission is to build the future of storytelling with AI, providing tools that make high-quality generative art and content creation accessible to both hobbyists and professional creators worldwide.
