Site Reliability Engineer

San Francisco, CA, USA
In-Office
Senior level
Artificial Intelligence • Healthtech
The leading AI workflow automation platform built for healthcare
The Role
The Site Reliability Engineer will enhance system reliability, define observability standards, respond to incidents, and collaborate with engineering teams on performance and compliance improvements.
Summary Generated by Built In

About Plenful

Plenful is on a mission to move pharmacy forward through intelligent automation. We build AI-powered software that eliminates administrative burden, strengthens compliance, and unlocks revenue across critical pharmacy workflows, solving one of the biggest challenges in healthcare today: delayed patient care.


Built by a passionate team of former healthcare operators and world-class AI technologists, Plenful combines deep domain expertise with enterprise-grade technology to automate complex workflows across intake authorization, 340B program optimization, and pharmacy revenue reconciliation. Our AI platform is trusted by 95+ leading healthcare organizations to power smarter, faster, and more resilient pharmacy operations.


Backed by leading investors including Notable Capital, Bessemer Venture Partners, and TQ Ventures, Plenful is building the institutional memory for healthcare and powering the most complex, highest ROI healthcare workflows. We’re actively hiring as we continue to scale.

About the role
We’re hiring a Site Reliability Engineer (SRE) to ensure the reliability, performance, and scalability of Plenful’s production systems as we continue to grow.


This role is centered on operating real systems at scale — not just building infrastructure, but deeply understanding how it behaves under load, fails in production, and recovers. You’ll define reliability standards, own production health, and build the feedback loops that make our systems more resilient over time.

You’ll work closely with backend, data, and ML engineers to ensure our platform is highly available, measurable, and continuously improving. This includes everything from incident response and performance debugging to SLO design and system-level optimization.
What You’ll Do

Reliability Engineering & System Ownership
  • Define and implement SLIs, SLOs, and error budgets across core services
  • Own production system health, including uptime, latency, and availability targets
  • Continuously improve system resilience through proactive reliability work
  • Identify and mitigate single points of failure across distributed systems


Production Operations & Incident Response

  • Participate in and improve on-call rotations and incident response processes
  • Lead incident triage, mitigation, and resolution in real time
  • Conduct blameless postmortems and ensure follow-through on action items
  • Build tooling and automation to reduce MTTR (Mean Time to Recovery)


Observability & System Insight

  • Design and evolve observability systems across:
    • Metrics, logs, and distributed tracing (OpenTelemetry)
    • Tooling including Datadog, CloudWatch, Grafana, Sentry
  • Improve signal quality to reduce noise and alert fatigue
  • Develop dashboards and alerts that reflect real system health and user impact
  • Use observability data to drive performance and reliability improvements


Performance & Scalability

  • Analyze system performance under load and identify bottlenecks
  • Optimize latency, throughput, and resource utilization across:
    • Serverless systems (AWS Lambda)
    • Containerized services (ECS)
    • Data systems (Aurora Postgres, ClickHouse)
  • Partner with engineering teams to improve system efficiency and scaling behavior


Automation & Reliability Tooling

  • Build automation to eliminate repetitive operational work
  • Improve deployment safety through reliability checks and safeguards
  • Contribute to CI/CD pipelines (GitHub Actions) with a focus on system stability
  • Develop tools for:
    • Incident response
    • Debugging
    • Capacity planning


Security, Compliance & Operational Maturity

  • Partner with security and compliance to ensure systems meet operational standards
  • Support audit readiness and reliability-related compliance requirements (Vanta)
  • Integrate monitoring and alerting into security and SIEM workflows
  • Help mature operational practices across the engineering team


Environment & Technical Context

You’ll work across a modern distributed stack:

  • Cloud: AWS (ECS, Lambda, RDS Aurora Postgres, CloudWatch)
  • Infrastructure: Terraform, Ansible, Linux
  • CI/CD: GitHub Actions
  • Observability: Datadog, Grafana, CloudWatch, OpenTelemetry, Sentry, pganalyze
  • Data Systems: Postgres, ClickHouse
  • Security & Compliance: Vanta, SIEM tooling
  • Product & Analytics: Amplitude
  • ML/Platform Infra: TrueFoundry


What Success Looks Like

  • Clear, enforced SLOs and error budgets across critical systems
  • Incidents are well-managed, rare, and decrease over time
  • Engineers have high-confidence signals about system health
  • Alerts are actionable, not noisy
  • Systems scale predictably under load without degradation
  • Postmortems lead to real, measurable improvements
  • Reliability is treated as a shared engineering responsibility, not a reactive function


Ideal Background

Must Have
  • 5+ years in Site Reliability Engineering, SRE-adjacent roles, or production infrastructure
  • Strong experience operating and debugging distributed systems in production
  • Hands-on experience with:
    • Observability tooling (Datadog, Grafana, OpenTelemetry, etc.)
    • Incident response and on-call practices
    • Performance and reliability debugging
  • Experience defining and working with SLOs / SLIs / error budgets
  • Familiarity with:
    • AWS environments
    • Serverless and container-based architectures
    • Postgres or similar relational databases
  • Ability to write code/scripts (Python, Bash, etc.) for automation and tooling
  • Strong systems thinking and ability to reason about failure modes


Nice to Have

  • Experience in high-growth or high-scale environments
  • Background in regulated industries (healthcare, fintech)
  • Experience with ClickHouse or analytical systems at scale
  • Familiarity with chaos engineering or load testing frameworks
  • Exposure to ML infrastructure or data platforms


Plenful perks
  • Comprehensive Benefits Package: Enjoy unlimited PTO, fully covered health insurance (medical, dental, and vision), meal stipend, health & wellness stipend, 401(k) matching, and stock options.
  • Mission-Driven, World-Class Team: Join an exceptional group of professionals aligned around a meaningful mission and committed to making an impact.
  • Opportunities for Growth: Strengthen your expertise through collaboration with experienced, high-performing leaders across the organization.
  • Flexible Work Environment: Employees based in the Bay Area enjoy two days per week in a brand-new downtown San Francisco office. Employees based in other cities enjoy a fully remote work environment with the ability to travel for collaboration.

Skills Required

  • 5+ years of professional engineering experience at a B2B SaaS company
  • Strong experience operating production systems in cloud environments, ideally AWS
  • Hands-on experience with serverless compute patterns, containerized services, distributed workflows and Postgres
  • Solid understanding of observability tooling, performance debugging and system behavior under load

Plenful Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Plenful and has not been reviewed or approved by Plenful.

  • Healthcare Strength: Health coverage is described as comprehensive medical, dental, and vision for employees and dependents.
  • Retirement Support: Retirement planning is supported through a 401(k) plan with company matching.
  • Wellbeing & Lifestyle Benefits: Lifestyle perks include a daily lunch stipend (for remote and in-office), wellness benefits, and multiple annual team offsites.

The Company
San Francisco, CA
43 Employees

What We Do

Plenful is an AI-powered workflow automation platform for pharmacy and healthcare operations, streamlining manual and administrative tasks to reduce costs and drive revenue. Trusted by leading pharmacy and healthcare teams, Plenful provides highly configurable automation solutions for 340B auditing and savings identification, document data entry, inventory management, and other high-ROI use cases. Backed by Bessemer Venture Partners, TQ Ventures, Mitch Rales (Cofounder & Chairman of Danaher), Susa Ventures, Waterline Ventures, and other leading healthcare and software investors.
