Staff Site Reliability Engineer

Posted 2 Days Ago
Hiring Remotely in USA
Remote
140K-170K Annually
Senior level
Artificial Intelligence • Healthtech • Software • Telehealth
Solving healthcare capacity. Powering boundless, seamless care experiences. Streamlining virtual & in-person care.
The Role
The Staff Site Reliability Engineer will architect and manage AWS and Kubernetes infrastructure, focusing on automation, observability, and compliance in healthcare operations.
About Fabric Health
At Fabric Health, we are powering boundless care by solving healthcare’s biggest challenge: clinical capacity. We aren’t here to disrupt healthcare; we’re here to fix it. We unify the care journey from intake to treatment, using intelligent automation to remove administrative burdens and make care delivery 2-10x more efficient. Our technology empowers clinicians to move faster and focus on what matters most: the patient.

We are a mission-driven team of brilliant minds trusted by leading organizations including Intermountain Health, OSF HealthCare, SSM Health, and MUSC Health. Our vision is backed by premier investors such as Thrive Capital, GV (Google Ventures), General Catalyst, and Salesforce Ventures. We move quickly for good reason, listen deeply to solve big challenges, and build products with the same care and quality we’d want for our own loved ones.


About the Role
As a Staff Site Reliability Engineer, you will own and evolve the infrastructure powering healthcare experiences for millions of patients. This role bridges the gap between traditional infrastructure excellence and the future of AI-driven operations. You will act as a primary architect for our AWS and Kubernetes (EKS) environment, ensuring the platform is resilient, scalable, and compliant while exploring how agentic workflows can modernize SRE practices.

What You'll Do
As a Staff Site Reliability Engineer, you will be a steward of Fabric’s production integrity, leading the strategy for infrastructure automation, observability, and system resilience. Your primary responsibilities include:

  • Infrastructure & Kubernetes Orchestration
    • Designing, deploying, and maintaining production Kubernetes (EKS) clusters to ensure enterprise-grade availability for our users.
    • Eliminating manual configuration by building and managing a scalable infrastructure state entirely through Terraform.
    • Optimizing the AWS footprint—specifically EC2, RDS, and S3—to balance high performance with cost-efficiency and reliability.
  • AI-Assisted Operations & Automation
    • Exploring and deploying agentic workflows for AI-assisted runbooks that automate complex operational decisions and repetitive tasks.
    • Building and evolving deployment pipelines using GitHub Actions or Semaphore to ensure delivery is both rapid and safe.
    • Focusing on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems.
  • Observability & Incident Management
    • Driving the evolution of the observability stack in Datadog by implementing the sophisticated metrics, traces, and logs needed to meet SLOs.
    • Leading incident response efforts and facilitating the blameless postmortems that help systematically reduce mean time to recovery (MTTR).
    • Defining and monitoring the SLIs and SLOs that ensure the platform consistently meets rigorous healthcare performance standards.
  • Compliance & Collaboration
    • Ensuring every piece of infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements.
    • Mentoring engineers across the company on reliability best practices and contributing a clinical-safety perspective to cross-functional design reviews.

Why You Might Be a Good Fit
  • You are a deeply proficient engineer who excels at the intersection of cloud infrastructure, automation, and system design.
  • You possess a meticulous approach to observability and a passion for finding the "root cause" rather than just applying a patch.
  • You enjoy exploring the "next frontier" of SRE, including how AI and agentic tools can make operations more efficient.
  • You thrive in fast-paced environments where technical rigor is balanced with pragmatism and clinical-grade safety.

This Might Not Be The Right Fit If...
  • You prefer working on static infrastructure rather than evolving systems through code and automation.
  • You are uncomfortable with the "agile" pace of tech-driven platform development or integrating AI tools into your daily workflow.
  • You prefer a siloed role that does not involve active participation in incident response or collaborative postmortems.

Your Qualifications
  • 8+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale.
  • Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management.
  • Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems.
  • Strong coding and scripting skills in Python, Bash, Ruby, or Go.
  • Experience building agentic workflows or AI-assisted tooling to drive operational efficiency (preferred).
  • A "rigor-first" mindset with a dedication to HIPAA-compliant, high-availability architecture.

The national pay range for this role is $140,000.00 – $170,000.00 per year. Actual compensation will be determined by factors such as the candidate's geographic market, experience, skills, and qualifications. Certain roles may also be eligible for additional compensation, including stock options and bonuses, and a comprehensive benefits package that includes medical, dental, and vision coverage, unlimited PTO, and a 401(k) plan. If your compensation requirement is greater than our posted range, please still consider applying; a determination can be made based on unique qualifications. Expected compensation ranges for this role may change over time.

At Fabric, we believe that a diverse workforce is essential to our success. We are an equal opportunity employer and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, color, religion, sex, national origin, age, disability, veteran status, or any other legally protected characteristic. We actively encourage individuals from all backgrounds to apply.

Recruitment Fraud Alert: Protect Yourself
Fabric Health is aware of scammers attempting to impersonate employers. To ensure that any recruiting contact you receive is legitimate, please adhere to the following:

  • Verify the Domain: Official recruitment emails will only come from addresses ending in @fabrichealth.com or @gem.com. No other domain names are legitimate.
  • Official Interview Tools: We use Gem for our recruitment process and Google Meet for all video interviews. Google Meet is always the platform used for your first interview; you will never be sent a Zoom link to set up or conduct an initial interview. All interviews are conducted via video unless specifically stated by our team as an audio call. We never conduct interviews via chat, social media, Skype, or WhatsApp.
  • Zoom Usage: Zoom is utilized only for specific meetings set directly by our team for purposes outside of the standard interview process (e.g., coordination or onboarding discussions). It is never the first link you will receive from us.
  • Authorized Contact & Texting: Fabric will only contact you if you have submitted an application or if you are connected to a current employee who shared your information with us. We will only send text messages if you have provided explicit authorization and consent, either through your application or while communicating directly with our team. If you have not explicitly authorized us to reach out, treat any SMS or unsolicited outreach as fraudulent and do not respond.
  • Sensitive Data: We will never ask you for sensitive personal or financial documents (ID, banking info, SSN) during the application, interview, or candidacy stages. All sensitive data is handled through secure internal systems post-offer.
  • Verify the Team: You can reference LinkedIn to verify members of our recruiting team; however, please remain vigilant as scammers may create fraudulent profiles. Always cross-reference the sender's email domain with our official @fabrichealth.com address.

If you question the validity of a contact or receive a suspicious message, do not click any links. Report the issue immediately to [email protected].

Please note: The security inbox is for reporting fraudulent activity only. Do not email this address for application status updates or to share application materials, as these will not be reviewed. Applications are only accepted and reviewed if submitted through our official application portal, and no application status information will be provided via the security email. 

Top Skills

AWS
Bash
CI/CD Systems
Datadog
EKS
GitHub Actions
Go
Kubernetes
Python
Ruby
Semaphore
Terraform

The Company
HQ: New York, NY
304 Employees

What We Do

Fabric Health is a pioneering care access platform dedicated to solving one of healthcare's most pressing challenges: operational capacity. We equip health systems, health plans, employers, and brokers with consumer-grade solutions designed to make healthcare faster, smarter, and more accessible at the exact moment of need. Our core mission is to empower boundless care through seamless, intuitive experiences that benefit both patients and providers.

At the heart of our offering is a commitment to streamlining the entire care journey. We achieve this by integrating and optimizing every touchpoint from initial intake and intelligent triage to precise routing and effective treatment, encompassing both virtual and in-person care modalities. Our platform is meticulously crafted to enhance operational efficiency, addressing the complex logistical problems that often hinder timely and quality care delivery.

What makes Fabric Health uniquely effective is our deep understanding of the patient and clinician experience. We develop robust automation workflows that significantly reduce administrative burdens on healthcare professionals, allowing them to focus invaluable time on what matters most: direct patient care. By automating repetitive tasks and streamlining processes, we enable clinicians to work more efficiently, leading to reduced burnout and a higher quality of service.

Our comprehensive suite of features and services includes an advanced AI Assistant that guides users, dynamic Engagement & Pathways to ensure continuity of care, a versatile Virtual Care Platform for remote consultations, and dedicated Virtual Care Services for immediate support. Furthermore, our Intake & Care Guides simplify initial patient interactions, while our Enterprise Features provide scalable solutions for large organizations. Ultimately, Fabric Health drives tangible outcomes: reduced care costs, expanded access to care, and the delivery of consistent, higher-quality patient experiences.

We are relentlessly focused on creating better, more intuitive interactions for consumers and equipping providers with the tools they need to deliver exceptional care efficiently. Every solution we develop is a testament to our dedication to transforming healthcare delivery, ensuring that timely, compassionate, and effective care is within everyone's reach.

Why Work With Us

Discover a culture of innovation, collaboration, and continuous growth. We offer a dynamic, remote-first environment where talented individuals thrive by tackling meaningful challenges. Contribute to cutting-edge solutions, work alongside inspiring peers, and truly make a difference in a flexible setting. Grow your career with purpose.
