Staff Site Reliability Engineer

Mountain View, CA, USA
In-Office
252K-308K Annually
Senior level
Fintech • Payments • Financial Services
The Role
Lead EarnIn's AI-first reliability engineering, enhancing incident response, automation, and resilience in operations while mentoring engineers.
About EarnIn

As one of the first pioneers of earned wage access, our passion at EarnIn is building products that deliver real-time financial flexibility for those with the unique needs of living paycheck to paycheck. Our community members access their earnings as they earn them, with options to spend, save, and grow their money without mandatory fees, interest rates, or credit checks.

We’re fortunate to have an incredibly experienced leadership team, combined with world-class funding partners like A16Z, Matrix Partners, DST, Ribbit Capital, and a very healthy core business with a tremendous runway. We’re growing fast and are excited to continue bringing world-class talent onboard to help shape the next chapter of our growth journey.

WHY this role exists

EarnIn’s products must deliver speed, reliability, resilience, and trust to community members who depend on them. As EarnIn grows, we cannot rely on heroics, tribal knowledge, manual investigation, or isolated SRE expertise. We must embed reliability practices that scale across product engineering teams, enhance customer experience, and enable rapid shipping without increasing operational risk.

This role exists to lead EarnIn’s next stage of reliability maturity: an AI-first operating model that uses AI to actively detect, investigate, respond to, learn from, and prevent production issues. As a Staff Site Reliability Engineer, you will guide technical direction for reliability across critical services, relying on AI-assisted workflows as key tools to reduce toil, speed incident response, improve production readiness, and enhance the operational quality of the engineering organization.

The base salary range for this full-time position is $252,000-$308,000, plus equity and benefits. Our salary ranges are determined by role, level, and location. This is a hybrid position in Mountain View (Headquarters) and will require in-office work 2 days a week.

HOW you will create impact

  • Act as a Staff-level technical leader: define standards, architect solutions, mentor engineers, influence cross-team efforts, and construct reusable systems and practices that multiply your impact.
  • Embed AI-first thinking into reliability practices, leveraging AI to streamline alert triage, accelerate incident investigation, automate runbooks, retrieve operational knowledge, enhance postmortem quality, track corrective actions, quantify reliability with scorecards, detect capacity risks, and analyze architectural risks.
  • Keep human ownership and engineering judgment at the center of operations. AI aids engineers by speeding context gathering, clarifying reasoning, and reducing repetition, but it does not replace accountability.
  • Collaborate with SRE, product engineering, infrastructure, security, and leadership teams to embed reliability, making it easy to adopt and impossible to ignore.

WHAT you will own

Reliability strategy and standards

  • Define and evolve reliability standards across critical services, including SLIs, SLOs, error budgets, production readiness, observability, incident response, and resilience patterns.
  • Establish a reliability operating model that clarifies service ownership, operational expectations, and decision-making around reliability tradeoffs for product engineering teams.
  • Use AI-assisted analysis to interpret reliability trends, detect weak operational signals, highlight capacity risks using pattern recognition, and generate actionable reliability scorecards for teams, clearly delineating where AI automates data gathering and insight generation.

AI-first incident response and operational workflows

  • Overhaul key stages of the incident lifecycle to achieve faster detection, sharper triage, richer context retrieval, clearer communication, and stronger follow-through.
  • Command high-severity incidents as Incident Commander and reinforce the systems, tools, and practices that simplify incident management.
  • Design and implement workflows in which AI assists with alert correlation, signal enrichment, root-cause exploration, runbook retrieval, postmortem drafting, and corrective-action tracking.
  • Ensure AI-assisted incident workflows remain reviewable, auditable, and safe by requiring human verification at all critical steps and maintaining clear operational ownership with humans accountable for final decisions.

On-call quality and toil reduction

  • Elevate on-call quality by silencing noisy alerts, automating repetitive investigations, and enabling responders to rapidly digest service context.
  • Build tools that gather context from systems like Datadog, CloudWatch, incident.io, Slack, runbooks, deployment history, and service metadata.
  • Transition teams from reactive paging to proactive reliability enhancement.

Architecture and resilience

  • Steer service designs for graceful degradation, failure isolation, robust capacity planning, and operational safety throughout EarnIn’s AWS environment.
  • Apply production data, incident learnings, and AI analysis to spot architectural risks before they recur.
  • Instruct engineering teams to embed reliability expectations into design reviews, launch protocols, and service evolution.

Mentorship and cross-org influence

  • Coach engineers in reliability practices, incident response, SLOs, observability, production debugging, and AI-assisted operational workflows.
  • Direct design reviews, incident reviews, and operational maturity discussions to improve engineering judgment across teams.
  • Produce documentation, tooling, and reusable patterns that unlock reliability knowledge and enable action.

WHAT you'll do

  • Set a reliability strategy with AI at the center. Define SLIs, SLOs, and error budgets across critical services. Use AI to surface trends, predict capacity risks, and auto-generate reliability scorecards so teams act on data.
  • Redesign the incident lifecycle around AI-assisted speed. Lead high-severity incident response as IC. Build AI-driven alert correlation and triage that reduces noise and accelerates root-cause identification. Drive adoption of AI-generated postmortems that surface systemic patterns and automatically track corrective actions through to completion.
  • Make on-call fundamentally better through automation. Build AI agents that draft runbook responses, pull relevant context from Datadog, incident.io, and Slack during pages, and recommend remediation steps, so on-call engineers spend less time deciding and searching.
  • Push AI-first operations into product engineering teams. Partner with product engineering to embed AI-assisted investigation, alerting, and production readiness into their workflows. Make AI tooling the default path for every team that owns a service, not an SRE-only capability.
  • Architect for resilience at scale. Guide service designs for graceful degradation, failure isolation, and capacity planning across EarnIn's AWS footprint (EKS, Kafka, DynamoDB, RDS, SQS). Use AI-driven analysis to identify architectural weak points before they become incidents.
  • Raise the bar through mentorship and standards. Coach engineers on reliability practices, run design and incident reviews, and build documentation and tooling that makes reliability knowledge accessible. Set the expectation that AI-assisted workflows are how EarnIn operates, not an experiment.

WHAT we're looking for

  • 7+ years in SRE, Software Engineering, or Infrastructure Engineering with increasing scope and cross-org influence. Track record of KPI-driven reliability and operational excellence improvements at scale.
  • Demonstrated experience improving reliability and operational excellence at scale using clear KPIs such as MTTR, MTTD, alert quality, incident recurrence, SLO attainment, on-call health, or corrective-action completion.
  • Shipped experience applying AI/LLMs to engineering or operational workflows, such as alert triage, runbook automation, incident investigation, postmortem drafting, remediation recommendation, operational knowledge retrieval, or agentic operations tooling.
  • Significant expertise with SLIs, SLOs, error budgets, incident command, blameless postmortems, and recurrence prevention in large-scale distributed systems.
  • Strong software engineering ability in Python, Go, or similar languages. You build tools and automation, not just dashboards.
  • Deep observability experience with systems such as Datadog, CloudWatch, OpenTelemetry, or similar platforms, with a bias toward signal-heavy alerting designed for real human response.
  • Strong infrastructure-as-code and cloud infrastructure experience, including Terraform, Kubernetes, AWS, and safe, reversible deployment practices.
  • Practical experience using AI-assisted development tools such as Cursor, Claude Code, Copilot, ChatGPT, or similar tools to accelerate your own engineering work and model effective adoption for partner teams.
  • Experience in fintech, regulated environments, SOC 2, PCI, FinOps, or cost/performance tradeoffs in high-scale systems is a plus.

#LI-Hybrid

At EarnIn, we believe that the best way to build a financial system that works for everyday people is by hiring a team that represents our diverse community. Our team is diverse not only in background and experience but also in perspective. We celebrate our diversity and strive to create a culture of belonging. EarnIn does not unlawfully discriminate based on race, color, religion, sex (including pregnancy, childbirth, breastfeeding, or related medical conditions), gender identity, gender expression, national origin, ancestry, citizenship, age, physical or mental disability, legally protected medical condition, family care status, military or veteran status, marital status, registered domestic partner status, sexual orientation, genetic information, or any other basis protected by local, state, or federal laws. EarnIn is an E-Verify participant. 

EarnIn does not accept unsolicited resumes from individual recruiters or third-party recruiting agencies in response to job postings. No fee will be paid to third parties who submit unsolicited candidates directly to our hiring managers or HR team.

Skills Required

  • 7+ years in SRE, Software Engineering, or Infrastructure Engineering
  • Experience applying AI/LLMs to operational workflows
  • Expertise with SLOs/SLIs, error budgets, incident command, and blameless postmortems
  • Software engineering ability in Python, Go, or similar
  • Experience with Datadog, CloudWatch, OpenTelemetry
  • Infrastructure-as-code proficiency with Terraform, Kubernetes, AWS
  • Experience with AI-assisted development tools
  • Experience in fintech or regulated environments

The Company
HQ: Palo Alto, CA
229 Employees
Year Founded: 2012

What We Do

Earnin’s mission is to build a financial system that works for people. Every year, while Americans wait for their paychecks, more than $1 trillion of their hard-earned money is held up in the pay cycle. As a result, we accumulate over $50 billion in late and overdraft fees and turn to high-interest loans. We seek to eliminate those fees and put money back into workers’ hands. Our financial system doesn’t work for people. But Earnin does. Earnin is an app that lets people get paid as soon as they leave work, with no fees, interest, or hidden costs. App users can receive their money in their bank account instantly at little or no cost, as we operate on a pay-what-you-choose model. All they need is a bank account and a job that provides direct deposit or uses electronic timesheets. At Earnin, we’re building the way we think a financial system should work for everyone, not just the people who can afford it. We help people take control of their money and get to a better financial place. Our goal is not only to provide great products at little or no cost to the people who need them but also to inspire kindness across the financial world and eventually across all industries.
