Staff Software Engineer, Infrastructure

Posted Yesterday
Hiring Remotely in USA
Remote
Senior level
Healthtech • Insurance • Financial Services
The Role
Lead platform reliability: define SLOs/error budgets, own observability and deploy pipelines, harden integrations with dental systems, operate LLM-driven workflows safely, build incident practices, and raise engineering reliability across the company.
Summary Generated by Built In

About Wisdom

Wisdom blends industry expertise with advanced technology to make dental practices work better for everyone involved. We believe dentistry is about people, and we exist to make the future of dentistry stronger and more sustainable for dentists, their teams, and the patients they serve. We match administrative teams with expert billers and custom-built technology to take on the heavy lifting of dental billing while maximizing both dentists’ time in-office and their bottom line.

Coming off a fresh $21M Series A round of funding, we are looking for exceptional candidates to help us build a category-defining company. Wisdom has employees across the US.

About The Role

The roadmap isn't handed to you here. You'll help write it — and you'll be the reason it stays up.

As a Staff Software Engineer focused on Infrastructure at Wisdom, you'll set the technical direction for reliability across the company — and own the systems behind the systems: the deploy pipeline, the observability, the capacity controls, and the failure-handling that decide whether our agentic billing infrastructure quietly does its job or pages someone at 2am. This is a force-multiplier role on a small, high-trust team. Your job isn't just to fix what breaks; it's to make the whole organization operate at a higher reliability bar — to build the practices, the guardrails, and the instincts that mean fewer things break in the first place, and the team can handle the ones that do without you in the room.

Wisdom's stack is TypeScript, Node.js, React, Postgres, and AWS, with LLM-driven agents (Mastra, Anthropic) making high-stakes billing decisions in production. The problems we're solving — keeping inconsistent insurance integrations alive, making AI pipelines fail safe instead of failing loud, running HIPAA-compliant infrastructure that genuinely can't go down — are legitimately hard. We'd rather have someone energized by making things not break than someone who merely tolerates being paged when they do.

By the end of your first year, you'll have defined what reliability means at Wisdom and built the function to deliver it: a real observability and SLO practice, an incident process that runs without heroics, agentic pipelines that degrade gracefully instead of taking prod down with them, and a team that's measurably better at operating production because of how you've raised the bar. This is a fully remote role reporting directly to the Head of Engineering.

What You'll Own
  • Set the reliability strategy for the platform — SLOs, error budgets, and the operating standards for services that bill real money for real practices, and the technical roadmap to get us there

  • Own observability end-to-end: tracing, metrics, logging, and alerting (Datadog) that surface problems before users do, not after. Make it the default so any engineer can lead an incident, not just the person who wrote the code

  • Define how we operate AI-powered agentic workflows at scale — retries and backpressure, idempotency, graceful degradation, and capacity controls for LLM-driven pipelines. The failure modes here are new (batch blowups, stream drops, runaway cost, model misbehavior); you'll be writing the playbook the rest of the industry hasn't written yet, and setting the patterns the team builds against

  • Harden the integration surface with dental insurance carriers and practice management systems (Dentrix, Eaglesoft) — poorly documented, inconsistent, and the first thing to buckle under load

  • Own deploy and release engineering: fast, safe, reversible deploys; infrastructure as code (Terraform); and the unglamorous discipline that lets a Series A company ship many times a day without breaking things

  • Build the incident practice, not just lead incidents — the on-call rotation, the runbooks, the blameless post-incident culture, and the follow-up discipline that turns outages into permanent fixes the whole team owns

  • Raise the bar through others — set technical standards via code review, architecture guidance, and documentation that actually gets used, and level up how the entire engineering team reasons about reliability

  • Take on the ambiguous, undefined, company-level reliability problems and drive them to resolution without waiting for permission or a perfect brief

Who You Are
  • 8+ years running production systems, with a track record of operating at staff/principal scope — you've owned reliability for systems where downtime had real consequences and left them measurably better

  • You've operated at scale under pressure — services that had to stay up, incidents you led to resolution, and reliability practices you established that outlived your tenure and changed how teams worked

  • You multiply the people around you — your impact shows up in what others ship reliably, not only in what you touch directly. You've set standards, mentored engineers, and driven technical decisions across teams without needing the authority to mandate them

  • Deep AWS (or GCP) experience — you've deployed, operated, and debugged distributed services in production, and can reason from first principles when the runbook runs out

  • Strong with infrastructure as code (Terraform), containers and orchestration (ECS/Kubernetes), and CI/CD — the deploy path is yours to make boring

  • Hands-on production experience operating at least one major LLM API — OpenAI, Anthropic (Claude), or Google Vertex AI — with a focus on the operational reality: rate limits, retries, latency, cost, and what happens when the model misbehaves in a live system

  • Strong command of TypeScript/JavaScript — you can read and fix the application code, not just the infra around it; Python or Go a plus

  • Deep experience with relational databases — connection management, query performance, and reasoning about data integrity under load

  • You default to ownership and move toward the pager, not away from it

  • You're direct, intellectually honest, and collaborative — you surface bad news early, change your mind when the evidence warrants it, and write the postmortem that makes the whole team sharper

You'll Stand Out If You Have
  • Experience operating LLM / agentic systems in production — or with frameworks like Mastra, LangChain, LlamaIndex, or CrewAI — where reliability, cost, and latency were yours to define and manage

  • Working knowledge of HIPAA compliance and what it means to run infrastructure responsibly in a healthcare context

  • Experience at a Series A or early-stage startup where you built the reliability function from scratch rather than inheriting one

Wisdom is an equal opportunity employer. We provide employment opportunities without regard to age, race, color, ancestry, national origin, religion, disability, sex, gender identity or expression, sexual orientation, veteran status, or any other protected status in accordance with applicable law.

Skills Required

  • 8+ years running production systems with staff/principal scope
  • Proven experience operating high-availability services and leading incident response
  • Deep AWS or GCP experience deploying, operating, and debugging distributed services
  • Infrastructure as Code using Terraform
  • Containers and orchestration (ECS or Kubernetes)
  • CI/CD and release engineering (fast, reversible deploys)
  • Observability and alerting experience (tracing, metrics, logging), including Datadog
  • Hands-on production experience operating at least one major LLM API (OpenAI, Anthropic, Google Vertex AI)
  • Strong command of TypeScript/JavaScript and ability to read/fix application code
  • Deep experience with relational databases (Postgres) including connection management and performance under load
  • Experience defining SLOs, error budgets, and reliability operating standards
  • Ability to mentor, set standards, and influence cross-team technical decisions
  • Hands-on experience with Docker and containerized deployments
  • Experience with HIPAA compliance and running infrastructure responsibly in healthcare contexts
  • Experience operating LLM/agentic systems in production or frameworks like Mastra, LangChain, LlamaIndex, CrewAI
  • Experience at a Series A or early-stage startup building reliability functions from scratch
  • Python or Go programming experience

The Company
HQ: New York, NY
43 Employees
Year Founded: 2023

What We Do

Wisdom is a full-service billing company that allows dental practices to outsource their insurance collections, insurance verification, and patient billing processes. We combine AI technology, proprietary data, and highly trained billing professionals to deliver the most efficient and effective billing service possible. We empower practices by taking away the pain of dealing with insurance carriers and providing professional revenue management so that dental teams can focus on delivering awesome dentistry.
