Staff Engineer, AI Evals

2 Locations
In-Office
Senior level
Artificial Intelligence • Software • Generative AI • Automation
The Role
As a Staff Engineer, AI Evals, you will design and operate evaluation systems for AI agents, ensuring quality and reliability through meaningful metrics and feedback loops.
Summary Generated by Built In
The Opportunity

At Sema4.ai, we’re building an Enterprise AI Agent platform that fundamentally changes how knowledge work gets done—by enabling people and AI agents to collaborate in durable, trustworthy ways.

As a Staff Engineer, AI Evals, you’ll design and own the evaluation systems that determine whether our agents are actually good: correct, reliable, efficient, and improving over time. You’ll build the measurement backbone that guides model choice, agent design, product decisions, and customer trust.

This is an early, high-impact role. You’ll be defining how we measure success for AI agents in production, where ambiguity is real and ground truth can be messy. We’re looking for an engineer who brings rigor, judgment, and strong opinions about what “good” looks like, and who knows how to operationalize it.

Who You Are

AI Systems & Evaluation Expert

You understand that AI systems are only as good as the way they’re measured. You’ve worked with LLMs and agentic systems in production and have seen how offline benchmarks, synthetic data, and human judgment can all fail in different ways. You know how to design evaluations that are meaningful, repeatable, and decision-useful, not just theoretically impressive.

You’re familiar with the sharp edges: non-determinism, prompt drift, regression risk, overfitting, data leakage, and the tension between fast iteration and statistical rigor.

In-Depth Technologist

You stay close to research and industry practice in evaluation, alignment, and reliability. You understand where automated metrics work, where they break down, and how to combine them with human review, golden datasets, and production signals. You bring creativity to building evaluation sets and scenarios, and in sourcing (or synthesizing) the data you need.

Builder With High Standards

You care deeply about correctness, clarity, and operational behavior. You can move fast, but you don’t confuse speed with rigor. You design eval systems that engineers trust, product relies on, and leadership uses to make decisions. You know when to build custom infrastructure and when to leverage existing tools without outsourcing critical thinking.

What You’ll Do

Build and Own the Evaluation Platform

Design, build, and operate Sema4.ai’s core evaluation infrastructure for LLMs and agents: offline benchmarks, regression tests, task-level metrics, and production feedback loops. These systems will directly inform product launches, model upgrades, and customer requirements.

Define “Good” for Agents in Production

Work closely with agent, product, and field engineering teams to translate fuzzy goals around correctness, reliability, and usefulness into concrete, measurable signals. You’ll help define success criteria for new capabilities and ensure we can detect regressions before customers do.

Tackle Ambiguous, High-Leverage Problems

Solve hard problems where the answer isn’t obvious:

  • How to evaluate long-running, multi-step agents

  • How to balance automated scoring with human judgment

  • How to measure improvement when tasks evolve

  • How to compare models under cost and latency constraints

Influence Technical and Product Direction

Use evaluation results to guide architectural decisions, model selection, and roadmap tradeoffs. You’ll participate in design reviews, set technical standards for eval rigor, mentor other engineers, and help interview senior technical candidates.

What You Bring

  • 7+ years of software engineering experience, including 2+ years building AI/ML systems in production

  • Deep experience with backend systems in Python, including data pipelines, observability, and reliability

  • Hands-on experience evaluating LLM-based systems (agents, retrieval, tool use, workflows, etc.)

  • Strong intuition for metrics, experimentation, and failure analysis in non-deterministic systems

  • Strong communication skills: whether you’re talking to colleagues, customers, or machines, you communicate clearly, concisely, and collaboratively

  • A high-ownership mindset: you care deeply about the integrity of the systems you build and the decisions they inform

Skills Required

  • 7+ years of software engineering experience
  • 2+ years building AI/ML systems in production
  • Deep experience with backend systems in Python
  • Hands-on experience evaluating LLM-based systems
  • Strong intuition for metrics and experimentation
  • Strong communication skills
  • A high-ownership mindset

The Company
60 Employees
Year Founded: 2024

What We Do

Sema4.ai provides an enterprise AI agent platform that enables organizations to build, run, and manage AI agents at scale, transforming knowledge work and automating complex tasks.

