Eval Engineer

Reposted 2 Days Ago
2 Locations
In-Office
Entry level
Blockchain • Web3
The Role
The Eval Engineer designs and runs evaluations of emerging AI technologies, builds evaluation frameworks, analyzes results, and publishes findings for the developer community.
About the company

Braintrust is the AI observability platform. By connecting evals and observability in one workflow, Braintrust gives builders the visibility to understand how AI behaves in production and the tools to improve it.

Teams at Notion, Stripe, Zapier, Vercel, and Ramp use Braintrust to compare models, test prompts, and catch regressions — turning production data into better AI with every release.

About the role

We’re hiring an Eval Engineer to design and run creative evaluations of new AI capabilities. Your job is to turn emerging AI ideas into measurable experiments and publish the results for the developer ecosystem.

When new models, agents, or frameworks appear, everyone has opinions about what works but few people actually test them. This role exists to change that.

You’ll design experiments that compare models, prompts, and agent architectures against real tasks. You’ll build the datasets, scoring logic, and evaluation harnesses. Then you’ll publish the results so builders understand what actually works.

This role sits at the intersection of engineering, experimentation, and technical storytelling.

What you’ll own

Industry evals
  • Design and run evaluations of new AI capabilities

  • Compare frontier models, agent systems, and tool workflows

  • Turn emerging ideas into measurable benchmarks

Eval design
  • Define datasets, tasks, and scoring logic for experiments

  • Design realistic workloads that reflect production environments

  • Create tests that expose failure modes and edge cases

Experiment implementation
  • Build evaluation harnesses using Braintrust

  • Run comparisons across models, prompts, and agent approaches

  • Analyze traces, outputs, and failure patterns
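As a rough illustration of the dataset, scoring logic, and model comparison described above, here is a minimal sketch in plain Python. It deliberately does not use the Braintrust SDK. Every name in it (Case, exact_match, run_eval, model_a, model_b) is hypothetical, and the two "models" are hard-coded stand-ins rather than real API calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    """One evaluation example: an input prompt and the expected answer."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 if output equals expected after trimming and lowercasing, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    task: Callable[[str], str],
    dataset: List[Case],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the task on every case and return the mean score."""
    if not dataset:
        raise ValueError("dataset must contain at least one case")
    scores = [scorer(task(case.prompt), case.expected) for case in dataset]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for real model calls (assumption: a model maps prompt -> text).
def model_a(prompt: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "paris", "opposite of hot": "Cold"}
    return answers.get(prompt, "")


def model_b(prompt: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "Lyon", "opposite of hot": "cold"}
    return answers.get(prompt, "")


DATASET = [
    Case("2 + 2", "4"),
    Case("capital of France", "Paris"),
    Case("opposite of hot", "cold"),
]

# Compare both stand-in models on the same dataset with the same scorer.
results: Dict[str, float] = {
    name: run_eval(model, DATASET, exact_match)
    for name, model in {"model_a": model_a, "model_b": model_b}.items()
}

if __name__ == "__main__":
    for name, score in results.items():
        print(f"{name}: {score:.2f}")
```

Holding the dataset and scorer fixed while swapping only the task is what lets a score difference be attributed to the model. A real harness would also record per-case outputs so failure patterns can be inspected.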

Creative test construction
  • Invent novel ways to stress test AI systems

  • Design scenarios that break agents, prompts, and model reasoning

  • Build adversarial or complex datasets that reveal weaknesses

Technical content
  • Write technical posts explaining evaluation methodology and results

  • Share datasets and scoring logic so experiments are reproducible

  • Help establish better evaluation patterns for the industry via courses

Evaluation playbooks
  • Develop reusable eval patterns for agents, RAG systems, and LLM apps

  • Create open source reference implementations developers can adopt

  • Contribute examples and guides that help teams build better evals

What great looks like
  • You’re an engineer who likes testing systems more than building features

  • You enjoy breaking things and understanding why they fail

  • You can design experiments that isolate meaningful differences between approaches

  • You understand how LLMs, agents, and RAG systems actually work

  • You write clearly for technical audiences

  • You ship experiments quickly and iterate often

  • You care about methodology and reproducibility

  • You’re curious, creative, and opinionated about how AI should be evaluated

What you’ve done
  • Built or contributed to evaluation systems for LLM or agent applications

  • Designed experiments comparing models, prompts, or AI architectures

  • Written Python code to run tests across models or APIs

  • Built datasets or scoring logic for AI quality measurement

  • Investigated model failures or unexpected behaviors

  • Published technical blog posts, research notes, or engineering write-ups

  • Built prototypes quickly to test ideas

If you want to help the industry understand how to measure AI systems and design the evaluations everyone else learns from, this is the role.

Benefits include
  • Medical, dental, and vision insurance

  • Daily lunch, snacks, and beverages

  • Flexible time off

  • Competitive salary and equity

  • AI Stipend

Equal opportunity

Braintrust is an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.

Top Skills

Python

The Company
HQ: San Francisco, CA
241 Employees
Year Founded: 2018

What We Do

Braintrust is the first decentralized Web3 talent network that connects skilled, vetted knowledge workers with the world’s leading companies. The people who rely on Braintrust to find work are the same people who own and build it, so the network always serves the needs of its users rather than those of a centrally controlled corporation. And because the community of knowledge workers and contributors earns ownership and control of Braintrust through its native BTRST token for their contributions to the network and its growth, new Talent and jobs have joined the network at record speed. Braintrust has more than 700,000 community members, with knowledge workers and project contributors across the world. Braintrust is trusted by hundreds of Fortune 1000 global enterprises including Nestlé, Porsche, Atlassian, Goldman Sachs, and Nike. For more information, visit: www.braintrust.com. BTRST is available on Coinbase.com and in the Coinbase Android and iOS apps. Coinbase customers can trade, send, receive, or store BTRST in most Coinbase-supported regions. For more information on Braintrust and the BTRST token, read the “Braintrust: The Decentralized Talent Network” whitepaper.

