Human Data Evals Lead (Remote/US/LATAM)

Posted 5 Days Ago
2 Locations
Remote
Senior level
Artificial Intelligence • Edtech • Machine Learning • Professional Services
The Role
Lead Anyone AI's data initiatives: create proposals, design frontier-grade sample packages and benchmarks, recruit and calibrate subject-matter experts, manage lab relationships, and deliver pilots end-to-end with rigorous QC to convert pilots into production.

Reports to: CEO

Owns: data proposals, sample development, quality, and pilot delivery

Location: Remote / LATAM / US


The role

You will own Anyone AI’s data initiatives and proposals to AI labs, from the initial proposal or response to a lab request through pilot delivery. You own how we build proposals and develop sample packages and benchmarks: frontier-grade packages across reasoning, coding, agents, tool use, multimodal, and other domains, produced in collaboration with subject-matter experts, with expert-verified ground truth, multi-model headroom results, and QC that survives buyer-side scrutiny. You are the person who designs the sample that demonstrates our quality and converts pilots into production engagements. On a small team, this is the operational center of the Human Data Division.

Responsibilities
  • Proposals & requests. Study public benchmarks and eval targets, and turn them into proposals and sample packages that demonstrate capability and win the work. Respond to lab data requests and pilots.

  • Sample & benchmark development. Design and build the sample packages, working with subject-matter experts. Every package meets the bar of our current sample set:

    • Expert-verified, exact-match-checkable ground truth and gold reasoning trajectories.

    • Multi-model evaluation showing real headroom, and proof that the task discriminates between models, not just that it is hard.

    • Rigorous QC structure, including calibration layers, severity-weighted rubrics, deterministic verifiers, and evidence maps.

  • Subject-matter experts. Recruit, brief, calibrate, and review a pool of experts across coding, agentic/tool-use, and STEM/reasoning. Raise their output to our standard and keep it there; be the arbiter of what "correct" and "frontier-difficulty" mean.

  • Lab relationships. Be a direct point of contact for lab partners on Slack and calls, with support from the CEO and the wider team. Keep senior lab contacts informed, surface what they actually need, and pull in the CEO and subject-matter experts when the conversation calls for it.

  • Pilot delivery. Own pilots end to end: scoping, SOW, staffing, production, QC, and delivery. Nothing ships before it's lab-ready, and nothing comes back rejected as "not frontier-level" without us already knowing why.

Experience
  • Originated data or benchmark proposals for AI labs, translated eval targets into sample tasks that demonstrate capability, and owned the engagement through delivery.

  • Deep evaluation and quality expertise: LLM benchmarking, with real strength in code-model evaluation.

  • Built QC processes and artifact standards that met enterprise or lab requirements, and set a quality bar a team of experts was held to.

  • Thrives in ambiguous, fast-moving environments where the rules are still being written, and delivers under pressure.

Qualifications

  • 5+ years in technical delivery, quality, or program management, with recent experience in AI/ML data, model evaluation, or benchmarking.

  • Hands-on experience delivering data or evaluation work to AI labs or enterprise ML teams, from scoping through delivery.

  • Working fluency with how frontier models are evaluated: benchmarks, rubrics, pass rates, headroom, and what makes a task discriminate between models.

  • Proven people/vendor leadership: you've recruited, calibrated, and held a team or expert pool to a quality standard.

  • Fluent English. Spanish is a nice-to-have.

Skills Required

  • Originated data or benchmark proposals for AI labs and owned engagements through delivery
  • Deep evaluation and quality expertise in LLM benchmarking, especially code-model evaluation
  • Built QC processes and artifact standards meeting enterprise or lab requirements
  • 5+ years in technical delivery, quality, or program management with recent AI/ML data or model evaluation experience
  • Hands-on experience scoping and delivering data/evaluation work to AI labs or enterprise ML teams
  • Working fluency with frontier model evaluation concepts: benchmarks, rubrics, pass rates, headroom, discrimination
  • Proven people/vendor leadership: recruiting, calibrating, and holding experts to a quality standard
  • Fluent English
  • Spanish (nice to have)

The Company
0 Employees
Year Founded: 2022

What We Do

Anyone AI is an edtech startup dedicated to bridging the AI talent gap by investing in software developers from Latin America. The company provides intensive, hands-on training programs in Machine Learning and Artificial Intelligence, led by industry experts. By combining technical skill development with employability support, Anyone AI prepares professionals for global career opportunities, helping them transition into high-impact roles within the rapidly evolving AI and technology sectors.
