Data Scientist - Evaluations, Chanakya

Posted Yesterday
Bengaluru, Bengaluru Urban, Karnataka, IND
In-Office
Mid level
Artificial Intelligence • Software
The Role
Design and maintain domain-specific evaluation frameworks for LLMs and AI systems: define quality metrics, run pre/post-deployment evaluation cycles, build dashboards, find failure modes and distribution shifts, operationalize automated eval pipelines with MLOps, manage datasets and annotation workflows, and publish quality reports.
Summary Generated by Built In
About Sarvam

Sarvam is building the bedrock of Sovereign AI for India. The company is developing India's full-stack sovereign AI platform, building across research, models, infrastructure and applications with a singular focus on making AI genuinely work for India. Sarvam works with leading enterprises and public institutions and is backed by Lightspeed, Peak XV, and Khosla Ventures. Sarvam partners with India's leading brands, including Tata Capital, SBI Life, CRED, IDFC, and LIC.

 

About the Role

This is an intellectually demanding role. The Data Scientist anchors the evaluations function for this vertical — designing, building, and maintaining evaluation frameworks that measure model and system quality in operational context. You are not running standard benchmarks. You are building domain-specific eval harnesses for high-stakes use cases where a wrong answer carries real consequences.

You will work closely with the MLOps Engineer, the PM, and the deployment team. The evaluations you design are the mechanism by which the team determines whether what we've built is good enough to deploy — and whether it stays good after deployment.

 

What You'll Do

•  Design and build evaluation frameworks for Sarvam's AI outputs across domain-specific requirements: document comprehension, command summarisation, geospatial reasoning, enterprise workflow automation, and others as they emerge

•  Define quality metrics in collaboration with domain experts and clients; translate operational requirements into measurable, defensible signals

•  Run structured evaluation cycles pre- and post-deployment; build dashboards that surface model quality in production

•  Identify failure modes, edge cases, and distribution shifts — with the bias of someone looking for what's wrong, not confirming what's right

•  Collaborate with the MLOps Engineer to operationalise eval pipelines — automated, triggered by deployment events, versioned, and reproducible

•  Build and manage domain-specific datasets for fine-tuning, evaluation, and benchmarking — including human annotation workflows where needed

•  Publish internal findings and quality reports that feed the product and engineering roadmap

 

What We're Looking For

•  3–6 years in data science, ML research, or applied AI; at least 2 years working with LLMs in production contexts

•  Strong statistics and probability fundamentals — you understand what makes an evaluation valid and what makes it misleading

•  Experience designing evaluation frameworks from scratch: custom metrics, inter-rater reliability, red-teaming methodologies

•  Python proficiency; comfort with pandas, NumPy, HuggingFace datasets, RAGAS, EleutherAI Eval Harness, LangSmith, or equivalent

•  Experience with prompt engineering, model fine-tuning, or RLHF in applied settings

•  Ability to work with unstructured domain data: PDFs, doctrine documents, transcripts, and field reports

 

Bonus Points

•  Prior work in high-stakes domains (healthcare, legal, defence, finance) where output quality carries real-world consequences

•  Red-teaming or adversarial evaluation experience

 

Note: We are looking for people who can own the outcomes described here, not people who match every line of this specification. If this problem excites you and you believe you can do this work, we want to hear from you.

 

Why Sarvam?

Sarvam is a fast-moving, high talent-density team building full-stack AI for India, working on problems that push the frontiers of AI with real population-scale impact.

•  Work alongside researchers, engineers, builders, and business leaders who move fast and hold each other to a very high bar

•  High ownership and high impact, from day one

•  Everything we do is AI-first, from the way we build and ship to the way we think about problems

•  You can work on problems that could change how an entire country learns, works, and communicates

 

If you want to work on problems at the frontier of AI in India, Sarvam is the place to be.

Skills Required

  • 3-6 years in data science, ML research, or applied AI; at least 2 years working with LLMs in production contexts
  • Strong statistics and probability fundamentals
  • Experience designing evaluation frameworks from scratch (custom metrics, inter-rater reliability, red-teaming methodologies)
  • Python proficiency
  • Experience with pandas and NumPy
  • Experience with HuggingFace Datasets
  • Experience with RAGAS
  • Experience with EleutherAI Eval Harness
  • Experience with LangSmith or equivalent evaluation/observability tools
  • Experience with prompt engineering, model fine-tuning, or RLHF in applied settings
  • Ability to work with unstructured domain data (PDFs, doctrine documents, transcripts, field reports)
  • Prior work in high-stakes domains (healthcare, legal, defence, finance)
  • Red-teaming or adversarial evaluation experience

The Company
HQ: Bangalore, Karnataka
50 Employees
Year Founded: 2023

What We Do

We are an AI/ML research and development company on a mission to build reliable, performant, enterprise-grade AI systems at scale for India. We are committed to building the full stack for generative AI for the rich and diverse landscape of India, investing mainly in three areas:

1) Models: developing both efficient large-scale Indic language models and bespoke enterprise models
2) Platform: building an enterprise-grade platform that empowers organisations to develop and ship creative and performant genAI applications at scale
3) Ecosystem: contributing to open-source models and datasets, and leading efforts for large-scale data curation in the public-good space
