Research Data Engineer

Smallest.ai

Posted 2 Days Ago

Bengaluru, Bengaluru Urban, Karnataka, IND

In-Office

Mid level

Artificial Intelligence • Conversational AI • Generative AI

The Role

Build high-throughput streaming and batch pipelines for audio, text, and multimodal data; clean, deduplicate, normalize and curate multilingual datasets; design ML and heuristic filtering, active learning and data selection loops; create scalable evaluation/benchmarking pipelines and tooling to accelerate research iterations for ASR, TTS, and conversational systems.

Summary Generated by Built In

Research Data Engineer (India) — Smallest.ai
About the Role

This is not a typical data engineering role. You won’t be building dashboards. You won’t be maintaining pipelines no one touches.

You will take messy, noisy, real-world data and turn it into something models can learn from. Think of it as running a gold mine: you take dust and turn it into gold.

We work on speech, language, and real-time systems across 50+ languages.
The difference between a good model and a great one is almost always data quality and data systems. That’s where you come in.

What You’ll Work On

Data Pipelines (Real-time + Batch)
- Build high-throughput pipelines for audio, text, and multimodal data
- Streaming + offline processing at scale
Data Quality & Curation
- Cleaning, filtering, deduplication, normalization (numbers, emails, code-mix, etc.)
- Designing heuristics + ML-based data filtering systems
Multilingual Data Systems
- Handling 50+ languages, accents, and code-mixed inputs
- Language-aware normalization and segmentation
Training Data Engine
- Build pipelines that continuously generate better training data from production
- Active learning loops, data selection, sampling strategies
Evaluation & Benchmarking Pipelines
- Create scalable eval datasets across languages and domains
- Automate quality tracking for ASR, TTS, and conversational systems
Data Infra for Research
- Work closely with the research team to unblock experiments fast
- Build systems that reduce iteration time from weeks → hours

What This Role Is NOT

- Not a dashboard/reporting role
- Not a “move data from A to B” role
- Not a maintenance-heavy legacy pipeline role

What We’re Looking For

- Strong fundamentals in data structures, systems, and pipelines
- Experience with large-scale data processing (audio/text preferred)
- Comfortable with messy, unstructured, real-world data
- Strong coding skills (Python required; systems experience is a plus)
- Understanding of ML/data pipelines (training, eval, data curation)

Bonus (Not Mandatory)

- Experience with speech/audio data (ASR/TTS)
- Familiarity with multilingual datasets
- Experience with streaming systems (Kafka, etc.)
- Exposure to data-centric AI / data quality frameworks

How We Work

- Speed over perfection
- Production over papers
- Systems that scale, not scripts that barely work
- Tight loop between data → model → eval → improvement

Who This Is For

- You enjoy working with raw, chaotic data
- You care about data quality more than tooling hype
- You like building systems that directly impact model performance
- You get excited by turning unusable data into competitive advantage

Why Join Us

We’re building real-time, multilingual voice AI systems.

At this level, models are only as good as the data behind them.

If you want to work on the layer that actually moves the needle, this is it.

The Company

10 Employees

Year Founded: 2023
