Research Data Engineer

Posted 2 Days Ago
Bengaluru, Bengaluru Urban, Karnataka, IND
In-Office
Mid level
Artificial Intelligence • Conversational AI • Generative AI
The Role
Build high-throughput streaming and batch pipelines for audio, text, and multimodal data; clean, deduplicate, normalize and curate multilingual datasets; design ML and heuristic filtering, active learning and data selection loops; create scalable evaluation/benchmarking pipelines and tooling to accelerate research iterations for ASR, TTS, and conversational systems.
Summary Generated by Built In

Research Data Engineer (India) — Smallest.ai
About the Role

This is not a typical data engineering role. You won’t be building dashboards. You won’t be maintaining pipelines no one touches.

You will take messy, noisy, real-world data and turn it into something models can learn from. Think of this as running a gold mine: you take dust and convert it to gold.

We work on speech, language, and real-time systems across 50+ languages.
The difference between a good model and a great one is almost always data quality + data systems. That’s where you come in.

What You’ll Work On
  • Data Pipelines (Real-time + Batch)

    • Build high-throughput pipelines for audio, text, and multimodal data

    • Streaming + offline processing at scale

  • Data Quality & Curation

    • Cleaning, filtering, deduplication, normalization (numbers, emails, code-mix, etc.)

    • Designing heuristics + ML-based data filtering systems

  • Multilingual Data Systems

    • Handling 50+ languages, accents, and code-mixed inputs

    • Language-aware normalization and segmentation

  • Training Data Engine

    • Build pipelines that continuously generate better training data from production

    • Active learning loops, data selection, sampling strategies

  • Evaluation & Benchmarking Pipelines

    • Create scalable eval datasets across languages and domains

    • Automate quality tracking for ASR, TTS, and conversational systems

  • Data Infra for Research

    • Work closely with the research team to unblock experiments fast

    • Build systems that reduce iteration time from weeks → hours

What This Role Is NOT
  • Not a dashboard/reporting role

  • Not a “move data from A to B” role

  • Not a maintenance-heavy legacy pipeline role

What We’re Looking For
  • Strong fundamentals in data structures, systems, and pipelines

  • Experience with large-scale data processing (audio/text preferred)

  • Comfortable with messy, unstructured, real-world data

  • Strong coding skills (Python required; systems experience is a plus)

  • Understanding of ML/data pipelines (training, eval, data curation)

Bonus (Not Mandatory)
  • Experience with speech/audio data (ASR/TTS)

  • Familiarity with multilingual datasets

  • Experience with streaming systems (Kafka, etc.)

  • Exposure to data-centric AI / data quality frameworks

How We Work
  • Speed over perfection

  • Production over papers

  • Systems that scale, not scripts that barely work

  • Tight loop between data → model → eval → improvement

Who This Is For
  • You enjoy working with raw, chaotic data

  • You care about data quality more than tooling hype

  • You like building systems that directly impact model performance

  • You get excited by turning unusable data into competitive advantage

Why Join Us

We’re building real-time, multilingual voice AI systems.

At this level, models are only as good as the data behind them.

If you want to work on the layer that actually moves the needle, this is it.

Skills Required

  • Strong fundamentals in data structures, systems, and pipelines
  • Experience with large-scale data processing (audio/text preferred)
  • Comfortable working with messy, unstructured, real-world data
  • Strong coding skills in Python
  • Understanding of ML/data pipelines (training, evaluation, data curation)
  • Experience with speech/audio data (ASR/TTS)
  • Familiarity with multilingual datasets and language-aware normalization
  • Experience with streaming systems (e.g., Kafka)
  • Exposure to data-centric AI / data quality frameworks

The Company
10 Employees
Year Founded: 2023
