About the Role
This is not a typical data engineering role. You won’t be building dashboards. You won’t be maintaining pipelines no one touches.
You will take messy, noisy, real-world data and turn it into something models can learn from. Think of this as running a gold mine: you take dust and convert it to gold.
We work on speech, language, and real-time systems across 50+ languages.
The difference between a good model and a great one is almost always data quality and data systems. That’s where you come in.
Data Pipelines (Real-time + Batch)
Build high-throughput pipelines for audio, text, and multimodal data
Run streaming and offline processing at scale
Data Quality & Curation
Clean, filter, deduplicate, and normalize data (numbers, emails, code-mixed text, etc.)
Design heuristic and ML-based data filtering systems
Multilingual Data Systems
Handle 50+ languages, accents, and code-mixed inputs
Build language-aware normalization and segmentation
Training Data Engine
Build pipelines that continuously generate better training data from production
Design active learning loops, data selection, and sampling strategies
Evaluation & Benchmarking Pipelines
Create scalable eval datasets across languages and domains
Automate quality tracking for ASR, TTS, and conversational systems
Data Infra for Research
Work closely with the research team to unblock experiments fast
Build systems that reduce iteration time from weeks to hours
Not a dashboard/reporting role
Not a “move data from A to B” role
Not a maintenance-heavy legacy pipeline role
Strong fundamentals in data structures, systems, and pipelines
Experience with large-scale data processing (audio/text preferred)
Comfortable with messy, unstructured, real-world data
Strong coding skills (Python required; systems experience is a plus)
Understanding of ML/data pipelines (training, eval, data curation)
Experience with speech/audio data (ASR/TTS)
Familiarity with multilingual datasets
Experience with streaming systems (Kafka, etc.)
Exposure to data-centric AI / data quality frameworks
Speed over perfection
Production over papers
Systems that scale, not scripts that barely work
Tight loop between data → model → eval → improvement
You enjoy working with raw, chaotic data
You care about data quality more than tooling hype
You like building systems that directly impact model performance
You get excited by turning unusable data into competitive advantage
We’re building real-time, multilingual voice AI systems.
At this level, models are only as good as the data behind them.
If you want to work on the layer that actually moves the needle, this is it.
Skills Required
- Strong fundamentals in data structures, systems, and pipelines
- Experience with large-scale data processing (audio/text preferred)
- Comfortable working with messy, unstructured, real-world data
- Strong coding skills in Python
- Understanding of ML/data pipelines (training, evaluation, data curation)
- Experience with speech/audio data (ASR/TTS)
- Familiarity with multilingual datasets and language-aware normalization
- Experience with streaming systems (e.g., Kafka)
- Exposure to data-centric AI / data quality frameworks









