Researcher: Multimodal (Data)

Sorry, this job was removed at 06:08 p.m. (CST) on Thursday, Sep 11, 2025
San Francisco, CA
In-Office
Artificial Intelligence • Software
The Role
About Cartesia

Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text—1B text tokens, 10B audio tokens and 1T video tokens—let alone do this on-device.

We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team pairs deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.

We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors and 90+ angels across many industries, including the world's foremost experts in AI.

What You'll Do

  • Lead the design, creation, and optimization of datasets for training and evaluating multimodal models across audio, text, video, and images.

  • Develop strategies for curating, aligning, and augmenting multimodal datasets to address challenges in synchronization, variability, and scalability.

  • Design innovative methods for data augmentation, synthetic data generation, and cross-modal sampling to enhance the diversity and robustness of datasets.

  • Create datasets tailored for specific multimodal tasks, such as audio-visual speech recognition, text-to-video generation, or cross-modal retrieval, with attention to real-world deployment needs.

  • Collaborate closely with researchers and engineers to ensure datasets are optimized for target architectures, training pipelines, and task objectives.

  • Build scalable pipelines for multimodal data processing, annotation, and validation to support research and production workflows.

What You'll Bring

  • Expertise in multimodal data curation and processing, with a deep understanding of challenges in combining diverse data types like audio, text, images, and video.

  • Proficiency in tools and libraries for handling specific modalities, such as librosa (audio), OpenCV (video), and Hugging Face (text).

  • Familiarity with data alignment techniques, including time synchronization for audio and video, embedding alignment for cross-modal learning, and temporal consistency checks.

  • Strong understanding of multimodal dataset design principles, including methods for ensuring data diversity, sufficiency, and relevance for targeted applications.

  • Programming expertise in Python and experience with frameworks like PyTorch or TensorFlow for building multimodal data pipelines.

  • Comfort with large-scale data processing and distributed systems for multimodal dataset storage, processing, and management.

  • A collaborative mindset with the ability to work cross-functionally with researchers, engineers, and product teams to align data strategies with project goals.

Nice-to-Haves

  • Experience in creating synthetic multimodal datasets using generative models, simulation environments, or advanced augmentation techniques.

  • Background in annotating and aligning multimodal datasets for tasks such as audio-visual speech recognition, video-captioning, or multimodal reasoning.

  • Early-stage startup experience or a proven track record of building datasets for cutting-edge research in fast-paced environments.

Our culture

🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.

🚢 We ship fast. All of our work is novel and cutting edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality and design along the way.

🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.

Our perks

🍽 Lunch, dinner and snacks at the office.

🏥 Fully covered medical, dental, and vision insurance for employees.

🏦 401(k).

✈️ Relocation and immigration support.

🦖 Your own personal Yoshi.

The Company
33 Employees
Year Founded: 2023

What We Do

Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Try Sonic at https://play.cartesia.ai and join our Discord at https://discord.com/invite/gAbbHgdyQM.