Research Scientist, Data

Posted 5 Hours Ago
Palo Alto, CA, USA
In-Office
185K-400K Annually
Senior level
Information Technology
The Role
Design, build, and own large-scale data pipelines and ML data curation for multimodal (text, image, audio, video) model training. Partner with research teams to ingest, label, filter, augment, store, and ensure dataset quality, compliance, and scalability. Optimize distributed data processing for training, prototype dataset tools, and integrate research advances into production systems.
Summary Generated by Built In
About the Role

At Pika, we are pioneering the next generation of creative infrastructure built around real-time, multimodal generation and intelligent agentic platforms. We are looking for a staff- or lead-level Research Engineer, Data to architect and scale the data engineering systems that support training of our advanced multimodal foundation models. This pivotal role will strengthen our research teams by building, optimizing, and owning large-scale data pipelines and robust ML data curation, ensuring our foundation models have access to the highest-quality and most diverse datasets. If you are passionate about powerful data infrastructure and innovative research engineering, join us to make an impact for millions of creators.

 
What You’ll Do
  • Take ownership of large-scale data pipeline architecture and implementation to support model training and research workflows for text, image, audio, and video datasets

  • Partner with research and engineering teams to curate, clean, and manage diverse, sensory-rich datasets for pre-training and mid-training of multimodal models

  • Develop strategies and tools for scalable data ingestion, labeling, filtering, augmentation, and storage

  • Ensure data quality, reliability, and compliance, including managing privacy and ethical considerations throughout the data lifecycle

  • Optimize data processing, transformation, and delivery for large-scale distributed training pipelines

  • Prototype and productionize new methods for dataset creation, management, and continuous improvement in response to researcher needs

  • Contribute to the integration of research-driven data advancements into production-ready systems

  • Stay informed on emerging data engineering and ML data management developments, bringing best practices to our systems

 
What We’re Looking For
  • 5+ years of experience building and scaling data pipelines for machine learning applications at a staff or lead engineer level, ideally in research or model training environments

  • Strong background in data engineering and ML data curation for LLMs, VLMs, or other large-scale multimodal models

  • Expertise in distributed data systems (e.g., Spark, Hadoop, Ray, or similar) and efficient large dataset processing/ETL workflows

  • Proven ability to build robust, scalable, and production-grade data infrastructure for ML pipelines

  • Experience developing tools for data labeling, filtering, deduplication, quality assurance, and dataset management

  • Strong programming skills (Python, SQL, PySpark, or similar) and familiarity with cloud data platforms (AWS, GCP, Azure)

  • Knowledge of privacy, compliance, ethics, and best practices in data collection and management

  • Excellent cross-functional collaboration, problem-solving, and communication skills

  • Passion for enabling cutting-edge generative AI and creative technology through data excellence

 
What We Offer
  • Competitive salary and substantial equity in a high-growth startup

  • Full health benefits, 401(k) matching, and more

  • Collaborative, mission-driven team environment with major growth opportunities

  • Flexible on-site/remote hybrid (HQ in Palo Alto, CA)

 
About Pika

Pika empowers creators by building state-of-the-art agentic and multimedia platforms. Our vision is to break down technical barriers to creativity, making real-time generative and intelligent orchestration accessible to all. Join us and help shape the next evolution of creative technology!

 

If you are a data-driven research engineer excited to lead and scale the data infrastructure powering real-time multimodal foundation models, we want to hear from you.


The Company
29 Employees
Year Founded: 2023

What We Do

An idea-to-video platform that brings your creativity to motion
