AIRoA (AI Robot Association)

R&D-022 Data Engineer

Tokyo, JPN

In-Office

Senior level

Artificial Intelligence • Information Technology • Robotics

The Role

Design and implement large-scale data pipelines for robotics foundation models, ensuring efficient data processing, quality, and collaboration with researchers.

About AIRoA

The AI Robot Association (AIRoA) is launching a groundbreaking initiative: collecting one million hours of humanoid robot operation data with hundreds of robots, and leveraging it to train the world’s most powerful Vision-Language-Action (VLA) models.

What makes AIRoA unique is not only the unprecedented scale of real-world data and humanoid platforms, but also our commitment to making everything open and accessible. We are building a shared “robot data ecosystem” where datasets, trained models, and benchmarks are available to everyone. Researchers around the world will be able to evaluate their models on standardized humanoid robots through our open evaluation platform.

For researchers, this means an opportunity to:

Work on fundamental challenges in robotics and AI: multimodal learning, tactile-rich manipulation, sim-to-real transfer, and large-scale benchmarking.
Access state-of-the-art infrastructure: hundreds of humanoid robots, GPU clusters, high-fidelity simulators, and a global-scale evaluation pipeline.
Collaborate with leading experts across academia and industry, and publish results that will shape the next decade of robotics.
Contribute to an initiative that will redefine the future of embodied AI—with all results made open to the world.

Key Responsibilities

You will play a critical role in building the data backbone powering next-generation robotics foundation models:

Design and implement large-scale data pipelines that cover the full lifecycle of high-quality datasets for robotics foundation models—collection, processing, curation, and publishing.
Design, build, and maintain data schemas, storage solutions, and query interfaces to enable VLA researchers to efficiently discover, query, and consume curated datasets.
Collaborate closely with VLA researchers to capture evolving data requirements and continuously improve data pipelines through analysis and experimentation.
Design and scale distributed data-processing pipelines capable of handling petabyte-scale multimodal datasets (e.g., RGB/Depth, point clouds) with full lineage and reproducibility.
Define data-quality metrics and build feedback loops to continuously monitor and improve data quality.

Requirements

Required Qualifications

【1. Academic & Professional】

Master’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
5+ years of professional experience in data engineering / data platform development.
Proven record of delivering production-grade, distributed data systems.

【2. ETL / Distributed Data Processing】

3+ years designing and operating large-scale ETL / ELT pipelines using Spark, Flink, Ray, or a similar distributed engine.
Hands-on experience designing pipelines with orchestration tools (Airflow, Kedro, Dagster).
Proven track record of optimizing workloads at 10 TB+/day scale.

【3. Lakehouse / Storage Architecture】

Designed or led implementations using Delta Lake, Apache Iceberg, or Hudi.
Integrated with Trino, Athena, Databricks SQL, or Glue/Unity Catalog.
Defined schema evolution, ACID compliance, partitioning strategy, time travel, and cost-performance optimization.
Managed metadata, lineage, and catalog governance.
Equivalent experience (e.g., BigQuery-based warehouse with versioned schema management) will also be recognized.

【4. Data Modeling / Quality / Governance】

Built bronze/silver/gold data layer structures with dbt or equivalent.
Defined and enforced data quality SLAs (freshness, completeness, accuracy).
Experience with Great Expectations, DataHub, OpenMetadata, or Monte Carlo.
Implemented schema versioning, audit logging, and lineage tracking.
Designed and owned data access control and catalog taxonomy.

【5. Domain Understanding & Business Value】

Collaborated with product / analytics / AI teams to align platform design with business KPIs.
Quantified platform impact (e.g., ↓30% compute cost, ↑3× query performance).
Can explain how architecture decisions drive measurable business outcomes.

Preferred Qualifications

Experience working with terabyte or petabyte-scale datasets.
Expertise in data lake storage systems such as Apache Iceberg or Delta Lake with query systems such as Trino and catalog systems such as Nessie.
Expertise in distributed processing frameworks like Spark, Flink, or Ray.
Expertise in workflow tools such as Airflow, Kedro, or Dagster.
Experience in analyzing, monitoring, and managing data quality.

Other (language requirements, etc.)

【Highly appreciated】 English proficiency at business level; Japanese proficiency a plus.

Benefits

There are currently no comparable projects in the world that collect data and develop foundation models on such a large scale. This is one of Japan’s leading national projects, supported by a substantial investment of 20.5 billion yen from NEDO.

This position will play a crucial role in determining the success of the project. You will have broad discretion and responsibility, and we are confident that, if successful, you will gain both a great sense of achievement and the opportunity to make a meaningful contribution to society.

Furthermore, we strongly encourage engineers to actively build their careers through this project—for example, by publishing research papers and engaging in academic activities.

●Work location

Tokyo Ryutsu Center A Bldg. AW4-5, 6-1-1 Heiwajima, Ota-ku, Tokyo 143-0006, Japan
