Fundamental

Research Data Scientist

Barcelona, Cataluña, ESP

In-Office

Mid level

Artificial Intelligence • Software

The Role

The Research Data Scientist will enhance ML model development by identifying data sources, building ETL pipelines, and collaborating on data storage solutions.

Summary Generated by Built In

About Fundamental

Fundamental is an AI company pioneering the future of enterprise decision-making. Founded by DeepMind alumni, Fundamental has developed NEXUS – the world's most powerful Large Tabular Model (LTM) – purpose-built for the structured records that actually drive enterprise decisions. Backed by world-class investors and trusted by Fortune 100 companies, Fundamental unlocks trillions of dollars of value by giving businesses the Power to Predict.

At Fundamental, you'll work on unprecedented technical challenges in foundation model development and build technology that transforms how the world's largest companies make decisions. This is your opportunity to be part of a category-defining company from the ground up. Join the team defining the future of enterprise AI.

Key responsibilities

As part of the Research team, you will contribute to the development of breakthrough machine learning models by working on one of the most important frontiers in model training and evaluation: high-quality real and synthetic data.

This role is especially focused on synthetic data generation, Structural Causal Models (SCMs), and realistic simulation-based data sources. You will help us design, evaluate, and scale datasets that capture the structure, dependencies, and edge cases needed to train foundation models for enterprise tabular data.

The main responsibilities of this role are:

Identifying, characterizing, and evaluating high-value data sources for training and evaluating ML models, including real-world data, synthetic data, SCM-generated data, and physical or systems-based simulator outputs
Designing and analysing synthetic data generation approaches based on Structural Causal Models, probabilistic models, simulators, and other mechanisms that capture realistic relationships between variables
Working with researchers to define what makes a synthetic dataset useful, realistic, diverse, causally meaningful, and appropriate for model training or evaluation
Building tools and workflows to generate, validate, benchmark, and iterate on synthetic datasets at scale
Developing metrics and evaluation procedures for synthetic data quality
Transforming structured, unstructured, simulated, and causally generated data into formats suitable for training and evaluating large-scale ML models
Collaborating with the research team to maintain a reliable, efficient training pipeline where data quality, data diversity, and synthetic data generation are critical components
Collaborating with the wider engineering and infrastructure team to ensure data generation and processing workflows are scalable, reproducible, and robust

Must have

Experience with:

Synthetic data generation for machine learning, especially for structured or tabular data
Structural Causal Models, causal graphs, causal inference, probabilistic modelling, or simulation-based data generation
Identifying and evaluating high-quality data sources to train and evaluate ML models, including both real-world and realistic synthetic data sources
Bringing data from structured and unstructured sources, simulators, causal models, or generative processes into formats accessible by ML models
Designing quantitative analyses to assess data quality, realism, diversity, bias, coverage, and downstream model performance

Strong fundamentals in:

Statistics, probability, and applied machine learning
Data science workflows, including exploratory analysis, feature understanding, validation, and experimental design
Software engineering for research-grade and production-grade data workflows

Strong knowledge of:

Python data processing and scientific computing stack, including numpy, pandas, scipy, scikit-learn, or similar tools

Familiarity with:

Causal modelling, graphical models, probabilistic programming, agent-based simulation, discrete-event simulation, or physical / systems-based simulators
Data storage and data versioning solutions
Classical machine learning and deep learning methods, especially outside of purely LLM-based workflows

Nice to have

Contributions to open source ML, causal inference, synthetic data, simulation, or data science projects
BSc, MSc, or PhD in computer science, machine learning, statistics, mathematics, physics, engineering, economics, or another quantitative field
Experience working with tabular data, predictive analytics, or enterprise decision-making systems
Experience building or evaluating synthetic datasets for model training
Experience with SCM libraries, probabilistic programming frameworks, simulation environments, or custom data generation pipelines

Benefits

Competitive compensation with salary and equity
Comprehensive health coverage for you and your dependents
Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
Relocation support for employees moving to join the team in one of our office locations
A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action

Skills Required

Experience with identifying data sources for ML models
Experience with structured and unstructured data access
Strong fundamentals of software engineering
Strong knowledge of Python
Knowledge of Python data processing stack (numpy, pandas)
Familiarity with distributed processing (e.g., Ray, Dask, Spark, Beam)
Familiarity with data storage solutions
Basic ML knowledge

The Company

HQ: Sparks, MD

54 Employees

Year Founded: 2024

What We Do

For decades, companies have relied on archaic tools to inform decisions and make bets on the future. Until now. Fundamental empowers businesses to turn gambles into guarantees and determine their future with far greater accuracy than ever before. Built by DeepMind alumni and trusted by Fortune 100 enterprises, NEXUS is our most powerful Large Tabular Model (LTM). By revealing the hidden language of tables, NEXUS unlocks trillions of dollars of value by giving businesses the Power to Predict™.