Research Data Scientist

Barcelona, Cataluña, ESP
In-Office
Mid level
Artificial Intelligence • Software
The Role
The Research Data Scientist will support ML model development by identifying and evaluating data sources, designing synthetic data generation approaches, and building scalable data workflows.
Summary Generated by Built In
About Fundamental

Fundamental is an AI company pioneering the future of enterprise decision-making. Founded by DeepMind alumni, Fundamental has developed NEXUS – the world's most powerful Large Tabular Model (LTM) – purpose-built for the structured records that actually drive enterprise decisions. Backed by world-class investors and trusted by Fortune 100 companies, Fundamental unlocks trillions of dollars of value by giving businesses the Power to Predict.

At Fundamental, you'll work on unprecedented technical challenges in foundation model development and build technology that transforms how the world's largest companies make decisions. This is your opportunity to be part of a category-defining company from the ground up. Join the team defining the future of enterprise AI.

Key responsibilities

As part of the Research team, you will contribute to the development of breakthrough machine learning models by working on one of the most important frontiers in model training and evaluation: high-quality real and synthetic data.

This role is especially focused on synthetic data generation, Structural Causal Models (SCMs), and realistic simulation-based data sources. You will help us design, evaluate, and scale datasets that capture the structure, dependencies, and edge cases needed to train foundation models for enterprise tabular data.

The main responsibilities of this role are:

  • Identifying, characterizing, and evaluating high-value data sources for training and evaluating ML models, including real-world data, synthetic data, SCM-generated data, and physical or systems-based simulator outputs

  • Designing and analysing synthetic data generation approaches based on Structural Causal Models, probabilistic models, simulators, and other mechanisms that capture realistic relationships between variables

  • Working with researchers to define what makes a synthetic dataset useful, realistic, diverse, causally meaningful, and appropriate for model training or evaluation

  • Building tools and workflows to generate, validate, benchmark, and iterate on synthetic datasets at scale

  • Developing metrics and evaluation procedures for synthetic data quality

  • Transforming structured, unstructured, simulated, and causally generated data into formats suitable for training and evaluating large-scale ML models

  • Collaborating with the research team to maintain a reliable, efficient training pipeline where data quality, data diversity, and synthetic data generation are critical components

  • Collaborating with the wider engineering and infrastructure team to ensure data generation and processing workflows are scalable, reproducible, and robust

Must have

Experience with:

  • Synthetic data generation for machine learning, especially for structured or tabular data

  • Structural Causal Models, causal graphs, causal inference, probabilistic modelling, or simulation-based data generation

  • Identifying and evaluating high-quality data sources to train and evaluate ML models, including both real-world and realistic synthetic data sources

  • Bringing data from structured and unstructured sources, simulators, causal models, or generative processes into formats accessible by ML models

  • Designing quantitative analyses to assess data quality, realism, diversity, bias, coverage, and downstream model performance

Strong fundamentals in:

  • Statistics, probability, and applied machine learning

  • Data science workflows, including exploratory analysis, feature understanding, validation, and experimental design

  • Software engineering for research-grade and production-grade data workflows

Strong knowledge of:

  • The Python data processing and scientific computing stack, including NumPy, pandas, SciPy, scikit-learn, and similar tools

Familiarity with:

  • Causal modelling, graphical models, probabilistic programming, agent-based simulation, discrete-event simulation, or physical / systems-based simulators

  • Data storage and data versioning solutions

  • Classical machine learning and deep learning methods, especially outside of purely LLM-based workflows

Nice to have

  • Contributions to open source ML, causal inference, synthetic data, simulation, or data science projects

  • BSc, MSc, or PhD in computer science, machine learning, statistics, mathematics, physics, engineering, economics, or another quantitative field

  • Experience working with tabular data, predictive analytics, or enterprise decision-making systems

  • Experience building or evaluating synthetic datasets for model training

  • Experience with SCM libraries, probabilistic programming frameworks, simulation environments, or custom data generation pipelines

Benefits

  • Competitive compensation with salary and equity

  • Comprehensive health coverage for you and your dependents

  • Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys

  • Relocation support for employees moving to join the team in one of our office locations

  • A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action

Skills Required

  • Experience with identifying data sources for ML models
  • Experience with structured and unstructured data access
  • Strong fundamentals of software engineering
  • Strong knowledge of Python
  • Knowledge of Python data processing stack (numpy, pandas)
  • Familiarity with distributed processing (e.g., Ray, Dask, Spark, Beam)
  • Familiarity with data storage solutions
  • Basic ML knowledge

The Company
HQ: Sparks, MD
54 Employees
Year Founded: 2024

What We Do

For decades, companies have relied on archaic tools to inform decisions and make bets on the future. Until now. Fundamental empowers businesses to turn gambles into guarantees and determine their future with far greater accuracy than ever before. Built by DeepMind alumni and trusted by Fortune 100 enterprises, NEXUS is our most powerful Large Tabular Model (LTM). By revealing the hidden language of tables, NEXUS unlocks trillions of dollars of value by giving businesses the Power to Predict™.
