EvolutionaryScale

Data Platform Engineer Lead (AI)

Posted 6 Months Ago

2 Locations

In-Office

Senior level

Artificial Intelligence • Software

EvolutionaryScale’s mission is to develop artificial intelligence to understand biology for the benefit of human health and society, through open, safe, and responsible research, and in partnership with the scientific community. Over the next ten years AI will transform biological design, making molecules and entire cells programmable. We will develop the foundation models for biology that enable this.

To continue to move the field forward in this emerging area, we prioritize individuals who have shown excellence and creativity in their respective domains over specific domain expertise. Having both biology and AI expertise is great, but not a requirement.

Our team does both deep research and product development, not only building the frontier biological AI models in the field but also putting them in the hands of the researchers at the forefront of the life sciences. This fundamentally requires elite engineers and scientists working together to solve big research and product challenges. We are building a world-class multidisciplinary team spanning AI research, engineering, biology research, and business roles, which requires strong communication and collaboration across roles.

The EvolutionaryScale team is based in two locations: San Francisco and New York. We believe in flexibility around work schedules and locations, but we expect team members to work from one of our two offices at least half of the days in most weeks.

The Role

As our Data Platform Lead Engineer, you'll own the architecture and execution of EvolutionaryScale's data platform - the backbone that powers training, evaluation, and discovery across our models. You'll build reliable, scalable, and transparent pipelines that process biological data at unprecedented scale and ensure every dataset - pre-training or post-training - is high-quality, reproducible, and traceable. You'll collaborate closely with bioinformatics, modeling, research, and infrastructure teams to design systems that enable our scientists and modelers to move faster, experiment more effectively, and generate insight from massive biological data.

Architect and operate large-scale data processing pipelines (batch + streaming) for pre- and post-training biology datasets - covering raw sequence, structure, and model-generated data.
Build and evolve our data platform: data lakes/lakehouses, metadata and lineage systems, feature stores, and orchestration frameworks.
Define and implement data cataloging, governance, and versioning practices that ensure full reproducibility and traceability across datasets.
Collaborate with researchers and ML engineers to translate modeling requirements into robust data systems that optimize throughput, reliability, and cost.
Establish best practices for data CI/CD, observability, infrastructure-as-code, fault tolerance, and data quality monitoring.
Continuously explore and integrate emerging technologies - Ray, Spark, Flink, modern data mesh approaches - to keep our stack state-of-the-art.

Preferred qualifications

Senior-level engineer with 3+ years (ideally 5+) of experience designing and scaling large-scale data infrastructure.
Deep experience with distributed data frameworks such as Spark, Ray, or Flink for large-volume, high-throughput processing.
Foundation in data platform concepts: metadata stores, lineage tracking, orchestration tools, schema evolution, dataset versioning.
Skilled in debugging, performance optimization, and building observability into data systems.
Collaborative mindset - you can partner effectively with scientists, ML engineers, and infrastructure teams.
Excited by the chance to define new standards for data infrastructure in a domain that could reshape medicine and biology.
Experience with major cloud providers (AWS, GCP, or Azure), including familiarity with data warehousing tools, is a plus.
Knowledge of large-scale distributed systems, machine learning, biology, and biological datasets is a plus.

What success looks like

Datasets for training and evaluation are reliable, reproducible, and lineage-aware.
Model and research teams can move faster thanks to self-service, high-throughput data pipelines.
Our data infrastructure scales efficiently with compute and storage demands as model size and scope grow.

The salary range for this position is $150,000 to $350,000 per year, plus a competitive equity package. The compensation package will vary based on job-related skills, experience, and knowledge. It also includes comprehensive medical, dental, and vision benefits.