Advisor - Scientific Data Engineer

Reposted 23 Days Ago
San Francisco, CA, USA
In-Office
167K-266K Annually
Senior level
Healthtech • Biotech • Pharmaceutical
The Role
The Scientific Data Engineer will design data architecture, build ETL/ELT pipelines, implement data quality checks, and develop semantic layers for AI-consumable data products in healthcare research.
Summary Generated by Built In

At Lilly, we unite caring with discovery to make life better for people around the world. We are a global healthcare leader headquartered in Indianapolis, Indiana. Our employees around the world work to discover and bring life-changing medicines to those who need them, improve the understanding and management of disease, and give back to our communities through philanthropy and volunteerism. We give our best effort to our work, and we put people first. We’re looking for people who are determined to make life better for people around the world.

The Opportunity

We are building something unprecedented — an AI foundation that will push the frontier on what is possible today across drug discovery research, from target identification and disease biology through translational science.

The Applied Intelligence for Discovery (AI4D) team is a newly formed group within Lilly Research Laboratories that operates at the intersection of scientific delivery and core platform development. AI4D’s mission is to connect scientists to petabyte-scale data through natural language interfaces, automated analysis workflows, and intelligent search, and to convert early deployments into repeatable system standards and evaluation practices that scale across therapeutic areas.

As a Scientific Data Engineer, you will close the gap between how data is stored and how AI systems need to consume it. You will build the semantic layer, data harmonization infrastructure, AI-ready data products, and lakehouse architecture that bridge the two. You will work at the intersection of the data infrastructure team and the generative AI engineers who build the systems scientists interact with.

Responsibilities
Data Harmonization and Lakehouse Architecture
  • Design and build the data architecture that transforms raw and processed omics data into harmonized, AI-consumable layers
  • Build and optimize ETL/ELT pipelines that produce denormalized views, pre-computed aggregations, embedding-ready text representations, and feature stores optimized for AI system consumption
  • Implement data quality monitoring, automated profiling, and validation checks across harmonization layers
  • Create versioned, reproducible data snapshots that support model training, evaluation, and audit requirements in a regulated environment
  • Partner with other teams to extend harmonization patterns as data modalities expand beyond genomics and proteomics into spatial transcriptomics, perturbational data (Perturb-Seq), single-cell, and digital pathology
Semantic Layer and Schema Engineering
  • Design and maintain a semantic layer over Lilly’s multi-omics databases that enables AI systems
  • Create comprehensive schema documentation: table descriptions, column-level annotations, relationship mappings, business logic rules, and domain-specific constraints (e.g., statistical thresholds, unit conventions, experimental design metadata)
  • Develop gold-standard question/SQL pairs for each major database, in collaboration with computational biologists and Generative AI Engineers, to serve as training data, few-shot examples, and evaluation benchmarks
  • Build and maintain a data dictionary and ontology mapping layer that translates how scientists think and speak about data (gene names, pathway terms, assay types) into how the data is physically stored
AI-Ready Data Products
  • Build and manage vector embedding pipelines for scientific documents, study metadata, and structured data descriptions to power RAG-based retrieval
  • Build integration pipelines that connect heterogeneous data sources — omics databases, internal publications, electronic lab notebooks, assay results, and clinical annotations — into a unified, queryable layer
  • Develop and enforce metadata standards that ensure new data sources are AI-accessible from the point of ingestion, not retroactively
  • Design data products that serve multiple consumption patterns: direct SQL access for computational biologists, structured feeds for ML training pipelines, and semantic interfaces for LLM-powered tools
Qualifications
  • Bachelor's degree in Computer Science, Data Engineering, Bioinformatics, or a related field and 8 years of data engineering experience, OR a Master's degree and 5 years of data engineering experience
  • Demonstrated expertise in building data pipelines, ETL/ELT workflows, and data products that serve downstream AI/ML systems
Additional Skills/Preferences
  • PhD in data or a related field
  • Strong SQL skills and experience with complex relational database schemas (hundreds of tables, multi-level joins, domain-specific conventions)
  • Experience with modern data platform technologies, including at least one of: Databricks, Snowflake, or equivalent lakehouse platforms
  • Experience with modern data engineering tools: dbt, Spark, Airflow, or similar orchestration and transformation frameworks
  • Proficiency in Python for data processing, scripting, and pipeline development
  • Experience with cloud data platforms (AWS preferred: Redshift, Athena, Glue, S3, or similar)
  • Familiarity with at least one of: vector databases, embedding pipelines, or semantic layer tooling
  • Strong communication skills — you can work effectively with both engineers who think in schemas and scientists who think in biology
  • Experience with biomedical or scientific data: omics datasets (RNA-seq, proteomics, GWAS), clinical data, or laboratory information management systems
  • Experience in pharmaceutical, biotech, or life sciences environments
  • Familiarity with biomedical ontologies and controlled vocabularies (Gene Ontology, MeSH, ChEBI, HGNC) and their application to data integration
  • Experience building data products that serve AI/ML systems — feature stores, training datasets, evaluation benchmarks, or semantic annotations for text-to-SQL
  • Knowledge of data governance practices in regulated industries: data lineage, access controls, versioning, and auditability
  • Experience with knowledge graph technologies (Neo4j, Amazon Neptune, RDF/SPARQL) or graph-based data modeling
  • Deep experience with Databricks ecosystem: Unity Catalog for data governance, Delta Lake for ACID transactions, MLflow integration, and Databricks SQL for analytics workloads
  • Experience designing data architectures that bridge traditional bioinformatics workflows (Nextflow, R/Bioconductor) with modern lakehouse consumption patterns

Lilly is dedicated to helping individuals with disabilities to actively engage in the workforce, ensuring equal opportunities when vying for positions. If you require accommodation to submit a resume for a position at Lilly, please complete the accommodation request form (https://careers.lilly.com/us/en/workplace-accommodation) for further assistance. Please note this is for individuals to request an accommodation as part of the application process and any other correspondence will not receive a response.

Lilly is proud to be an EEO Employer and does not discriminate on the basis of age, race, color, religion, gender identity, sex, gender expression, sexual orientation, genetic information, ancestry, national origin, protected veteran status, disability, or any other legally protected status.


Our employee resource groups (ERGs) offer strong support networks for their members and are open to all employees. Our current groups include: Africa, Middle East, Central Asia Network; Black Employees at Lilly; Chinese Culture Network; Japanese International Leadership Network (JILN); Lilly India Network; Organization of Latinx at Lilly (OLA); PRIDE (LGBTQ+ Allies); Veterans Leadership Network (VLN); Women’s Initiative for Leading at Lilly (WILL); and enAble (for people with disabilities).

Actual compensation will depend on a candidate’s education, experience, skills, and geographic location. The anticipated wage for this position is $166,500 - $266,200.

Full-time equivalent employees also will be eligible for a company bonus (depending, in part, on company and individual performance). In addition, Lilly offers a comprehensive benefit program to eligible employees, including eligibility to participate in a company-sponsored 401(k); pension; vacation benefits; eligibility for medical, dental, vision and prescription drug benefits; flexible benefits (e.g., healthcare and/or dependent day care flexible spending accounts); life insurance and death benefits; certain time off and leave of absence benefits; and well-being benefits (e.g., employee assistance program, fitness benefits, and employee clubs and activities). Lilly reserves the right to amend, modify, or terminate its compensation and benefit programs in its sole discretion, and Lilly’s compensation practices and guidelines will apply regarding the details of any promotion or transfer of Lilly employees.

#WeAreLilly

The Company
HQ: Indianapolis, IN
39,451 Employees
Year Founded: 1876

What We Do

Eli Lilly and Company engages in the discovery, development, manufacture, and sale of products in the pharmaceutical products business segment. For more than a century, we have stayed true to a core set of values (excellence, integrity, and respect for people) that guide us in all we do: discovering medicines that meet real needs, improving the understanding and management of disease, and giving back to communities through philanthropy and volunteerism.
