LLM Data Engineer | United States | Fully Remote

Posted 21 Days Ago
Hiring Remotely in United States
Remote
Mid level
Digital Media • Software
The Role
The LLM Data Engineer will design, implement, and maintain data pipelines for Generative AI platforms, focusing on techniques like Supervised Fine Tuning and Reinforcement Learning from Human Feedback. Responsibilities include data source integration, optimizing workflows, managing various vector store technologies, and collaborating with teams to ensure data quality for AI/ML models.

Description

We are seeking an experienced AI/LLM Data Engineer to build and maintain the data pipeline for our Generative AI platform. The ideal candidate will be well-versed in the latest Large Language Model (LLM) technologies and have a strong background in data engineering, with a focus on Retrieval-Augmented Generation (RAG) and knowledge-base techniques. This role sits in the AI COE within DX Tech & Digital. As an AI/LLM Data Engineer, you will report to the Director, AI Solutions & Development, who oversees the AI COE.

You will work on highly visible strategic projects, collaborating with cross-functional teams to define requirements and deliver high-quality AI solutions.

The ideal candidate will have a passion for Generative AI and LLMs, with a proven track record of delivering innovative AI applications.

Responsibilities 
• Design, implement, and maintain an end-to-end multi-stage data pipeline for LLMs, including Supervised Fine Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) data processes 
• Identify, evaluate, and integrate diverse data sources and domains to support the Generative AI platform 
• Develop and optimize data processing workflows for chunking, indexing, ingestion, and vectorization for both text and non-text data 
• Benchmark and implement various vector stores, embedding techniques, and retrieval methods 
• Create a flexible pipeline supporting multiple embedding algorithms, vector stores, and search types (e.g., vector search, hybrid search) 
• Implement and maintain auto-tagging systems and data preparation processes for LLMs 
• Develop tools for text and image data crawling, cleaning, and refinement 
• Collaborate with cross-functional teams to ensure data quality and relevance for AI/ML models 
• Work with data lakehouse architectures to optimize data storage and processing
• Integrate and optimize workflows using Snowflake and various vector store technologies 

Requirements

• Master's degree in Computer Science, Data Science, or a related field 
• 3-5 years of work experience in data engineering, preferably in AI/ML contexts 
• Proficiency in Python, JSON, HTTP, and related tools 
• Strong understanding of LLM architectures, training processes, and data requirements 
• Experience with RAG systems, knowledge base construction, and vector databases 
• Familiarity with embedding techniques, similarity search algorithms, and information retrieval concepts 
• Hands-on experience with data cleaning, tagging, and annotation processes (both manual and automated) 
• Knowledge of data crawling techniques and associated ethical considerations 
• Strong problem-solving skills and ability to work in a fast-paced, innovative environment 
• Familiarity with Snowflake and its integration in AI/ML pipelines 
• Experience with various vector store technologies and their applications in AI 
• Understanding of data lakehouse concepts and architectures 
• Excellent communication and collaboration skills
• Ability to translate business needs into technical solutions
• Passion for innovation and a commitment to ethical AI development
• Experience building LLM pipelines using frameworks such as LangChain, LlamaIndex, Semantic Kernel, or OpenAI functions
• Familiarity with LLM parameters such as temperature, top-k, and repeat penalty, and with metrics and methodologies for evaluating LLM output

Preferred Skills

  • Experience with popular LLM/RAG frameworks
  • Familiarity with distributed computing platforms (e.g., Apache Spark, Dask) 
  • Knowledge of data versioning and experiment tracking tools 
  • Experience with cloud platforms (AWS, GCP, or Azure) for large-scale data processing 
  • Understanding of data privacy and security best practices 
  • Practical experience implementing data lakehouse solutions 
  • Proficiency in optimizing queries and data processes in Snowflake or Databricks
  • Hands-on experience with different vector store technologies

Benefits

  • U.S. employee benefits package

Top Skills

Python
The Company
HQ: New York, New York
223 Employees
On-site Workplace
Year Founded: 2006

What We Do

Halo believes in innovation by inclusion to solve digital problems. Our interdisciplinary teams of adventurous designers, developers and entrepreneurial minds explore to discover together. Our variety of backgrounds, viewpoints, and skills connect to solve business challenges of every shape and size. Founded in 2006 as an international agency specializing in interactive media strategy and development, we embrace curiosity and passion in a serious way. We believe in betterment through partnership, forming impactful collaborations with our clients so they can do the same with their audience. Working at Halo feels like belonging on the playground as we take an intellectual journey together.
