Machine Learning Engineer, Platform

2 Locations
In-Office
Mid level
Artificial Intelligence • Information Technology • Software
The Role
As a Machine Learning Engineer at AION, you'll design and implement LLMOps pipelines, optimize models, manage experiments, and collaborate on ML deployments. Your role will focus on refining LLMs through effective fine-tuning and evaluation strategies while ensuring production readiness and operational efficiency.
Summary Generated by Built In
About AION

AION is building an interoperable AI cloud platform by transforming the future of high-performance computing (HPC) through its decentralized AI cloud. Purpose-built for bare-metal performance, AION democratizes access to compute and provides managed services, aiming to be an end-to-end AI lifecycle platform—taking organizations from data to deployed models using its forward-deployed engineering approach.

AI is transforming every business around the world, and the demand for compute is surging like never before. AION strives to be the gateway for dynamic compute workloads by building integration bridges with diverse data centers around the world and re-inventing the compute stack via its state-of-the-art serverless technology. We stand at the crossroads where enterprises are finding it hard to balance AI adoption with security. At AION, we take enterprise security and compliance very seriously and are re-thinking every piece of infrastructure, from hardware and network packets to APIs.

Led by high-pedigree founders with previous exits, AION is well-funded by major VCs with strategic global partnerships. Headquartered in the US with global presence, the company is building its initial core team in India/UK.

Who You Are

You're a hands-on ML engineer with 4-6 years of experience building and fine-tuning large language models (LLMs) and transformer-based models. You're execution-focused and thrive on solving challenging problems at the intersection of machine learning research and production systems.

You're comfortable working across the ML development lifecycle—from data preparation and model fine-tuning to evaluation and optimization. You understand both what makes a model perform well and how to systematically improve model quality through experimentation. Experience with LLM fine-tuning (LoRA, QLoRA), RLHF pipelines, and comprehensive model evaluation is highly desirable. You bring strong ownership, initiative, and the drive to build production-ready ML models that impact thousands of developers globally.


Requirements

What You'll Do

ML Model Development & Optimization

  • Design and implement end-to-end LLMOps pipelines for model training, fine-tuning, and evaluation
  • Fine-tune and customize LLMs (Llama, Mistral, Gemma, etc.) using full fine-tuning and PEFT techniques (LoRA, QLoRA) with tools like Unsloth, Axolotl, and HuggingFace Transformers
  • Implement RLHF (Reinforcement Learning from Human Feedback) pipelines for model alignment and preference optimization
  • Design experiments for automated hyperparameter tuning, training strategies, and model selection
  • Prepare and validate training datasets—ensuring data quality, preprocessing, and format correctness
  • Build comprehensive model evaluation systems with custom metrics (BLEU, ROUGE, perplexity, accuracy) and develop synthetic data generation pipelines
  • Optimize model accuracy, token efficiency, and training performance through systematic experimentation
  • Design and maintain prompt engineering workflows with version control systems
  • Deploy models using vLLM with multi-adapter LoRA serving, hot-swapping, and basic optimizations (speculative decoding, continuous batching, KV cache management)

ML Operations & Technical Leadership

  • Set up ML-specific monitoring for model quality, drift detection, and performance tracking with automated retraining triggers
  • Manage model versioning, artifact storage, lineage tracking, and reproducibility using experiment tracking tools
  • Debug production model issues and optimize cost-performance trade-offs for training and inference
  • Partner with infrastructure engineers on ML-specific compute requirements and deployment pipelines
  • Document model development processes and share knowledge through internal tech talks

Technical Skills & Experience

If you meet some of these requirements and feel comfortable catching up on the others, we encourage you to apply:

  • 4-6 years of hands-on experience in machine learning engineering or applied ML roles
  • Strong fine-tuning experience with modern LLMs—practical knowledge of transformer architectures, attention mechanisms, and both full fine-tuning and PEFT techniques (LoRA/QLoRA)
  • Deep understanding of transformer model architectures including modern variants (MoE, Grouped-Query Attention, Flash Attention, state space models)
  • Production ML experience—you've built and fine-tuned models for real-world applications
  • Proficiency in Python and ML frameworks (PyTorch, HuggingFace Transformers, PEFT, TRL) with hands-on experience in tools like Unsloth and Axolotl
  • Experience building model evaluation systems with metrics like BLEU, ROUGE, perplexity, and accuracy
  • Hands-on experience with prompt engineering, synthetic data generation, and data preprocessing pipelines
  • Basic deployment experience with vLLM including multi-adapter serving, hot-swapping, and inference optimizations
  • Understanding of GPU computing—memory management, multi-GPU training, mixed precision, gradient accumulation
  • Strong debugging skills for training failures, OOM errors, convergence issues, and data quality problems
  • Experience with model alignment techniques (RLHF, DPO) and implementing RLHF pipelines is highly desirable
  • Experience with distributed training (DeepSpeed, FSDP, DDP) is a plus
  • Knowledge of model quantization techniques (GPTQ, AWQ) and their impact on model quality is desirable
  • Prior experience with AWS SageMaker, MLflow for experiment tracking, and Weights & Biases is a strong plus
  • Exposure to cloud platforms (AWS/GCP/Azure) for training workloads is beneficial
  • Familiarity with Docker containerization for reproducible training environments

Preferred Attributes

  • High ownership, self-driven, and a bias for action.
  • Strong strategic thinking and ability to connect technical decisions to business impact.
  • Excellent communication and mentoring skills.
  • Thrives in ambiguity, fast-paced environments, and early-stage startup culture.

Benefits

Why Join AION?

  • Work directly with high-pedigree founders shaping technical and product strategy.
  • Build infrastructure powering the future of AI compute globally.
  • Significant ownership and impact with equity reflective of your contributions.
  • Competitive compensation, flexible work options, and wellness benefits.

Apply Now:
If you’re a machine learning engineer ready to lead MLaaS (Machine Learning as a Service) architecture and scale next-generation AI infrastructure, we want to hear from you. Please share the following in the summary section:

  • Your resume, highlighting relevant projects and leadership experience.
  • Links to products, code (GitHub), or demos you’ve built.
  • A brief note on why AION’s mission excites you.

Top Skills

AWS SageMaker
Axolotl
Docker
HuggingFace Transformers
Python
PyTorch
Unsloth

The Company
HQ: San Francisco, California
21 Employees
Year Founded: 2023

What We Do

Everyday AI Platform: AION collapses the entire AI development lifecycle into a single, unified workspace. From data to deployment, everything is at your fingertips. AION simplifies AI infrastructure the way Stripe simplified payments:

Plug-and-Play Multi-Provider Access
Customer Infrastructure Management
Deploy and optimize AI infrastructure via prompts with integrated cost tracking and performance analytics
Partner Sales & Resource Optimization
Track opportunities with confidential pricing, manage real-time inventory allocation, and monitor profitability from AION workloads
