We are seeking a talented MLOps Engineer to take full ownership of the AI pipelines that power our computer vision models. You’ll design and operate a containerized training/inference stack that runs both on a local GPU workstation cluster (multiple workstations with multiple GPUs each) and in Google Cloud. Your mission is to streamline the entire model lifecycle—from data ingestion and feature build, through training, evaluation, packaging, deployment, and monitoring—so researchers and engineers can iterate quickly and ship reliable models to production.
You will build robust orchestration and observability around our pipelines, implement resource-aware scheduling for heterogeneous queues, and lead the rollout of model/experiment tracking and performance analytics. You’ll also own the evolution of our documentation to ensure the platform is easy to understand, extend, and support.
Responsibilities
- Own the end-to-end ML pipeline for computer vision: data prep, training, evaluation, model packaging, artifact/version management, deployment, and monitoring (local GPU cluster + GCP).
- Design and maintain containerized workflows for multi-GPU training and distributed workloads (e.g., PyTorch DDP, Ray, or similar).
- Build and operate orchestration (e.g., Airflow/Argo/Kubeflow/Ray Jobs) for scheduled and on-demand pipelines across on-prem and cloud.
- Implement and tune resource allocation strategies based on current and upcoming task queues (GPU/CPU/memory-aware scheduling; preemption/priority; autoscaling).
- Introduce and integrate monitoring/telemetry for:
  - job health and failure analysis (retry, backoff, alerts),
  - data/feature drift and model performance (precision/recall, latency, throughput),
  - infra metrics (GPU utilization, memory, I/O, cost).
- Harden GCP environments (permissions, networks, registries, storage) and optimize for reliability, performance, and cost (spot/managed instance groups, autoscaling).
- Establish model governance: experiment tracking, model registry, promotion gates, rollbacks, and audit trails.
- Standardize CI/CD for ML (data/feature pipelines, model builds, tests, and canary/blue-green rollouts).
- Collaborate with CV researchers/engineers to productionize new models and improve training throughput & inference SLAs.
- Continuously improve documentation: update existing pipeline docs and produce concise runbooks, diagrams, and “how-to” guides.
Requirements
- Hands-on MLOps experience building and running ML pipelines at scale (preferably computer vision) across on-prem GPUs and a public cloud (GCP preferred).
- Strong with Docker and Docker Compose in local and cloud environments; solid understanding of image build optimization and artifact caching.
- GitLab CI/CD expertise (modular templates, YAML optimization, build/test stages for ML, environment promotion).
- Proficiency with Python and Bash for pipeline tooling, glue code, and automation; Terraform for infra-as-code (GCP resources, IAM, networking, storage).
- Experience with orchestration: one or more of Airflow, Argo Workflows, Kubeflow, Ray, or Prefect.
- Experience operating GPU workloads: NVIDIA driver/CUDA stack, container runtimes, device plugins (k8s), multi-GPU training, utilization tuning.
- Observability & monitoring for ML and infra: Prometheus/Grafana, OpenTelemetry/Loki (or similar) for metrics, logs, traces; alerting and SLOs.
- Experiment tracking / model registry with tools like MLflow or Weights & Biases (runs, params, artifacts, metrics, registry/promotion).
- Data versioning & validation: DVC/lakeFS (or similar), Great Expectations/whylogs, schema checks, drift detection.
- Cloud services: GCP (Compute Engine, GKE or Autopilot, Cloud Run, Artifact Registry, Cloud Storage, Pub/Sub). Equivalent AWS/Azure experience is acceptable.
- Security & compliance for ML stacks: secrets management, SBOM/image scanning, least-privilege IAM, network policies, artifact signing.
- Solid understanding of containerized deployment patterns (blue-green/canary), rollout strategies, and rollback safety.
- Kubernetes & Helm in production; NVIDIA GPU Operator, node labeling/taints, and MIG partitioning.
- Ray/Dask for distributed training/inference and hyperparameter sweeps.
- Feature stores (e.g., Feast) and streaming features (Pub/Sub/Kafka).
- Inference serving frameworks: TorchServe, Triton Inference Server, FastAPI + Uvicorn/Gunicorn, or Vertex AI endpoints.
- Batch & real-time pipelines: Apache Beam/Dataflow, Spark, or Flink.
- Cost optimization playbook on GCP: preemptibles/spot, autoscaling policies, right-sizing, per-project budget alerts.
- Testing for ML: pytest fixtures for data/model tests, golden datasets, regression tests, property-based tests.
- Experience with service proxies (Traefik/Nginx), DNS management, certificate management, and SSL/TLS automation.
- Familiarity with edge/embedded deployments for CV models is a plus.
Benefits
We believe great work starts with feeling valued and supported. That's why we are building a thoughtful, competitive set of benefits and perks to help you thrive, both professionally and personally, through every step of your career with us. You will be eligible for:
- Salary from 2,500 EUR to 5,500 EUR per month (before taxes)
- A Birthday Gift
After the probationary period:
- Health Insurance
- Health Recovery Days (which can be taken as needed)
- Paid Study Leave
- Funding for the purchase of vision glasses after one (1) year of service
Join us in building a cleaner, smarter future, one quality process improvement at a time.
What We Do
Aerones is an innovative company that has developed robotic technology for wind turbine blade maintenance services, such as:
• Conductivity measurements and troubleshooting;
• Drainage hole cleaning;
• External inspection of the wind turbine blades;
• Internal inspection of the blades;
• Blade & Tower cleaning;
• Coating application on the leading edges;
• Leading-edge repair.
The technology in use is controlled remotely. In addition, it is compact and easily transportable.
Aerones is the first company in the world to provide these services using robotic technology: because the maintenance process does not require technicians to work at dangerous heights, it is much safer and more efficient, and it significantly reduces turbine downtime.







