MLOps Engineer

Rīga
In-Office
2.5K-5.5K Monthly
Mid level
Robotics
The Role
The MLOps Engineer will own the ML pipeline for computer vision, design containerized workflows, maintain orchestration for pipelines, and enhance model governance and CI/CD processes.

We are seeking a talented MLOps Engineer to take full ownership of the AI pipelines that power our computer vision models. You’ll design and operate a containerized training/inference stack that runs both on a local GPU workstation cluster (multiple workstations with multiple GPUs each) and in Google Cloud. Your mission is to streamline the entire model lifecycle—from data ingestion and feature build, through training, evaluation, packaging, deployment, and monitoring—so researchers and engineers can iterate quickly and ship reliable models to production.

You will build robust orchestration and observability around our pipelines, implement resource-aware scheduling for heterogeneous queues, and lead the rollout of model/experiment tracking and performance analytics. You’ll also own the evolution of our documentation to ensure the platform is easy to understand, extend, and support.

Responsibilities
  • Own the end-to-end ML pipeline for computer vision: data prep, training, evaluation, model packaging, artifact/version management, deployment, and monitoring (local GPU cluster + GCP).
  • Design and maintain containerized workflows for multi-GPU training and distributed workloads (e.g., PyTorch DDP, Ray, or similar); a minimal training-entrypoint sketch follows this list.
  • Build and operate orchestration (e.g., Airflow/Argo/Kubeflow/Ray Jobs) for scheduled and on-demand pipelines across on-prem and cloud.
  • Implement and tune resource allocation strategies based on current and upcoming task queues (GPU/CPU/memory-aware scheduling; preemption/priority; autoscaling).
  • Introduce and integrate monitoring/telemetry for:
    • job health and failure analysis (retry, backoff, alerts),
    • data/feature drift and model performance (precision/recall, latency, throughput),
    • infra metrics (GPU utilization, memory, I/O, cost).
  • Harden GCP environments (permissions, networks, registries, storage) and optimize for reliability, performance, and cost (spot/managed instance groups, autoscaling).
  • Establish model governance: experiment tracking, model registry, promotion gates, rollbacks, and audit trails.
  • Standardize CI/CD for ML (data/feature pipelines, model builds, tests, and canary/blue-green rollouts).
  • Collaborate with CV researchers/engineers to productionize new models and improve training throughput & inference SLAs.
  • Continuously improve documentation: update existing pipeline docs and produce concise runbooks, diagrams, and “how-to” guides.
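
For illustration, the containerized multi-GPU training workflows above (see the PyTorch DDP bullet) usually reduce to an entrypoint along the following lines. This is a minimal sketch, assuming PyTorch DistributedDataParallel launched via torchrun inside a container; the model, dataset, and hyperparameters are placeholders rather than anything specific to this role.

  # Minimal DDP training entrypoint sketch (assumes launch via torchrun,
  # which sets RANK / LOCAL_RANK / WORLD_SIZE for every worker process).
  import os

  import torch
  import torch.distributed as dist
  from torch.nn.parallel import DistributedDataParallel as DDP
  from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


  def main() -> None:
      dist.init_process_group(backend="nccl")
      local_rank = int(os.environ["LOCAL_RANK"])
      torch.cuda.set_device(local_rank)

      # Placeholder CV-style model and synthetic data; a real pipeline would pull
      # a versioned dataset (e.g. via DVC) and a proper model definition.
      model = torch.nn.Sequential(
          torch.nn.Conv2d(3, 16, 3, padding=1),
          torch.nn.ReLU(),
          torch.nn.AdaptiveAvgPool2d(1),
          torch.nn.Flatten(),
          torch.nn.Linear(16, 2),
      ).cuda(local_rank)
      model = DDP(model, device_ids=[local_rank])

      data = TensorDataset(torch.randn(256, 3, 64, 64), torch.randint(0, 2, (256,)))
      sampler = DistributedSampler(data)
      loader = DataLoader(data, batch_size=32, sampler=sampler)

      optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
      loss_fn = torch.nn.CrossEntropyLoss()

      for epoch in range(2):
          sampler.set_epoch(epoch)  # reshuffle shards across workers each epoch
          for images, labels in loader:
              images, labels = images.cuda(local_rank), labels.cuda(local_rank)
              optimizer.zero_grad()
              loss = loss_fn(model(images), labels)
              loss.backward()
              optimizer.step()

      dist.destroy_process_group()


  if __name__ == "__main__":
      main()

In practice such an entrypoint is baked into a container image and launched by the orchestrator with one process per GPU, e.g. torchrun --nproc_per_node=<gpus> train.py.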

Requirements
  • Hands-on MLOps experience building and running ML pipelines at scale (preferably computer vision) across on-prem GPUs and a public cloud (GCP preferred).
  • Strong with Docker and Docker Compose in local and cloud environments; solid understanding of image build optimization and artifact caching.
  • GitLab CI/CD expertise (modular templates, YAML optimization, build/test stages for ML, environment promotion).
  • Proficiency with Python and Bash for pipeline tooling, glue code, and automation; Terraform for infra-as-code (GCP resources, IAM, networking, storage).
  • Experience with orchestration: one or more of Airflow, Argo Workflows, Kubeflow, Ray, or Prefect.
  • Experience operating GPU workloads: NVIDIA driver/CUDA stack, container runtimes, device plugins (k8s), multi-GPU training, utilization tuning.
  • Observability & monitoring for ML and infra: Prometheus/Grafana, OpenTelemetry/Loki (or similar) for metrics, logs, traces; alerting and SLOs.
  • Experiment tracking / model registry with tools like MLflow or Weights & Biases (runs, params, artifacts, metrics, registry/promotion); see the sketch after this list.
  • Data versioning & validation: DVC/lakeFS (or similar), Great Expectations/whylogs, schema checks, drift detection.
  • Cloud services: GCP (Compute Engine, GKE or Autopilot, Cloud Run, Artifact Registry, Cloud Storage, Pub/Sub). Equivalent AWS/Azure experience is acceptable.
  • Security & compliance for ML stacks: secrets management, SBOM/image scanning, least-privilege IAM, network policies, artifact signing.
  • Solid understanding of containerized deployment patterns (blue-green/canary), rollout strategies, and rollback safety.
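
As a concrete illustration of the experiment-tracking and registry-promotion requirement above, here is a minimal sketch assuming MLflow; the tracking URI, experiment and model names, and metric values are hypothetical placeholders.

  # Sketch of run logging plus a registry promotion gate with MLflow.
  import mlflow
  from mlflow.tracking import MlflowClient

  mlflow.set_tracking_uri("http://mlflow.internal:5000")  # assumed tracking server
  mlflow.set_experiment("cv-blade-inspection")            # hypothetical experiment

  with mlflow.start_run(run_name="baseline"):
      mlflow.log_params({"lr": 1e-3, "batch_size": 32, "epochs": 20})
      mlflow.log_metrics({"val_precision": 0.91, "val_recall": 0.88})
      # A trained model would be logged and registered in the same step, e.g.:
      # mlflow.pytorch.log_model(model, "model", registered_model_name="blade-detector")

  # Promotion gate: advance the newest registered version once it clears checks.
  client = MlflowClient()
  versions = client.get_latest_versions("blade-detector", stages=["None"])
  if versions:
      client.transition_model_version_stage(
          name="blade-detector",
          version=versions[0].version,
          stage="Staging",  # "Production" only after canary/review gates pass
      )

The promotion gates, rollbacks, and audit trails named under Responsibilities map naturally onto these stage transitions (or onto model version aliases in newer MLflow releases).
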
Good to Have
  • Kubernetes & Helm in production; NVIDIA GPU Operator, node labeling/taints, and MIG partitioning.
  • Ray/Dask for distributed training/inference and hyperparameter sweeps.
  • Feature stores (e.g., Feast) and streaming features (Pub/Sub/Kafka).
  • Inference serving frameworks: TorchServe, Triton Inference Server, FastAPI + Uvicorn/Gunicorn, or Vertex AI endpoints (see the serving sketch after this list).
  • Batch & real-time pipelines: Apache Beam/Dataflow, Spark, or Flink.
  • Cost optimization playbook on GCP: preemptibles/spot, autoscaling policies, right-sizing, per-project budget alerts.
  • Testing for ML: pytest fixtures for data/model tests, golden datasets, regression tests, property-based tests.
  • Experience with service proxies (Traefik/Nginx), DNS management, certificate management, and SSL/TLS automation.
  • Familiarity with Edge/embedded deployments for CV models a plus.
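
To make the FastAPI + Uvicorn serving option above concrete, here is a minimal sketch; the service name, model artifact path, and preprocessing are assumptions, and a production deployment would sit behind the canary/blue-green rollout and monitoring requirements listed earlier.

  # Minimal CV inference endpoint sketch (FastAPI + Uvicorn).
  import io

  import torch
  import uvicorn
  from fastapi import FastAPI, File, UploadFile
  from PIL import Image
  from torchvision import transforms

  app = FastAPI(title="cv-inference")  # hypothetical service name

  # Assumed TorchScript artifact pulled from the model registry at deploy time.
  model = torch.jit.load("model.ts").eval()

  preprocess = transforms.Compose([
      transforms.Resize((224, 224)),
      transforms.ToTensor(),
  ])


  @app.post("/predict")
  async def predict(file: UploadFile = File(...)) -> dict:
      image = Image.open(io.BytesIO(await file.read())).convert("RGB")
      batch = preprocess(image).unsqueeze(0)
      with torch.no_grad():
          logits = model(batch)
      return {"class_id": int(logits.argmax(dim=1).item())}


  if __name__ == "__main__":
      uvicorn.run(app, host="0.0.0.0", port=8080)

TorchServe, Triton, or Vertex AI endpoints would replace this pattern where heavier batching, model ensembles, or managed scaling are needed.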

Benefits

We believe great work starts with feeling valued and supported. That’s why we are building a thoughtful, competitive set of benefits and perks to help you thrive, professionally and personally, through every step of your career with us. You will be eligible for:

  • Salary from 2,500 EUR to 5,500 EUR per month (before taxes)
  • A Birthday Gift

After Probationary Period

  • Health Insurance
  • Health Recovery Days (which can be taken as needed)
  • Paid Study Leave
  • Funding for the purchase of Vision Glasses after one (1) year of service

Join us in Building a Cleaner, Smarter Future — one quality process improvement at a time.

Top Skills

AI Pipelines
Apache Beam
Bash
Containerized Training
Docker
DVC
Flink
GCP
GitLab CI/CD
Grafana
Helm
Kubernetes
MLflow
MLOps
OpenTelemetry
Orchestration
Prometheus
Python
Spark
Terraform

The Company
HQ: San Jose, CA
133 Employees
Year Founded: 2018

What We Do

Aerones is an innovative company that has developed robotic technology for wind turbine blade maintenance services, such as:
• Conductivity measurements and troubleshooting;
• Drainage hole cleaning;
• External inspection of the wind turbine blades;
• Internal inspection of the blades;
• Blade & Tower cleaning;
• Coating application on the leading edges;
• Leading-edge repair.

The technology is controlled remotely, and it is compact and easily transportable.
Aerones is the first company in the world to provide these services using robotic technology: the maintenance process does not require technicians to work at dangerous heights, making it safer and more efficient while significantly reducing turbine downtime.


