We are seeking a talented MLOps Engineer to take full ownership of the AI pipelines that power our computer vision models. You’ll design and operate a containerized training/inference stack that runs both on a local GPU workstation cluster (multiple workstations with multiple GPUs each) and in Google Cloud. Your mission is to streamline the entire model lifecycle—from data ingestion and feature build, through training, evaluation, packaging, deployment, and monitoring—so researchers and engineers can iterate quickly and ship reliable models to production.
You will build robust orchestration and observability around our pipelines, implement resource-aware scheduling for heterogeneous queues, and lead the rollout of model/experiment tracking and performance analytics. You’ll also own the evolution of our documentation to ensure the platform is easy to understand, extend, and support.
Responsibilities
- Own the end-to-end ML pipeline for computer vision: data prep, training, evaluation, model packaging, artifact/version management, deployment, and monitoring (local GPU cluster + GCP).
- Design and maintain containerized workflows for multi-GPU training and distributed workloads (e.g., PyTorch DDP, Ray, or similar).
- Build and operate orchestration (e.g., Airflow/Argo/Kubeflow/Ray Jobs) for scheduled and on-demand pipelines across on-prem and cloud.
- Implement and tune resource allocation strategies based on current and upcoming task queues (GPU/CPU/memory-aware scheduling; preemption/priority; autoscaling).
- Introduce and integrate monitoring/telemetry for:
  - job health and failure analysis (retry, backoff, alerts),
  - data/feature drift and model performance (precision/recall, latency, throughput),
  - infra metrics (GPU utilization, memory, I/O, cost).
- Harden GCP environments (permissions, networks, registries, storage) and optimize for reliability, performance, and cost (spot/managed instance groups, autoscaling).
- Establish model governance: experiment tracking, model registry, promotion gates, rollbacks, and audit trails.
- Standardize CI/CD for ML (data/feature pipelines, model builds, tests, and canary/blue-green rollouts).
- Collaborate with CV researchers/engineers to productionize new models and improve training throughput & inference SLAs.
- Continuously improve documentation: update existing pipeline docs and produce concise runbooks, diagrams, and “how-to” guides.
Requirements
- Hands-on MLOps experience building and running ML pipelines at scale (preferably computer vision) across on-prem GPUs and a public cloud (GCP preferred).
- Strong with Docker and Docker Compose in local and cloud environments; solid understanding of image build optimization and artifact caching.
- GitLab CI/CD expertise (modular templates, YAML optimization, build/test stages for ML, environment promotion).
- Proficiency with Python and Bash for pipeline tooling, glue code, and automation; Terraform for infra-as-code (GCP resources, IAM, networking, storage).
- Experience with orchestration: one or more of Airflow, Argo Workflows, Kubeflow, Ray, or Prefect.
- Experience operating GPU workloads: NVIDIA driver/CUDA stack, container runtimes, device plugins (k8s), multi-GPU training, utilization tuning.
- Observability & monitoring for ML and infra: Prometheus/Grafana, OpenTelemetry/Loki (or similar) for metrics, logs, traces; alerting and SLOs.
- Experiment tracking / model registry with tools like MLflow or Weights & Biases (runs, params, artifacts, metrics, registry/promotion).
- Data versioning & validation: DVC/lakeFS (or similar), Great Expectations/whylogs, schema checks, drift detection.
- Cloud services: GCP (Compute Engine, GKE or Autopilot, Cloud Run, Artifact Registry, Cloud Storage, Pub/Sub). Equivalent AWS/Azure experience is acceptable.
- Security & compliance for ML stacks: secrets management, SBOM/image scanning, least-privilege IAM, network policies, artifact signing.
- Solid understanding of containerized deployment patterns (blue-green/canary), rollout strategies, and rollback safety.
- Kubernetes & Helm in production; NVIDIA GPU Operator, node labeling/taints, and MIG partitioning.
- Ray/Dask for distributed training/inference and hyperparameter sweeps.
- Feature stores (e.g., Feast) and streaming features (Pub/Sub/Kafka).
- Inference serving frameworks: TorchServe, Triton Inference Server, FastAPI + Uvicorn/Gunicorn, or Vertex AI endpoints.
- Batch & real-time pipelines: Apache Beam/Dataflow, Spark, or Flink.
- Cost optimization playbook on GCP: preemptibles/spot, autoscaling policies, right-sizing, per-project budget alerts.
- Testing for ML: pytest fixtures for data/model tests, golden datasets, regression tests, property-based tests.
- Experience with service proxies (Traefik/Nginx), DNS management, certificate management, and SSL/TLS automation.
- Familiarity with edge/embedded deployments for CV models is a plus.
Benefits
We believe great work starts with feeling valued and supported. That's why we are building a thoughtful, competitive set of benefits and perks to help you thrive, both professionally and personally, through every step of your career with us. You will be eligible for:
- Salary from 2,500 EUR to 5,500 EUR per month (before taxes)
- A Birthday Gift
After the probationary period:
- Health Insurance
- Health Recovery Days (which can be taken as needed)
- Paid Study Leave
- Funding for the purchase of vision glasses after one (1) year of service
Join us in building a cleaner, smarter future, one quality process improvement at a time.
What We Do
Aerones is an innovative company that has developed robotic technology for wind turbine blade maintenance services, such as:
• Conductivity measurements and troubleshooting;
• Drainage hole cleaning;
• External inspection of the wind turbine blades;
• Internal inspection of the blades;
• Blade & Tower cleaning;
• Coating application on the leading edges;
• Leading-edge repair.
The technology in use is controlled remotely. In addition, it is compact and easily transportable.
Aerones is the first company in the world to provide these services using robotic technology: because the maintenance process does not require technicians to work at dangerous heights, it is much safer and more efficient, and it significantly reduces turbine downtime.







