Data & ML Ops

Posted 14 Days Ago
Madinah, SAU
In-Office
Senior level
Cloud • eCommerce • Information Technology • Software
The Role
The role involves designing, scaling, and securing production workloads across Kubernetes clusters, optimizing performance, ensuring reliability and availability, and implementing CI/CD and observability practices.
Summary Generated by Built In

We are looking for a Senior Site Reliability Engineer (SRE) to help design, scale, and secure our rapidly growing platform infrastructure.
You will work across all critical systems — from customer-facing applications and APIs to internal platforms and data services — ensuring availability, performance, and cost efficiency at scale.

You’ll be hands-on with Kubernetes, observability, GitOps, automation, and cloud infrastructure, while partnering closely with application, platform, and data teams to deliver a highly reliable and self-healing environment.

This role is ideal for an engineer who thrives on complex distributed systems, loves to automate everything, and can balance speed, stability, and cost-efficiency in production.

  • Bachelor’s degree in Computer Science, Engineering, or a related field — or equivalent work experience.
Platform & Infrastructure Reliability
  • Design, deploy, monitor, and maintain production workloads across Kubernetes (EKS/AKS/GKE) clusters.
  • Build self-healing, auto-scaling systems that minimize toil and manual intervention.
  • Optimize networking, ingress/egress traffic control, and service mesh for secure & performant communication.
  • Design and operate reliable database and storage platforms (SQL, NoSQL, and object stores) in Kubernetes environments.
  • Own backup, disaster recovery, replication, and failover strategies to meet RPO/RTO targets for critical data services.
  • Optimize storage performance and cost through multi-tier strategies, hot/cold data separation, and S3/offloading lifecycle policies.
  • Troubleshoot and recover Kubernetes Persistent Volumes confidently during incidents (StorageClasses, CSI drivers, PVC issues).
  • Secure and scale object storage platforms (e.g., MinIO/S3-compatible) and integrate with workloads for high-throughput data pipelines.
  • Work with block storage (EBS/io2/gp3) and shared file systems (EFS, NFS) to balance performance, resiliency, and cost.
Automation & Delivery
  • Champion GitOps and CI/CD best practices (ArgoCD, Flux, GitHub Actions).
  • Build automation for infrastructure provisioning and upgrades using Terraform, Helm, and Kubernetes Operators.
  • Reduce release risk through progressive delivery strategies (blue/green, canary, spot instance rolling updates).
Observability & Incident Response
  • Own the monitoring and alerting stack (Prometheus, Grafana, Loki, VictoriaMetrics, OpenSearch).
  • Lead incident management and postmortems to prevent recurrence.
  • Provide real-time visibility into system health, performance, and cost metrics.
Security & Compliance
  • Implement least-privilege IAM policies, secure service-to-service communication, and network ACLs/firewalls.
  • Enforce Kubernetes RBAC, secret management, and secure image supply chain.
  • Participate in audit readiness and compliance efforts.
Performance & Cost Optimization
  • Analyze and tune system performance under scale (CPU/memory/IO).
  • Partner with product and platform teams to right-size clusters, databases, and storage tiers.
  • Introduce cost visibility dashboards for engineering leadership.

Preferred Qualifications
  • Experience managing mission-critical systems at scale (high traffic, multi-region).
  • Proven cost optimization in cloud/K8s environments.
  • Familiarity with service mesh (Istio, Linkerd) or advanced networking/egress control.
  • Experience with data platform components (Airflow, Debezium, ClickHouse, etc.) is a plus but not required.
  • Strong communication and teamwork skills; able to collaborate across engineering, DevOps, security, and product teams.


Requirements
  • 8+ years in SRE / DevOps / Infrastructure Engineering roles.
  • Deep Kubernetes expertise (multi-cluster, Helm chart development, advanced networking).
  • Strong GitOps workflows using ArgoCD/Flux.
  • Expertise with AWS (preferred) or Azure/GCP, plus Infrastructure-as-Code (Terraform, Pulumi, CloudFormation).
  • Advanced knowledge of SQL & NoSQL databases (MySQL/Aurora, PostgreSQL, MongoDB, Redis).
  • Scripting/automation skills in Python, Bash, or Go.
  • Solid background in monitoring/observability (Prometheus, Grafana, Loki, ELK/OpenSearch, VictoriaMetrics).
  • Experience with CI/CD at scale and managing production incidents.
  • Experience with streaming/messaging (Kafka, RabbitMQ, or similar).

Benefits
  • Comprehensive Training & Development programs.
  • Performance-based Bonus incentives.
  • Flexible Work From Home options.

Skills Required

  • Bachelor's degree in Computer Science, Engineering, or a related field
  • 8+ years in SRE / DevOps / Infrastructure Engineering roles
  • Deep Kubernetes expertise
  • Strong GitOps workflows using ArgoCD/Flux
  • Expertise with AWS or Azure/GCP
  • Advanced knowledge of SQL & NoSQL databases
  • Scripting/automation skills in Python, Bash, or Go
  • Solid background in monitoring/observability tools
  • Experience with CI/CD at scale

The Company
0 Employees
Year Founded: 2016

What We Do

Salla is the leading commerce platform in the GCC, built in Saudi Arabia, providing tools and services for merchants to build, run, and grow their online stores.
