DeepSource

Senior AI Infrastructure & Platform Engineer - Riyadh, KSA

Posted 3 Hours Ago

4 Locations

In-Office

Senior level

Artificial Intelligence • Computer Vision • Software

The Role

Designs, deploys, and optimizes GPU-accelerated AI/ML infrastructure and compute clusters. Manages Nvidia platform tooling (Base Command Manager, AI Enterprise, Operators, NIMs/Blueprints), scheduling with Slurm or Kubernetes, Linux administration, scripting/automation, monitoring, and performance tuning. Collaborates with data scientists and ML engineers to define resource allocation, CI/CD pipelines, and operational best practices while enforcing security and backup policies.

Summary Generated by Built In

Role Overview

We are seeking a highly skilled Senior AI Infrastructure & Platform Engineer to join our client’s team in Riyadh. In this role, you’ll be responsible for building, managing, and optimizing scalable AI infrastructure and compute environments that support high-performance workloads, including GPU-accelerated AI/ML pipelines, cluster scheduling, and orchestration.

Key Responsibilities

Deploy, maintain, and optimize GPU-based compute clusters and infrastructure.
Manage and operate GPU orchestration tools and platforms such as:

Nvidia Base Command Manager (critical)
Nvidia AI Enterprise Suite
Nvidia GPU and Network Operators
Nvidia NIMs and Blueprints

Configure, deploy, and maintain compute workloads using scheduling and orchestration tools including:

Slurm (critical)
Vanilla Kubernetes

Install, configure, and maintain the underlying OS (e.g. Canonical Ubuntu) and supporting system software.
Monitor and troubleshoot infrastructure performance, availability, and reliability; ensure high uptime for AI/ML workloads.
Work with data scientists, ML engineers, and dev teams to define infrastructure requirements, resource allocation, and deployment workflows.
Develop automation scripts, CI/CD pipelines, and best practices for infrastructure provisioning and management.
Document architecture, configurations, and operational procedures; enforce security, compliance, and backup policies.

Requirements

Required Skills & Experience

Proven experience managing GPU-based AI/ML infrastructure and compute clusters.
Hands-on experience with:

Nvidia Base Command Manager
Nvidia AI Enterprise Suite
Nvidia GPU/Network Operators, NIMs, Blueprints

Strong experience with Slurm and/or Kubernetes orchestration.
Solid Linux system administration skills — preferably on Ubuntu or similar distributions.
Strong scripting/automation ability (e.g. Bash, Python, or relevant tooling) for provisioning, deployment, and maintenance.
Excellent troubleshooting and performance-tuning skills.
Experience collaborating with ML/data science teams and integrating infrastructure with their workflows.
Strong understanding of networking, security, resource allocation, and cluster management best practices.

Preferred Qualifications

Previous experience working in a high-performance computing (HPC) or AI-focused infrastructure team.
Knowledge of containerization, container orchestration, and GPUs in cloud or on-prem environments.
Experience with CI/CD, infrastructure-as-code (e.g. Terraform, Ansible), monitoring tools, and logging setups.
Familiarity with workload scheduling, job queuing, resource quotas, and GPU-shared environments.

The Company

Berlin

10 Employees

Year Founded: 2020

What We Do

DeepSource stands as a trusted partner for businesses seeking cutting-edge AI services in computer vision, natural language processing, and predictive analytics. With a particular focus on Arabic NLP and ChatGPT bot development, DeepSource is dedicated to empowering companies with groundbreaking solutions that streamline operations, optimize workflows, and enhance user experiences. Our commitment to excellence is evident in our approach to addressing a wide range of AI needs, from hiring top talent and managing end-to-end AI projects to providing tailored consulting and comprehensive training programs. DeepSource's team of experts is equipped with extensive knowledge and experience in various AI technologies, which enables them to develop and deploy advanced solutions across multiple industries. Our adaptive strategies and innovative methodologies allow businesses to stay competitive in today's rapidly evolving digital landscape.