Senior AI Infrastructure & Platform Engineer - Riyadh, KSA

Posted 3 Hours Ago
4 Locations
In-Office
Senior level
Artificial Intelligence • Computer Vision • Software
The Role
Designs, deploys, and optimizes GPU-accelerated AI/ML infrastructure and compute clusters. Manages Nvidia platform tooling (Base Command Manager, AI Enterprise, Operators, NIMs/Blueprints), scheduling with Slurm or Kubernetes, Linux administration, scripting/automation, monitoring, and performance tuning. Collaborates with data scientists and ML engineers to define resource allocation, CI/CD pipelines, and operational best practices while enforcing security and backup policies.
Summary Generated by Built In
Role Overview

We are seeking a highly skilled Senior AI Infrastructure & Platform Engineer to join our client’s team in Riyadh. In this role, you’ll be responsible for building, managing, and optimizing scalable AI infrastructure and compute environments that support high-performance workloads, including GPU-accelerated AI/ML pipelines, cluster scheduling, and orchestration.

Key Responsibilities
  • Deploy, maintain, and optimize GPU-based compute clusters and infrastructure.
  • Manage and operate GPU orchestration tools and platforms such as:
    • Nvidia Base Command Manager (critical)
    • Nvidia AI Enterprise Suite
    • Nvidia GPU and Network Operators
    • Nvidia NIMs and Blueprints
  • Configure, deploy, and maintain compute workloads using scheduling and orchestration tools including:
    • Slurm (critical)
    • Vanilla Kubernetes
  • Install, configure, and maintain the underlying OS (e.g. Canonical Ubuntu) and supporting system software.
  • Monitor and troubleshoot infrastructure performance, availability, and reliability; ensure high uptime for AI/ML workloads.
  • Work with data scientists, ML engineers, and dev teams to define infrastructure requirements, resource allocation, and deployment workflows.
  • Develop automation scripts, CI/CD pipelines, and best practices for infrastructure provisioning and management.
  • Document architecture, configurations, and operational procedures; enforce security, compliance, and backup policies.

Requirements
Required Skills & Experience
  • Proven experience managing GPU-based AI/ML infrastructure and compute clusters.
  • Hands-on experience with:
    • Nvidia Base Command Manager
    • Nvidia AI Enterprise Suite
    • Nvidia GPU/Network Operators, NIMs, Blueprints
  • Strong experience with Slurm and/or Kubernetes orchestration.
  • Solid Linux system administration skills, preferably on Ubuntu or similar distributions.
  • Strong scripting/automation ability (e.g. Bash, Python, or relevant tooling) for provisioning, deployment, and maintenance.
  • Excellent troubleshooting and performance-tuning skills.
  • Experience collaborating with ML/data science teams and integrating infrastructure with their workflows.
  • Strong understanding of networking, security, resource allocation, and cluster management best practices.
Preferred Qualifications
  • Previous experience working in a high-performance computing (HPC) or AI-focused infrastructure team.
  • Knowledge of containerization, container orchestration, and GPUs in cloud or on-prem environments.
  • Experience with CI/CD, infrastructure-as-code (e.g. Terraform, Ansible), monitoring tools, and logging setups.
  • Familiarity with workload scheduling, job queuing, resource quotas, and GPU-shared environments.


The Company
Berlin
10 Employees
Year Founded: 2020

What We Do

DeepSource stands as a trusted partner for businesses seeking cutting-edge AI services in computer vision, natural language processing, and predictive analytics. With a particular focus on Arabic NLP and ChatGPT bot development, DeepSource is dedicated to empowering companies with groundbreaking solutions that streamline operations, optimize workflows, and enhance user experiences. Our commitment to excellence is evident in our approach to addressing a wide range of AI needs, from hiring top talent and managing end-to-end AI projects to providing tailored consulting and comprehensive training programs. DeepSource's team of experts is equipped with extensive knowledge and experience in various AI technologies, which enables them to develop and deploy advanced solutions across multiple industries. Our adaptive strategies and innovative methodologies allow businesses to stay competitive in today's rapidly evolving digital landscape.
