DevOps & SysOps Architect

Cairo, EGY
In-Office
Senior level
Software
The Role
This role combines SysOps and DevOps, focusing on client-facing presales and technical execution. The candidate will design and maintain complex GPU and HPC environments, optimize Kubernetes clusters, and lead architecture decisions for robust solutions, requiring significant experience in Linux systems and infrastructure engineering.
Summary Generated by Built In

We are seeking an exceptional Senior Lead who combines deep hands-on SysOps/HPC expertise with the strategic vision of a solution architect. This is a rare dual-track role: you operate at the intersection of elite technical execution and client-facing presales, designing and running mission-critical GPU, HPC, and Kubernetes platforms while co-creating opportunities with our commercial teams.


This role combines SysOps and HPC depth with DevOps. You are expected to spend at least 60% of your time on implementation and technical execution.

What You Will Do

Presales & Business Development

•       Partner with sales and solution teams to identify and qualify new opportunities

•       Lead or support technical presales activities: discovery workshops, RFP responses, architecture presentations

•       Build and deliver proof-of-concepts (POCs) that demonstrate platform capabilities to prospective clients

•       Prepare high-quality technical materials

•       Act as a trusted technical advisor during client conversations, proposing solutions aligned to business goals


In-Account Delivery — SysOps & DevOps Execution

•       Operate directly within client accounts as a senior SysOps/DevOps engineer

•       Run, troubleshoot, and optimize production-grade Kubernetes clusters and GPU/HPC environments hands-on

•       Own Linux system administration at a deep level: kernel tuning, storage, networking, performance profiling

•       Implement and maintain IaC pipelines, GitOps workflows, and CI/CD systems

•       Serve as the senior escalation point for complex operational incidents within accounts


Architecture & Solution Design

•       Design end-to-end platform architectures spanning cloud, hybrid, and on-premises HPC environments

•       Define workload isolation models, networking architectures, and storage strategies for multi-tenant platforms

•       Recommend and validate technology choices aligned to client scale, budget, and team maturity

•       Produce architecture decision records (ADRs), solution blueprints, and technical runbooks

Technical Competencies & Requirements

1. Architecture & System Design

•       Design production-grade multi-cluster Kubernetes platforms:

◦       RKE2, EKS (AWS), AKS (Azure) at enterprise scale

◦       GPU-aware clusters: NVIDIA H100 / A100 / B200 node pools

◦       Hybrid cloud + on-premises HPC infrastructure

•       Define and document:

◦       Workload isolation: namespaces, MIG partitioning, multi-tenancy models

◦       Networking: BGP peering, Ingress controllers, service mesh (Istio / Cilium)

◦       Storage: Longhorn, Ceph, distributed and high-throughput file systems


2. Platform Engineering & GitOps Strategy

•       Define and enforce platform standards across the delivery lifecycle

•       GitOps tooling: ArgoCD, Fleet — declarative cluster management

•       CI/CD pipelines: Azure DevOps, Jenkins — build, test, promote

•       Infrastructure as Code: Terraform (modules, remote state, workspaces), Ansible

•       Standardize cluster bootstrapping, app deployment lifecycle, environment promotion (Dev → QA → Prod)


3. AI / GPU Infrastructure Architecture  (Priority Competency)

•       Design and operate GPU compute platforms at scale:

◦       GPU Operator deployment and lifecycle management

◦       MIG (Multi-Instance GPU) partitioning for multi-tenant workloads

◦       Advanced scheduling: Run:AI, Kubernetes-native GPU scheduling (device plugins)

•       Understand AI workload classes and their infrastructure implications:

◦       Distributed training workloads (data/model/pipeline parallelism)

◦       Inference pipelines — NVIDIA Triton Inference Server, TensorRT optimization

•       Align infrastructure to the full AI stack:

◦       CUDA stack, cuDNN, NCCL collective communication libraries

◦       High-speed networking: InfiniBand (HDR/NDR), RoCE for RDMA

◦       GPUDirect RDMA / GPUDirect Storage for low-latency data paths


4. Observability & Reliability Engineering

•       Define and implement full-stack observability:

◦       Metrics: Prometheus, Thanos (long-term retention, multi-cluster)

◦       Logs: Loki, Fluent Bit

◦       GPU telemetry: DCGM Exporter, NVIDIA Nsight Systems

•       Build operational frameworks:

◦       SLO / SLA definitions and error budget tracking

◦       Alerting strategy — noise reduction, severity routing

◦       Incident response playbooks and on-call runbooks


5. Security & Multi-Tenancy Architecture

•       Design zero-trust security postures for multi-tenant platforms

•       Secret management: HashiCorp Vault, External Secrets Operator

•       Identity and access: IAM, RBAC, SSO/OIDC integration

•       Network isolation: NetworkPolicy, micro-segmentation, mTLS

•       Secure GPU sharing: MIG isolation, vGPU licensing, tenant boundary enforcement


6. HPC, Data & Storage Architecture  (Priority Competency)

•       Understand high-performance storage for AI/HPC workloads:

◦       GPUDirect Storage — bypassing CPU for GPU-native I/O

◦       Distributed file systems: Weka (high-throughput NFS/S3), Ceph (scalable object/block)

◦       Storage tiering, caching strategies, and data lifecycle management

•       Size and validate storage architectures against workload I/O profiles


7. Operational Leadership & Linux Systems

•       Lead incident response and root cause analysis (RCA) for critical production issues

•       Define upgrade strategies, change management procedures, and disaster recovery plans

•       Write and maintain runbooks, operational playbooks, and knowledge base content

•       Integrate organizational processes, compliance requirements, and security policies into operational frameworks

•       Deep Linux expertise:

◦       Kernel tuning (CPU governor, NUMA, IRQ affinity, hugepages)

◦       Storage I/O scheduling, NVMe optimization

◦       Network stack tuning for RDMA / InfiniBand

◦       System performance profiling and bottleneck analysis


Candidate Profile — Who You Are

•       You are comfortable running production systems

•       You have stronger SysOps and HPC depth than DevOps breadth, and you embrace that identity

•       You can shift fluidly between running a live incident, presenting an architecture to a CTO, and reviewing a POC demo environment

•       You communicate technical complexity clearly — to engineers and to C-level stakeholders

•       You understand why specific tooling choices matter (not just how to configure them) and can articulate trade-offs in presales conversations

•       You are comfortable owning outcomes across both commercial (presales) and delivery (operations) dimensions

•       You thrive in ambiguity and can scope both short POCs and long-horizon platform programs


Requirements

Required

•       10+ years in platform/infrastructure engineering, with at least 2 years in an architect-level role

•       Proven hands-on experience operating Kubernetes at scale in production (multi-cluster, multi-tenant)

•       Significant Linux systems administration experience — kernel, networking, storage at a low level

•       HPC and/or GPU infrastructure experience — physical GPU servers, NCCL, InfiniBand, or high-speed fabrics

•       Demonstrable presales or client-facing experience

•       IaC experience: Terraform and/or Ansible in production environments

•       Strong understanding of GitOps and CI/CD pipelines in enterprise settings


Strongly Preferred

•       Experience with NVIDIA GPU Operator, MIG partitioning, Run:AI, or equivalent GPU scheduling tooling

•       Knowledge of distributed AI training infrastructure (PyTorch DDP, Horovod, DeepSpeed) from an infrastructure perspective

•       Familiarity with NVIDIA Triton Inference Server or TensorRT deployment pipelines

•       Experience with Weka, Ceph, or GPUDirect Storage in HPC/AI environments

•       Hands-on experience with Vault, External Secrets, and zero-trust network architectures

•       Exposure to bare-metal provisioning and HPC cluster management (Slurm, PBS, or equivalent)


Certifications (Advantageous)

•       CKA / CKS (Certified Kubernetes Administrator / Security Specialist)

•       RHCE / RHCA (Red Hat Certified Engineer / Architect)

•       AWS Solutions Architect / Azure Solutions Architect Expert

•       HashiCorp Terraform Associate or Vault Associate

•       NVIDIA DLI certifications (GPU computing, AI infrastructure)


Benefits
Why Integrant?
  • Competitive compensation package
  • PTO, full medical and dental coverage, etc.
  • Opportunity to travel and work onsite with U.S. customers
  • In-house Technical and English training programs
  • Dedicated learning time (check out our 4Plus1 Program)
  • Interest-free loans
  • Flexible work schedules
  • Perks: events, sponsored lunch, game area, rooftop hangout + more!

Skills Required

  • 10+ years in platform/infrastructure engineering
  • At least 2 years in an architect-level role
  • Experience operating Kubernetes at scale in production
  • Significant Linux systems administration experience
  • Experience with HPC and/or GPU infrastructure
  • Demonstrable presales or client-facing experience
  • IaC experience: Terraform and/or Ansible in production environments
  • Strong understanding of GitOps and CI/CD pipelines

The Company
HQ: San Diego, CA
263 Employees
Year Founded: 1992

What We Do

Integrant, Inc. is a custom software development company focused on providing tailor-made software solutions to fit your needs to a tee. We strive to uncover your pain points and identify how our team can seamlessly integrate with you and your business for a one-team approach. Our guiding principle is to always do the right thing for our customers and employees. Some days this means happy news of a “hit on the mark” demo, successful launch, or challenging problem solved. Other days this means making hard decisions, asking tough questions, or working more than we planned. Every day, it means doing our best to provide the highest quality service to each of our customers. We do that by investing our people in you and inspiring a people-to-people connection so when we say, “we share your goals,” we truly mean it.
