Senior Kubernetes Engineer

Posted 2 Days Ago
Dallas, TX, USA
In-Office
Senior level
Artificial Intelligence • Cloud • Machine Learning • Infrastructure as a Service (IaaS)
The Role
Design, implement, and operate GPU-accelerated Kubernetes clusters for HPC/AI workloads. Build custom operators/controllers, integrate NVIDIA device plugins and MIG, optimize scheduling and GPU utilisation, implement observability and security policies, maintain GitOps CI/CD and infrastructure-as-code, and support performance tuning and incident response.

The Company

NorthMark Compute & Cloud (NMC²) is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports its clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.

The Position

We are seeking a highly skilled Senior Kubernetes Engineer to join our NMC² office in Dallas.
In this role, you will design, implement, and optimise GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments.
You will have deep expertise in both the NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.

Responsibilities

  • Architecting and operating Kubernetes clusters optimised for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM

  • Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services

  • Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer

  • Optimising GPU utilisation and job placement through scheduler extensions and job placement tools, such as kube-scheduler plugins, Slurm and Volcano

  • Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance

  • Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry

  • Implementing secure multi-user and multi-namespace GPU isolation, using RBAC and policy enforcement tools such as OPA or Gatekeeper

  • Maintaining CI/CD pipelines for Kubernetes infrastructure using GitOps tools such as ArgoCD and FluxCD

  • Contributing to infrastructure-as-code, using Terraform, Helm, and Kustomize

  • Participating in performance tuning, incident response and production readiness reviews

Requirements

  • Extensive experience with Kubernetes in production-grade environments and with the NVIDIA tooling for Kubernetes, including GPU Operator, device plugin, NVML, MIG and DCGM

  • Proficiency in Go or Python for operator development and Kubernetes controller logic

  • Deep understanding of Kubernetes internals, including CRDs, RBAC, custom controllers and scheduler extensions

  • Experience with GPU-intensive workloads, such as LLM training, ML pipelines and scientific computing

  • Hands-on experience with Helm, Kustomize and GitOps workflows

  • Familiarity with CNI plugins, especially NVIDIA CNI and Multus

  • Experience with monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter

It is impossible to list every requirement for, or responsibility of, any position. Similarly, we cannot identify all the skills a position may require since job responsibilities and the Company’s needs may change over time. Therefore, the above job description is not comprehensive or exhaustive. The Company reserves the right to adjust, add to or eliminate any aspect of the above description. The Company also retains the right to require all employees to undertake additional or different job responsibilities when necessary to meet business needs.

Must be legally authorized to work in the United States without the need for employer sponsorship, now or at any time in the future.

Benefits & Perks:

  • Company-Paid Lunch Stipend: Lunch is provided via GrubHub

  • Company-Paid Benefits: 100% Employer-Paid Medical in our High Deductible Health Plan, Dental and Vision benefits for employees and their families, 16 weeks of Paid Parental Leave, Employee Assistance Program, Life insurance, Short-Term Disability and Long-Term Disability

  • 401(k): Company will match 100% of your contributions up to 6%

  • Optional Employee-Paid Benefits: Medical insurance in our PPO plan and a variety of other benefits such as Health Savings Accounts (with Company Contribution!), Flexible Spending Accounts, Supplemental Life Insurance, Wellhub and more.

  • Time Off: 25 days of Paid Time Off plus 12 company holidays

EQUAL OPPORTUNITY EMPLOYER

NORTHMARK STRATEGIES LLC IS AN EQUAL EMPLOYMENT OPPORTUNITY EMPLOYER. THE COMPANY'S POLICY IS NOT TO DISCRIMINATE AGAINST ANY APPLICANT OR EMPLOYEE BASED ON RACE, COLOR, RELIGION, NATIONAL ORIGIN, GENDER, AGE, SEXUAL ORIENTATION, GENDER IDENTITY OR EXPRESSION, MARITAL STATUS, MENTAL OR PHYSICAL DISABILITY, GENETIC INFORMATION, OR ANY OTHER BASIS PROTECTED BY APPLICABLE LAW. THE FIRM ALSO PROHIBITS HARASSMENT OF APPLICANTS OR EMPLOYEES BASED ON ANY OF THESE PROTECTED CATEGORIES.

Skills Required

  • Extensive experience with Kubernetes in production including NVIDIA GPU Operator, device plugin, NVML, MIG and DCGM
  • Proficiency in Go or Python for operator development and Kubernetes controller logic
  • Deep understanding of Kubernetes internals including CRDs, RBAC, custom controllers and scheduler extensions
  • Experience with GPU-intensive workloads such as LLM training, ML pipelines or scientific computing
  • Hands-on experience with Helm, Kustomize and GitOps workflows
  • Familiarity with CNI plugins, especially NVIDIA CNI and Multus
  • Experience monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter
  • Experience developing, deploying and maintaining custom Kubernetes operators and controllers
  • Experience with scheduler extensions and job placement tools such as kube-scheduler plugins, Slurm or Volcano
  • Experience with CI/CD for Kubernetes infrastructure using GitOps tools like ArgoCD or FluxCD and infrastructure-as-code (Terraform)
The Company
157 Employees

What We Do

NorthMark Strategies is a strategic capital firm that combines investment capital with engineering and technology to build enduring businesses. The firm operates a High-Performance Computing platform and supports simulation, AI/ML-enabled engineering and data-driven design to accelerate portfolio companies. NorthMark deploys capital, operates complex businesses, and builds infrastructure (including compute and cloud services) to drive long‑term innovation and operational outcomes.
