Compute Platform Engineer

Posted 3 Days Ago
Dallas, TX, USA
In-Office
Mid level
Artificial Intelligence • Cloud • Machine Learning • Infrastructure as a Service (IaaS)
The Role
Operate and improve large-scale HPC GPU/CPU platforms: manage firmware/BIOS lifecycles, troubleshoot hardware, automate health checks and IaC workflows, collaborate with vendors, perform capacity planning, and mentor junior engineers to ensure reliable, secure, scalable compute infrastructure.
Summary Generated by Built In

The Company

NorthMark Compute & Cloud (NMC²) is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports its clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.

The Position

The Compute Platform Engineer role is responsible for the day-to-day reliability, performance, and operational health of our high-performance compute platforms that support critical research and production workloads. This position focuses on maintaining and troubleshooting CPU and GPU infrastructure, coordinating with vendors, and ensuring systems operate consistently at scale. Working closely with platform, infrastructure, and operations teams, the role plays a key part in sustaining a stable compute environment.

We are seeking a highly skilled and motivated Engineer to join our Compute Platform Management team. In this role, you will take ownership of the reliability and operational excellence of our high-performance computing infrastructure, which underpins our firm’s research and production workloads.

As a Compute Platform Engineer, you will be responsible for identifying and resolving hardware issues, coordinating with vendors and ensuring compute nodes (CPU and GPU) maintain peak performance. This contract role is ideal for someone who thrives in technically demanding environments and is eager to contribute to the continuous evolution of our compute platform.

Responsibilities:

  • Design, configure, and manage a high-performance compute infrastructure made up of GPU and CPU nodes.

  • Manage the full firmware/BIOS lifecycle across our HPC/AI fleet – from baselines and validation through rollout and compliance. 

  • Troubleshoot hardware components (CPU, GPU, DPU, NVSwitch, NICs, memory, PSU, BMC) and guide replacement or configuration changes. Diagnose recurring hardware issues and automate their remediation to improve reliability and reduce recovery time.

  • Work on the latest AI platforms from day one (e.g., NVL72 / Grace Blackwell), ensuring they are stable, performant, and ready for production use. 

  • Monitor hardware performance, identify areas for improvement, and implement solutions.

  • Automate health checks and onboarding workflows to accelerate safe deployment. 

  • Collaborate with vendors on firmware issues – providing clear repro cases, logs, and impact to drive fixes and improvements. 

  • Recommend process, tooling, and architectural improvements to strengthen platform operations. 

  • Perform diagnostics, tuning, and capacity planning to ensure smooth scale-out.

  • Analyze existing hardware lifecycle processes and recommend improvements and optimizations.

  • Collaborate with various teams to integrate hardware improvements and align with organizational goals.

  • Implement best practices for security hardening of the platform and associated systems.

  • Mentor junior engineers and foster a culture of continuous learning and improvement.

  • Act as a subject matter expert, providing guidance and support for infrastructure-related issues.

  • Use Infrastructure as Code (IaC) methodologies to ensure efficient and scalable infrastructure management.

Requirements:

  • 3+ years of hands-on experience supporting large-scale compute platforms

  • Proficiency with HPE server infrastructure, such as ProLiant and Apollo, and NVIDIA GPUs, including A100 and H200

  • Solid understanding of server architecture, including UEFI/BIOS, PCIe devices, and out-of-band management systems such as iLO and BMC

  • Proven ability to resolve complex hardware issues and manage vendor relationships

  • Familiarity with automation tools such as Ansible, Terraform and CI/CD systems 

  • Working knowledge of Linux in high-performance or latency-sensitive environments

  • Working knowledge of basic network concepts, such as DNS, DHCP, VLANs, switching and routing 

  • Basic working knowledge of Kubernetes and OpenStack technologies (preferred but not required)

  • Experience with data center operations and process adherence

  • Excellent communication and coordination skills with cross-functional teams and external partners

It is impossible to list every requirement for, or responsibility of, any position.  Similarly, we cannot identify all the skills a position may require since job responsibilities and the Company’s needs may change over time.  Therefore, the above job description is not comprehensive or exhaustive.  The Company reserves the right to adjust, add to or eliminate any aspect of the above description.  The Company also retains the right to require all employees to undertake additional or different job responsibilities when necessary to meet business needs.

Must be legally authorized to work in the United States without the need for employer sponsorship, now or at any time in the future.

Benefits & Perks:

  • Company-Paid Lunch Stipend: Lunch is provided via GrubHub

  • Company-Paid Benefits: 100% Employer-Paid Medical in our High Deductible Health Plan, Dental and Vision benefits for employees and their families, 16 weeks of Paid Parental Leave, Employee Assistance Program, Life insurance, Short-Term Disability and Long-Term Disability

  • 401(k): Company will match 100% of your contributions up to 6%

  • Optional Employee-Paid Benefits: Medical insurance in our PPO plan and a variety of other benefits such as Health Savings Accounts (with Company Contribution!), Flexible Spending Accounts, Supplemental Life Insurance, Wellhub and more.

  • Time Off:  25 days of Paid Time Off plus 12 company holidays

EQUAL OPPORTUNITY EMPLOYER

NORTHMARK STRATEGIES LLC IS AN EQUAL EMPLOYMENT OPPORTUNITY EMPLOYER. THE COMPANY'S POLICY IS NOT TO DISCRIMINATE AGAINST ANY APPLICANT OR EMPLOYEE BASED ON RACE, COLOR, RELIGION, NATIONAL ORIGIN, GENDER, AGE, SEXUAL ORIENTATION, GENDER IDENTITY OR EXPRESSION, MARITAL STATUS, MENTAL OR PHYSICAL DISABILITY, AND GENETIC INFORMATION, OR ANY OTHER BASIS PROTECTED BY APPLICABLE LAW. THE FIRM ALSO PROHIBITS HARASSMENT OF APPLICANTS OR EMPLOYEES BASED ON ANY OF THESE PROTECTED CATEGORIES.

The Company
157 Employees

What We Do

NorthMark Strategies is a strategic capital firm that combines investment capital with engineering and technology to build enduring businesses. The firm operates a High-Performance Computing platform and supports simulation, AI/ML-enabled engineering and data-driven design to accelerate portfolio companies. NorthMark deploys capital, operates complex businesses, and builds infrastructure (including compute and cloud services) to drive long‑term innovation and operational outcomes.
