Software Engineer, HPC Scheduling

Posted 3 Days Ago
Dallas, TX, USA
In-Office
Mid level
Artificial Intelligence • Cloud • Machine Learning • Infrastructure as a Service (IaaS)
The Role
Build and maintain scalable, distributed HPC scheduling systems on Kubernetes (focus on Armada). Develop in Golang, operate containerized workloads, and optimize data interactions (Postgres), working across monitoring (Prometheus/Grafana), messaging (Kafka/Pulsar), networking, and CI/CD while troubleshooting Linux-based infrastructure at scale.
Summary Generated by Built In

The Company

NorthMark Compute & Cloud (NMC²) is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports its clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.

The Position

The HPC Scheduling team develops and manages a large high-performance compute (HPC) platform that enables the business to conduct complex research at scale. We are seeking a highly motivated person to join our team and help us continue to push the envelope in running batch workloads on Kubernetes.

The ideal candidate will have an active interest in Kubernetes and batch computing, a broad range of experience with software engineering and development, as well as experience managing large-scale infrastructure and complex tooling environments.

The main focus will be on Armada, an exciting open-source CNCF project built and maintained by the team, which we use to solve multi-cluster Kubernetes batch job scheduling at scale.

You’ll join an experienced team working at the cutting edge of large-scale ML workloads.

Responsibilities

  • Designing and developing high-quality software solutions using procedural programming languages, with a focus on Golang
  • Building and maintaining highly scalable, highly available and globally distributed systems to support large-scale research workloads
  • Managing and optimising data interactions across relational and non-relational databases, particularly PostgreSQL
  • Developing and operating containerised applications within Kubernetes, ensuring effective orchestration and workload scheduling
  • Supporting, tuning and troubleshooting Linux-based systems as part of our core compute platform
  • Applying core networking knowledge to help debug, optimise and enhance platform connectivity and performance
  • Independently diagnosing and resolving complex technical issues across infrastructure and software layers
  • Applying solid software architecture principles, computer science fundamentals and data structure knowledge to guide design decisions and code quality
  • Driving continuous improvement by contributing to CI/CD pipelines and engineering best practices
  • Staying up to date with emerging technologies and approaches, and applying new knowledge across disciplines

Requirements

  • Experience with developing Kubernetes components, such as controllers and operators
  • Experience with event-driven programming and message queues, such as Apache Kafka and Pulsar
  • Experience of high-performance computing, Kubernetes, or DAG (Directed Acyclic Graph) workflows
  • Experience of running systems at scale using a cloud provider, ideally AWS
  • Use of operational and runtime tools and practices, including monitoring and logging with systems such as Prometheus and Grafana
  • Experience of operating or using job scheduling systems, such as SLURM

It is impossible to list every requirement for, or responsibility of, any position. Similarly, we cannot identify all the skills a position may require since job responsibilities and the Company’s needs may change over time. Therefore, the above job description is not comprehensive or exhaustive. The Company reserves the right to adjust, add to or eliminate any aspect of the above description. The Company also retains the right to require all employees to undertake additional or different job responsibilities when necessary to meet business needs.

Must be legally authorized to work in the United States without the need for employer sponsorship, now or at any time in the future.

Benefits & Perks:

  • Company-Paid Lunch Stipend: Lunch is provided via GrubHub

  • Company-Paid Benefits: 100% Employer-Paid Medical in our High Deductible Health Plan, Dental and Vision benefits for employees and their families, 16 weeks of Paid Parental Leave, Employee Assistance Program, Life insurance, Short-Term Disability and Long-Term Disability

  • 401(k): Company will match 100% of your contributions up to 6%

  • Optional Employee-Paid Benefits: Medical insurance in our PPO plan and a variety of other benefits such as Health Savings Accounts (with Company Contribution!), Flexible Spending Accounts, Supplemental Life Insurance, Wellhub and more.

  • Time Off: 25 days of Paid Time Off plus 12 company holidays

EQUAL OPPORTUNITY EMPLOYER

NORTHMARK STRATEGIES LLC IS AN EQUAL EMPLOYMENT OPPORTUNITY EMPLOYER. THE COMPANY'S POLICY IS NOT TO DISCRIMINATE AGAINST ANY APPLICANT OR EMPLOYEE BASED ON RACE, COLOR, RELIGION, NATIONAL ORIGIN, GENDER, AGE, SEXUAL ORIENTATION, GENDER IDENTITY OR EXPRESSION, MARITAL STATUS, MENTAL OR PHYSICAL DISABILITY, AND GENETIC INFORMATION, OR ANY OTHER BASIS PROTECTED BY APPLICABLE LAW. THE FIRM ALSO PROHIBITS HARASSMENT OF APPLICANTS OR EMPLOYEES BASED ON ANY OF THESE PROTECTED CATEGORIES.

Skills Required

  • Proficient in Golang
  • Developing Kubernetes components such as controllers and operators
  • Experience with event-driven programming and message queues (Apache Kafka, Pulsar)
  • Experience with high-performance computing, Kubernetes, or DAG workflows
  • Experience running systems at scale using a cloud provider (ideally AWS)
  • Experience with monitoring and logging tools such as Prometheus and Grafana
  • Experience operating or using job scheduling systems such as SLURM
  • Managing and optimizing data interactions across databases, particularly PostgreSQL
  • Experience supporting, tuning, and troubleshooting Linux-based systems
  • Developing and operating containerized applications and Kubernetes workload scheduling
  • Contributing to CI/CD pipelines and engineering best practices
  • Core networking knowledge for debugging and performance optimization

The Company
157 Employees

What We Do

NorthMark Strategies is a strategic capital firm that combines investment capital with engineering and technology to build enduring businesses. The firm operates a High-Performance Computing platform and supports simulation, AI/ML-enabled engineering and data-driven design to accelerate portfolio companies. NorthMark deploys capital, operates complex businesses, and builds infrastructure (including compute and cloud services) to drive long‑term innovation and operational outcomes.
