Comspark International Inc.

HPC Engineer/Architect

New York, NY, USA

In-Office

Expert/Leader

Information Technology • Professional Services • Software • Consulting

The Role

Support and operate large-scale Linux HPC clusters and parallel file systems (GPFS); deploy and maintain GPU-based compute infrastructure; manage and migrate job schedulers (LSF to Slurm); automate deployment, testing, CI/CD and user lifecycle with Ansible, Python and shell scripting; performance tune storage and networks (InfiniBand); consult researchers and maintain HPC benchmarks and monitoring.

Summary Generated by Built In

Job Title: HPC Engineer/Architect

Work Location: New York, NY 10001

Job Type: Contract

Work Type: Hybrid

Duration: 6+ Months

Work Authorization: USC/GC/H4/H1

Job Summary:

Support day-to-day operations of large-scale parallel file systems
Deploy and maintain Linux HPC infrastructure across multiple data centers
Assist HPC engineers and architects with day-to-day operations and tickets

Experience:

16 to 20 years

Required Skills:

Linux Operating Systems (RHEL/CentOS), Parallel file system (GPFS), Job Scheduler (LSF/Slurm)
Ansible, Python, Shell scripting
GPU-based compute infrastructure (including CUDA)
CentOS 4.5
HPCC

Responsibilities:

Design, architect and oversee implementation of Linux based HPC clusters and storage
Deploy physical hardware using HPC deployment tools and configuration and orchestration tools (Ansible)
Parallel file system (GPFS) performance tuning, monitoring and troubleshooting
Perform systems benchmarking and develop automated tests for the HPC environment to ensure the reliability and efficiency of the computational infrastructure
InfiniBand network maintenance and troubleshooting
Automate and monitor the HPC user lifecycle process
Slurm installation, configuration, performance tuning and troubleshooting
Plan, design and implement a transition from the LSF scheduler to Slurm
Manage the Slurm scheduler and translate research policies into scheduler configurations
Consult with faculty and students to develop research pipelines for use on the HPC cluster
Develop and maintain a user lifecycle software suite in Python and implement a CI/CD pipeline
Test and automate upgrades of critical system applications using Ansible and shell scripts
Communicate effectively with clinicians, researchers, and other team members to develop technological solutions

Qualifications:

Experience working in a large-scale research based HPC environment
Proven experience working with distributed file storage solutions (e.g., GPFS)
Experience with deploying and troubleshooting Linux Operating Systems (RHEL/CentOS)
Experience with Scripting and Automation (Ansible, Python, Shell Scripting)
Solid understanding of job schedulers (LSF/Slurm)
Experience with GPU-based compute infrastructure (including CUDA)

Skills Required

16 to 20 years of experience in HPC environments
Experience working in a large-scale research based HPC environment
Proven experience with parallel file systems (GPFS)
Experience deploying and troubleshooting Linux (RHEL/CentOS, CentOS 4.5)
Scripting and automation with Ansible, Python, and shell scripting
Solid understanding and experience with job schedulers (LSF and Slurm)
Experience with GPU-based compute infrastructure, including CUDA
Experience with HPCC systems
InfiniBand network maintenance and troubleshooting

The Company

What We Do

Comspark International Inc. is a global software and engineering consulting firm specializing in the development and maintenance of web-enabled solutions. The company provides a wide range of IT services, including software development, outsourcing, and enterprise application solutions such as ERP, AI, and blockchain. It assists clients across various industry domains by creating and implementing specialized technology solutions to drive business efficiency.