Sr Engineer - Compute

Hiring Remotely in Gurugram, Haryana, IND
In-Office or Remote
Senior level
Cloud • Information Technology
The Role
Provide Tier 3 operational support for HPC compute clusters: incident/change management, firmware/software maintenance, performance assessment, cross-team troubleshooting, vendor escalation, customer communication, documentation, on-call rotation, and training/certification completion.
AHEAD builds platforms for digital business. By weaving together advances in cloud infrastructure, automation and analytics, and software delivery, we help enterprises deliver on the promise of digital transformation.

At AHEAD, we prioritize creating a culture of belonging, where all perspectives and voices are represented, valued, respected, and heard. We create spaces to empower everyone to speak up, make change, and drive the culture at AHEAD. 

We are an equal opportunity employer, and do not discriminate based on an individual's race, national origin, color, gender, gender identity, gender expression, sexual orientation, religion, age, disability, marital status, or any other protected characteristic under applicable law, whether actual or perceived. 

We embrace all candidates that will contribute to the diversification and enrichment of ideas and perspectives at AHEAD. 

The High-Performance Computing Compute Engineer is primarily responsible for the overall health and maintenance of the physical cluster and server technologies in our Managed Services customers' environments. Our Compute Engineers are valued members of the Managed Services Infrastructure Practice, responsible for Tier 3 incident management, service request management, and change management infrastructure support for all Managed Services customers.

Principal Duties and Responsibilities 

  • Provide enterprise-level operational support to Managed Services customers for incident, problem, and change management activities 
  • Plan and perform software and firmware maintenance activities 
  • Assess customer environments for performance and design issues and propose resolutions 
  • Work across technical teams to troubleshoot complex infrastructure issues 
  • Create and maintain detailed documentation 
  • Serve as a subject matter expert and escalation point for compute technologies 
  • Work with vendors to resolve compute issues 
  • Communicate transparently with customers and internal teams 
  • Participate in on-call rotation 
  • Complete assigned training and certifications to further skills and knowledge 

Education and Experience

  • Bachelor’s degree or equivalent in Information Systems or a related field. Unique education, specialized experience, skills, knowledge, training, or certification may be substituted for education 
  • 5+ years of advanced Linux administration and troubleshooting 
  • 5+ years managing Red Hat OpenShift Kubernetes and Virtualization clusters 
  • 5+ years of expert-level experience managing infrastructure in high-performance computing environments, including configuration, troubleshooting, and best practices 
  • 2+ years of experience with NVIDIA DGX preferred 
  • Experience with HPC schedulers (e.g., SLURM, Kubernetes, PBS, Run:ai) required 
  • Proficient in physical server environments 
  • Experience configuring, maintaining and troubleshooting containers 
  • Experience with storage technology (e.g., Ceph or Vast Data Platform) and distributed file systems (e.g., Lustre, GPFS, NFS, GlusterFS) 
  • Experience with machine learning or data science workflows in HPC/AI environments 
  • 1+ years working with monitoring platforms (e.g., Prometheus, Grafana); Elastic Observability experience is a bonus 
  • 1+ years working with an enterprise ITSM system; ServiceNow experience is a bonus 
  • Previous experience with automation tools such as Ansible, Puppet, or Chef is a plus 
  • Managed Services or consulting experience is required 
  • Strong background with customer service 
  • High-level problem-solving and communication skills 
  • Strong oral and written communication skills 
  • Related Linux, NVIDIA, Scheduler, Containerization, Virtualization, and Clustering certifications are a bonus 

Why AHEAD:

Through our daily work and internal groups like Moving Women AHEAD and RISE AHEAD, we value and benefit from diversity of people, ideas, experience, and everything in between.

We fuel growth by stacking our office with top-notch technologies in a multi-million-dollar lab, by encouraging cross-department training and development, and by sponsoring certifications and credentials for continued learning.

India Employment Benefits include: 
  • Comprehensive health insurance coverage for employees, with options to extend coverage to dependents
  • Paid time off and company holidays, along with additional leave benefits as per policy
  • Flexible work arrangements, supporting work-life balance
  • Learning and development opportunities to support continuous growth and upskilling
  • Employee wellness initiatives and programs focused on physical and mental well-being
  • Retirement and statutory benefits in line with India regulations
  • Inclusive and people-first culture, with a strong focus on collaboration and ownership

Top Skills

Ansible
Ceph
Chef
Containers
Elastic Observability
GlusterFS
GPFS
Grafana
Kubernetes
Linux
Lustre
NFS
NVIDIA DGX
PBS
Prometheus
Puppet
Red Hat OpenShift
Run:ai
ServiceNow
Slurm
Vast Data Platform
Virtualization
The Company
HQ: Chicago, IL
1,154 Employees
Year Founded: 2007

What We Do

AHEAD builds platforms for digital business. By weaving together cloud infrastructure, intelligent operations, and modern applications, we help enterprises deliver on the promise of digital transformation.
