REF Digital

Site Reliability Engineer (HPC / AI Infrastructure) | Bengaluru, India | On-Site

Bengaluru, Bengaluru Urban, Karnataka, IND

In-Office

Mid level

Agency • eCommerce • Design • SEO

The Role

Operate and optimise large-scale GPU/HPC clusters and networks that support AI/ML training workflows. Drive automation and Infrastructure as Code (IaC), improve monitoring and self-healing, tune server, storage, and network performance, build automation tools in Python, Golang, or Bash, enhance security, and participate in a 24x7 on-call rotation.

Location: Bengaluru, India (On-site – mandatory)
Employment Type: Full-time
Industry: AI / High-Performance Computing / Autonomous Systems
Start Date: ASAP

About the Role

We are partnering with a global technology leader building next-generation AI, machine learning, and high-performance computing (HPC) infrastructure supporting autonomous systems and advanced robotics.

We are seeking a Site Reliability Engineer (SRE) for HPC / AI infrastructure to maintain and optimise the large-scale GPU clusters, high-throughput networks, and distributed compute environments that power neural network training at scale.

This is a mission-critical role focused on reliability, automation, performance optimisation, and infrastructure scalability within complex AI/ML environments.

Key Responsibilities

Support and operate large-scale AI/ML cluster infrastructure on GPU platforms
Drive automation, configuration management, and scalable infrastructure deployment
Improve monitoring, alerting, and self-healing systems
Optimise server, storage, and network performance
Develop automation tools using Python, Golang, or Bash/Shell
Implement Infrastructure as Code (IaC) best practices
Enhance security posture across compute environments
Participate in 24x7 on-call rotation
Collaborate closely with AI/ML engineering teams to streamline neural network training workflows

Required Profile

Strong proficiency in Linux fundamentals and performance tuning
Experience with HPC workload schedulers (e.g., Slurm, LSF)
Experience managing parallel file systems and high-performance storage
Proficiency in Python, Golang, and/or Bash
Hands-on experience with configuration management tools (e.g., Ansible)
Experience with monitoring and observability tools (Prometheus, Grafana, Splunk, etc.)
Familiarity with containerisation and container orchestration technologies such as Kubernetes
Experience with GPU-based computing systems and high-throughput, low-latency networks is highly desirable
Bachelor’s degree in Computer Science, Engineering, Physics, or equivalent practical expertise
3+ years of relevant experience in site reliability, DevOps, or infrastructure engineering

What’s on Offer

Opportunity to work on cutting-edge AI and high-performance computing systems
Exposure to large-scale GPU clusters and distributed compute environments
High-impact engineering role supporting mission-critical AI initiatives
Collaborative, performance-driven technical culture
Competitive compensation and growth opportunities

Why Join

This is an opportunity to operate at the core of AI infrastructure powering advanced autonomous and machine learning systems. It is ideal for engineers who are passionate about reliability engineering, automation, and optimising large-scale compute environments.

The Company

What We Do

REF Digital is a Montreal-based digital agency that helps businesses thrive in a digital-first economy. It specializes in designing and engineering bespoke e-commerce platforms, apps, and digital experiences built for lasting impact. Formed from the digital team of Groupe LG2, the agency combines strategy, technology, and design to help brands navigate the digital economy and strengthen their online presence.