Site Reliability Engineer (HPC / AI Infrastructure) | Bengaluru, India | On-Site

Posted Yesterday
Bengaluru, Bengaluru Urban, Karnataka, IND
In-Office
Mid level
Agency • eCommerce • Design • SEO
The Role
Operate and optimise large-scale GPU/HPC clusters and networks, drive automation and IaC, improve monitoring and self-healing, tune server/storage/network performance, build automation tools (Python/Golang/Bash), enhance security, and participate in 24x7 on-call to support AI/ML training workflows.

Location: Bengaluru, India (On-site – mandatory)
Employment Type: Full-time
Industry: AI / High-Performance Computing / Autonomous Systems
Start Date: ASAP

About the Role

We are partnering with a global technology leader building next-generation AI, machine learning, and high-performance computing (HPC) infrastructure supporting autonomous systems and advanced robotics.

We are seeking a Site Reliability Engineer (SRE) – HPC / AI Infrastructure to maintain and optimise large-scale GPU clusters, high-throughput networks, and distributed compute environments that power neural network training at scale.

This is a mission-critical role focused on reliability, automation, performance optimisation, and infrastructure scalability within complex AI/ML environments.

Key Responsibilities
  • Support and operate large-scale AI/ML cluster infrastructure on GPU platforms

  • Drive automation, configuration management, and scalable infrastructure deployment

  • Improve monitoring, alerting, and self-healing systems

  • Optimise server, storage, and network performance

  • Develop automation tools using Python, Golang, or Bash/Shell

  • Implement Infrastructure as Code (IaC) best practices

  • Enhance security posture across compute environments

  • Participate in 24x7 on-call rotation

  • Collaborate closely with AI/ML engineering teams to streamline neural network training workflows

Required Profile
  • Strong proficiency in Linux fundamentals and performance tuning

  • Experience with HPC workload schedulers (e.g., Slurm, LSF)

  • Experience managing parallel file systems and high-performance storage

  • Proficiency in Python, Golang, and/or Bash

  • Hands-on experience with configuration management tools (e.g., Ansible)

  • Experience with monitoring and observability tools (Prometheus, Grafana, Splunk, etc.)

  • Familiarity with containerisation and container orchestration technologies such as Kubernetes

  • Experience with GPU-based computing systems and high-throughput, low-latency networks is highly desirable

  • Bachelor’s degree in Computer Science, Engineering, Physics, or equivalent practical expertise

  • 3+ years of relevant experience in site reliability, DevOps, or infrastructure engineering

What’s on Offer
  • Opportunity to work on cutting-edge AI and high-performance computing systems

  • Exposure to large-scale GPU clusters and distributed compute environments

  • High-impact engineering role supporting mission-critical AI initiatives

  • Collaborative, performance-driven technical culture

  • Competitive compensation and growth opportunities

Why Join

This is an opportunity to operate at the core of the AI infrastructure powering advanced autonomous and machine learning systems. The role is ideal for engineers passionate about reliability engineering, automation, and optimising high-scale compute environments.


The Company
0 Employees

What We Do

REF Digital is a Montreal-based digital agency that designs and engineers bespoke e-commerce platforms, apps, and digital experiences. Formed from the digital team of Groupe LG2, the agency combines strategy, technology, and design to help brands compete in a digital-first economy and strengthen their online presence.
