Location: Bengaluru, India (On-site – mandatory)
Employment Type: Full-time
Industry: AI / High-Performance Computing / Autonomous Systems
Start Date: ASAP
We are partnering with a global technology leader building next-generation AI, machine learning, and high-performance computing (HPC) infrastructure supporting autonomous systems and advanced robotics.
We are seeking a Site Reliability Engineer (SRE) – HPC / AI Infrastructure to maintain and optimise large-scale GPU clusters, high-throughput networks, and distributed compute environments that power neural network training at scale.
This is a mission-critical role focused on reliability, automation, performance optimisation, and infrastructure scalability within complex AI/ML environments.
Key Responsibilities
Support and operate large-scale AI/ML cluster infrastructure on GPU platforms
Drive automation, configuration management, and scalable infrastructure deployment
Improve monitoring, alerting, and self-healing systems
Optimise server, storage, and network performance
Develop automation tools using Python, Golang, or Bash/Shell
Implement Infrastructure as Code (IaC) best practices
Enhance security posture across compute environments
Participate in a 24x7 on-call rotation
Collaborate closely with AI/ML engineering teams to streamline neural network training workflows
Strong proficiency in Linux fundamentals and performance tuning
Experience with HPC workload schedulers (e.g., Slurm, LSF)
Experience managing parallel file systems and high-performance storage
Proficiency in Python, Golang, and/or Bash
Hands-on experience with configuration management tools (e.g., Ansible)
Experience with monitoring and observability tools (Prometheus, Grafana, Splunk, etc.)
Familiarity with containerisation and orchestration technologies such as Kubernetes
Experience with GPU-based computing systems and high-throughput, low-latency networks is highly desirable
Bachelor’s degree in Computer Science, Engineering, Physics, or equivalent practical expertise
3+ years of relevant experience in site reliability, DevOps, or infrastructure engineering
Opportunity to work on cutting-edge AI and high-performance computing systems
Exposure to large-scale GPU clusters and distributed compute environments
High-impact engineering role supporting mission-critical AI initiatives
Collaborative, performance-driven technical culture
Competitive compensation and growth opportunities
This is an opportunity to operate at the core of AI infrastructure powering advanced autonomous and machine learning systems. Ideal for engineers passionate about reliability engineering, automation, and optimising high-scale compute environments.
What We Do
REF Digital is a Montreal-based digital agency that helps businesses thrive in a digital-first economy. They specialize in designing and engineering bespoke e-commerce platforms, apps, and digital experiences engineered for lasting impact. Formed from the digital team of Groupe LG2, the agency combines strategy, technology, and design to help brands navigate the digital economy and propel their online presence to a new level.