Location: Bengaluru, India (On-site mandatory)
Employment Type: Full-time
Industry: AI / Autonomous Systems / Advanced Engineering Infrastructure
Start Date: ASAP
We are seeking a Staff Site Reliability Engineer (SRE) – Engineering Tools to support and scale critical engineering platforms that power advanced AI, machine learning, simulation, and autonomous technology development.
This is a senior-level role operating at the intersection of reliability engineering, internal developer platforms, tooling automation, and large-scale infrastructure performance. You will play a key role in ensuring that engineering teams have highly reliable, scalable, and secure systems to accelerate innovation.
Key Responsibilities
Own reliability, scalability, and performance of engineering tools and infrastructure platforms
Design and implement automation frameworks for system deployment and configuration management
Improve observability, monitoring, and self-healing capabilities across engineering environments
Troubleshoot complex Linux-based systems and optimize performance
Develop automation and internal tooling using Python, Golang, or Bash
Implement Infrastructure-as-Code best practices
Strengthen security posture across engineering systems
Partner with cross-functional teams to streamline development workflows
Participate in on-call rotation for critical systems
Skills Required
Strong expertise in Linux systems administration and performance optimization
Experience with distributed systems and large-scale infrastructure environments
Proficiency in Python, Golang, and/or Bash scripting
Hands-on experience with configuration management tools (e.g., Ansible)
Experience with monitoring and observability platforms (Prometheus, Grafana, Splunk, etc.)
Familiarity with container orchestration technologies such as Kubernetes
Experience supporting developer platforms, CI/CD tooling, or internal engineering systems
Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent practical experience)
Significant experience in Site Reliability Engineering, DevOps, or platform engineering (Staff-level seniority)
Ownership of mission-critical engineering systems
Architectural input into reliability and scalability strategy
Mentorship of junior SREs and platform engineers
Direct impact on high-scale AI and autonomous development environments
Opportunity to work on cutting-edge engineering infrastructure
High-impact role supporting AI and advanced technology platforms
Collaborative, engineering-driven culture
Competitive compensation and long-term career growth
What We Do
REF Digital is a Montreal-based digital agency that helps businesses thrive in a digital-first economy. The agency specializes in designing and engineering bespoke e-commerce platforms, apps, and digital experiences built for lasting impact. Formed from the digital team of Groupe LG2, it combines strategy, technology, and design to help brands navigate the digital economy and strengthen their online presence.









