Senior Site Reliability Engineer

Hiring Remotely in Canada
Remote
Mid level
Information Technology
The Role
As a Senior Site Reliability Engineer on Glia's Infrastructure Team, you will manage cloud-native infrastructure, ensure system performance, automate processes, and collaborate on security compliance projects.
About Glia

Our award-winning technology powers conversations with customers for some of the world’s largest enterprises. We believe that combining the human touch with technology is the best way to create amazing customer experiences. When human abilities such as problem-solving, creative thinking and relationship building are enhanced with technology... magical moments happen.

The Team

You'll be joining our dedicated Infrastructure Team, which is responsible for the reliability, scalability, and performance of the cloud-native core infrastructure that serves Glia’s conversational AI. Our team focuses on operational excellence and proactive problem-solving to ensure our systems are always available and performing optimally.

All SREs on the team report to a dedicated Engineering Manager. Our work is driven by Objectives and Key Results, defined quarterly in collaboration with the Director of Engineering. All projects are planned, led, and executed by our engineers. Our SRE team is located primarily in Vancouver and Toronto and works in the Pacific Time zone (PT). We are optimized for remote collaboration and welcome candidates from anywhere in Canada.

The Work

As a Senior Site Reliability Engineer, your primary focus will be on the health and performance of our production services. Responsibilities will include:

  • Defining, measuring, and reporting on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key services.
  • Partnering with development teams to establish error budgets and the operational consequences of their consumption.
  • Writing software to automate production operations, eliminating manual toil and improving system resilience.
  • Leading the incident response process for complex outages, including conducting blameless postmortems to drive systemic improvements.
  • Engineering and improving deployment systems and CI/CD pipelines to increase release velocity while maintaining production stability.
  • Conducting deep dives into system performance, engaging in capacity planning, and performing production readiness reviews.
  • Developing and maintaining operational runbooks and incident response playbooks.
  • Participating in a periodic on-call rotation as an escalation point for critical service interruptions.

Our Tech Stack

  • Infrastructure: AWS, Kubernetes (AWS EKS), Linkerd, EFK
  • Persistence: Amazon Aurora Serverless for Postgres, RabbitMQ
  • Cache: Amazon ElastiCache for Valkey
  • Monitoring & Observability: Datadog, with a focus on dashboards and alerts for system health.
  • CI/CD: GitHub Actions, ArgoCD, Jenkins, and Helm, with a focus on automation and pipeline optimization.
  • Infrastructure as Code: Terraform

Additionally, our Engineering teams use:

  • Backend: Python, Elixir, Node.js, and Ruby
  • Frontend: JavaScript and React.js
  • Native mobile SDKs: Java and Swift

Candidate Requirements

  • 5+ years of relevant experience in Site Reliability Engineering or a closely related discipline (e.g., DevOps, Platform Engineering, Infrastructure).
  • Deep, practical understanding of Site Reliability Engineering (SRE) principles (SLOs, error budgets, toil reduction).
  • Demonstrable experience analyzing and troubleshooting large-scale distributed systems.
  • Expert-level proficiency with AWS and Kubernetes (EKS), particularly in areas of observability, networking, and auto-scaling.
  • Strong software development skills in a language like Python or Go, used to build operational tools, services, or automation.
  • Experience with modern observability platforms (e.g., Datadog, Prometheus) and a deep understanding of metrics, logging, and tracing.
  • Expertise in designing and operating robust CI/CD pipelines for a microservices architecture (e.g., using ArgoCD, GitHub Actions, Helm).
  • A systematic, data-driven approach to problem-solving and root cause analysis.

We are insatiably curious and hungry for knowledge here at Glia. Even if you don’t meet all the requirements exactly, we encourage you to apply as long as you are passionate about mastering your craft and developing your skills.

*Glia is an equal-opportunity employer. Glia does not discriminate against any employee or applicant because of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition (including breastfeeding), or any other basis protected by law.

The Glia Talent Acquisition team uses @glia.com and @gliatalent.com mailboxes for coordinating interviews, providing updates, and sending documents. Our hiring process involves an introduction, practical and team interviews, and a decision and offer. For more information, visit our Recruitment Privacy Notice page or contact our talent team via [email protected]

*Want to know more about working at Glia? Check out Glia's Career FAQs.

Top Skills

Amazon Aurora Serverless
Amazon ElastiCache
Amazon RDS
AWS
EFK
Elixir
Grafana
Helm
Java
JavaScript
Jenkins
Kubernetes
Linkerd
Nginx
Node.js
Postgres
Prometheus
Python
RabbitMQ
React
Ruby
Spinnaker
Swift
Terraform

The Company
HQ: New York, NY
329 Employees
Year Founded: 2012

What We Do

Glia enables companies to deliver an in-person customer experience online. With a single line of code, companies can identify and engage their highest-value web visitors through video, voice, chat, and CoBrowsing to increase online conversions and improve customer support.
