Senior Site Reliability Engineer

Hiring Remotely in USA
Remote
Senior level
Software
The Role
As a Senior Site Reliability Engineer, you will enhance system reliability and performance, manage core infrastructure, automate operations, and lead incident responses at Rocket.Chat.

Job Title: Senior Site Reliability Engineer

Level: Senior

Working Hours: Full Time (40h/Week)

Contract: Contractor

Location: Remote


Your Team 👥

You will report to our Head of Infrastructure and Deployment and join the Engineering team. The Site Reliability Engineering (SRE) team is dedicated to engineering, maintaining, and continuously improving the reliability, scalability, and performance of all critical Rocket.Chat systems and services. Our mission is to ensure an exceptional and uninterrupted experience for our users and customers, bridging the gap between development and operations to deliver value efficiently and automatically. On TheOrg you can view the complete structure of our organisation, including information about every team member, hiring managers and the size of each department.


Your Responsibilities ✏️

As a Senior Site Reliability Engineer, you will play a critical role in enhancing the reliability, performance, and scalability of Rocket.Chat's entire ecosystem. You will apply software engineering principles to infrastructure and operations, proactively preventing outages, optimizing system efficiency, and ensuring that new features and services are delivered with the highest standards of stability. Your expertise will be instrumental in delivering exceptional user experiences across our core platform, internal infrastructure, and customer-facing services.


Mandatory Hard Skills 🎯

  • Strong background in software engineering with expertise in large-scale distributed systems.
  • Expertise in Kubernetes, including operator development, and cloud platforms (e.g., AWS, GCP, Azure, OVH).
  • Proficiency in programming/scripting languages such as Go, Python, or Bash for tooling and operator development.
  • Deep, hands-on experience with monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, Loki).
  • Experience with Infrastructure as Code (IaC) tools like Terraform, Pulumi, or Ansible, and with CI/CD pipelines using tools like ArgoCD.
  • Solid understanding of networking fundamentals (TCP/IP, DNS, routing) and security principles.
  • Familiarity with database technologies such as MongoDB or Redis.

Desirable Hard Skills 💕 

  • Practical experience with chaos engineering principles and tools.
  • Experience with disaster recovery planning, testing, and implementation.
  • Familiarity with agile management tools such as Jira.

Soft Skills ✨

  • Proactive Mindset: Anticipate and address potential issues before they impact users.
  • Collaboration: Work seamlessly with other teams, sharing knowledge and expertise to drive reliability.
  • Problem-Solving: Strong troubleshooting and analytical skills to identify the root cause of complex issues across diverse technical stacks.
  • Leadership: Guide and inspire team members, especially during incidents, and effectively communicate with both technical and non-technical stakeholders.
  • Data-Driven Decisions: Base decisions on metrics and data to drive improvements.
  • Passion: Show genuine enthusiasm for what you do and how it contributes to our company's mission.
  • Dream: Proactively seek out opportunities and challenges to achieve extraordinary results. If you're someone who takes initiative and is always striving to improve, you'll fit right in.
  • Own: Take ownership of your work, set high standards for yourself, and be accountable for outcomes, demonstrating a strong sense of responsibility and commitment. Take full responsibility for the reliability and performance of all Rocket.Chat services and infrastructure.
  • Trust: Recognize the importance of trust and support, and actively work towards a collaborative and inclusive workplace.
  • Share: Communicate openly and transparently to ensure clarity and honesty in interactions.

What You'll Do 🖥️

  • Engineer & Operate Deployment & Platform Services: Design, develop, and maintain the Kubernetes Operators at the core of our managed hosting offerings, ensuring their reliability, scalability, and robust error handling.
  • Manage & Optimize Core Infrastructure: Oversee the reliability and performance of foundational infrastructure, including multiple Kubernetes clusters and critical services like ArgoCD, Traefik, and our monitoring stack.
  • Ensure Service Reliability & Uptime: Define, monitor, and enforce SLOs for all critical services, manage error budgets, and implement robust monitoring, alerting, and logging solutions.
  • Automate Operations & Reduce Toil: Develop and maintain automation frameworks for deployment, configuration, and operational tasks, building internal tools to streamline SRE workflows.
  • Lead Incident Management & On-Call Response: Act as a primary responder for critical alerts, lead blameless post-mortems, and continuously improve runbook documentation and disaster recovery plans.
  • Foster Cross-Functional Collaboration: Engage early in the product lifecycle to ensure reliability is built in, and collaborate with Engineering, Security, and QA to integrate reliability best practices.
  • Implement Advanced Reliability Practices: Conduct proactive load testing, performance analysis, and chaos engineering experiments to identify system weaknesses and improve fault tolerance.

Benefits ✨

  • Fully Remote & Flexible Working Hours
  • Flexible Paid Time Off, Holidays and Vacation
  • Company Laptop
  • Remote Benefit
  • iTalki, Courses and Books 
  • Stock Options
  • Multicultural Environment
  • Vibrant Company Culture 

Check out our handbook to dive into each of our awesome benefits! At Rocket.Chat, we have tailored base pay ranges according to work locations. This approach ensures that we can competitively and consistently compensate our employees across different geographic markets.


  • While we define an initial seniority level and budget for each role, this can be adjusted during the hiring process. The selection process itself — including interviews and assessments — helps us better understand where the candidate fits within our career framework and which grade they should be positioned in.
  • To ensure fairness and consistency, all applications are accepted exclusively via our Careers site. Submissions through other channels will not be taken into consideration.

About Rocket.Chat 🚀

Rocket.Chat is the world's largest open-source communications platform. Built for organizations needing more control over their communications, Rocket.Chat Secure CommsOS™ is a communication platform that unifies messaging, voice, video, AI, and mission-critical applications, ensuring uncompromising security, compliance, and operational efficiency for governments, defense, and critical infrastructure organizations operating in highly regulated environments.


Tens of millions of users in over 150 countries and organizations such as Deutsche Bahn, the U.S. Navy, and Credit Suisse trust Rocket.Chat every day to keep their communications completely private and secure. At Rocket.Chat, we believe in reconnecting the world, one conversation at a time!


See yourself in that? Apply now! Check out our handbook for more information about our rocket.

If you're interested in keeping up with new roles at Rocket.Chat, you can now set up custom job alerts. Just click the link, pick the types of roles you want to hear about, and get notified whenever there’s a match.

Top Skills

Ansible
ArgoCD
AWS
Azure
Bash
GCP
Go
Grafana
Kubernetes
Loki
MongoDB
Prometheus
Pulumi
Python
Redis
Terraform

The Company
HQ: Wilmington, Delaware
179 Employees
Year Founded: 2015

What We Do

Let Every Conversation Flow — Without Compromise.

Rocket.Chat is built for organizations that need more control over their communications. It enables collaboration between colleagues, partners, customers, communities, and even platforms without compromising on data ownership, customizations, or integrations.

