Sr. Site Reliability Engineer - Incident Response (Hybrid)
<b>About HashiCorp</b><p>HashiCorp solves development, operations, and security challenges in infrastructure so organizations can focus on business-critical tasks. We build products to give organizations a consistent way to manage their move to cloud-based IT infrastructures for running their applications. Our products enable companies large and small to mix and match AWS, Microsoft Azure, Google Cloud, and other clouds as well as on-premises environments, easing their ability to deliver new applications.</p><br><p>At HashiCorp, we have used the Tao of HashiCorp as our guiding principles for product development and operate according to a strong set of company principles for how we interact with each other. We value top-notch collaboration and communication skills, both among internal teams and in how we interact with our users.</p><br><b>About this Role</b><p>As a Senior Site Reliability Engineer specializing in Incident Response, you will play a pivotal role in enhancing our operational resilience and maintaining the reliability of our cloud-based products. 
With a focus on rapid identification, response, and resolution of incidents, you will be at the forefront of ensuring high availability and performance across HashiCorp’s offerings.</p><br><b>In this role, you can expect to:</b><ul><li>Lead and refine our incident response strategy, ensuring a rapid and effective response to operational disruptions.</li><li>Implement best practices for system reliability, including proactive identification of potential failure points and the development of automated mitigations.</li><li>Work closely with development, operations, and security teams to coordinate incident response efforts and post-incident analyses.</li><li>Analyze incident trends and root causes to drive continuous improvements in system reliability and response processes.</li><li>Develop and maintain tools for incident detection, analysis, and resolution, automating responses where possible to minimize human intervention.</li><li>Create comprehensive incident response documentation and conduct training sessions to prepare all relevant teams for effective incident handling.</li><li>Participate in and occasionally lead the on-call rotation, serving as a key decision-maker in the management of live incidents.</li></ul><p><br/></p><b>You may be a good fit for our team if you have:</b><ul><li>6+ years of experience in site reliability engineering, systems administration, or software engineering, with a significant focus on incident response and operational reliability.</li><li>A proven track record of managing and resolving incidents in cloud-based environments, with expertise in major public cloud platforms (AWS, GCP, Azure).</li><li>A strong understanding of monitoring and alerting systems, with the ability to develop metrics and alarms that accurately reflect system health and operational risks.</li><li>Experience with incident management tools and practices, including post-mortem analysis and root cause investigation.</li><li>Excellent communication skills and the ability to work 
effectively across multiple teams and with stakeholders at all levels.</li><li>Familiarity with HashiCorp’s product suite and infrastructure automation tools is a plus.</li></ul>