Manager, Reliability Engineering (K9)
Our Opportunity:
Chewy is looking for a Manager, Reliability Engineering for a new K9 team that can be based in Boston, MA; Seattle, WA; or Dania Beach, FL. As the Manager, Reliability Engineering (K9) at Chewy, you’ll lead a team of Reliability Engineers who are focused on ensuring the customer experience is always optimal, with recovery and reliability of Chewy.com as the top priority. This K9 team will work on behalf of customers, ensuring that recovery from any site degradation is fast and seamless.
What you'll do:
- Manage a team of engineers to create the foundations of reliability for Chewy.com
- Lead and develop your direct reports through career development plans, and promote a positive team culture.
- Provide a reliability framework that can be measured and reported to our customers, with the proper processes in place to scale.
- Lead major P1 incidents as the technical lead, restoring the customer experience after any degradation.
- Provide a framework in which manual processes and interventions are automated and optimized.
- Provide technical leadership to your team and to the broader Chewy organization and tech teams to identify and strategize how reliability can be structured throughout the entire site.
- Collect and report on Key Performance Indicators that showcase the overall Reliability of Chewy.com
- Partner with other groups in IT Operations, including Platform, Performance Engineering, and Operational leaders to build, improve, and solidify the K9 CORE objectives.
- Establish strong working relationships at all organizational levels and across functional teams and organizational boundaries
- Own, configure, and define observability tools with alerts, dashboards, and reports that showcase the overall health of Chewy.com.
- Work with others in IT Management to drive best practices in platform development, deployment, and management
- Position may require some travel (20%)
What you'll need:
- Minimum of 10 years of combined experience in Site Reliability or an equivalent DevOps field.
- Proven ability to lead teams across multiple locations/geographies
- Knowledge of Service Level Objectives and of measuring service reliability.
- Experience with cloud services such as AWS and their supporting technologies
- Deep knowledge of cloud technologies, networking, and security
- Experience with monitoring tools
- Experience building systems with microservices and/or deep knowledge of SOA
- Ability to debate openly and honestly
- Experience leading large-scale major incidents to resolution
- Highly motivated to research and self-study to keep technical, business, and leadership skills relevant in a highly complex environment
- Have a metrics and automation mindset with a drive to improve and raise the bar of excellence.
- Excellent verbal and written communication skills with great attention to detail and accuracy
- Bachelor’s Degree (MIS or CS preferred) or equivalent work experience
Bonus:
- Experience working in an Agile/Scrum environment
Chewy is committed to equal opportunity. We value and embrace diversity and inclusion of all Team Members.
If you have a disability under the Americans with Disabilities Act or similar law, or you require a religious accommodation, and you wish to discuss potential accommodations related to applying for employment at Chewy, please contact HR at chewy dot com
To access Chewy’s Privacy Policy, which describes the information we collect from job applicants and how we use it, please visit https://www.chewy.com/app/content/privacy