Associate Director, Software Engineering - Observability at Chewy (Minneapolis, MN)
Our Opportunity:
Chewy is hiring an Associate Director, Software Engineering - Observability in Boston or Minneapolis. The primary focus of this team is ensuring Chewy.com stays up. This team is responsible for systemic risk identification, handling the lifecycle of an incident, chaos engineering and resiliency consulting. In this role, you will be responsible for improving availability and resiliency of Chewy applications and services.
This is a high-profile position with exposure across the entire business, influencing the vision and implementation of architecture, design, and features for this new and growing line of business. As part of a dynamic team, this role offers a tremendous opportunity for professional growth at the leading online pet retailer in the US. Reporting directly to the Director of Software Engineering, you will act as an individual contributor while leading a team of strong engineers. The role will have high visibility across Chewy's technology and business organizations.
What You'll Do:
- Engage cross-functionally with other engineering teams, managing issues when they happen and promoting reliability and resilience practices throughout the organization
- Manage and lead one or more teams of 5-10 experienced Software Engineers who have in-depth operational knowledge of the entire Chewy stack and who act as first responders and leaders during an incident
- Design, develop and implement chaos engineering practices
- Recruit and hire high-performing engineers and mentor the growth of the existing team
- Drive incidents to resolution by collaborating with multiple engineering teams
- Improve our incident management lifecycle to identify, mitigate, and learn from reliability risks
- Ensure timely and consistent communication to facilitate a clear understanding of ongoing projects and their prioritization within the organization
- Standardize the RCA process and hold teams accountable for delivering preventative and corrective actions
- Establish strong working relationships at all organizational levels and across functional teams
- Identify and manage priorities within the context of overall corporate objectives
- Create a dynamic, collaborative, and fun team environment
What You'll Need:
- Bachelor’s in Computer Science, Electrical/Computer Systems Engineering or a similar math or engineering discipline.
- 10+ years in Performance Engineering, Observability, Resiliency and Chaos Engineering of large-scale, latency-sensitive enterprise applications
- 7+ years of experience in Engineering Management
- Expertise in ITSM processes and tools such as JIRA and PagerDuty, and experience with ServiceNow ITOM and ITSM modules that focus on Incident, Problem and Change Management
- Expertise in developing executive-friendly dashboards based on observable metrics in IT systems (KPIs, Incident Trends, MTTR, MTTD, etc.)
- Hands-on working experience with standard DevOps tools, build automation tools (Jenkins), issue tracking tools and source control systems (GitHub)
- Experience working with AWS offerings such as ECS, EC2, Lambda, Fargate, S3, DynamoDB, and API Gateway
- Solid understanding of Docker & Kubernetes or similar container-based architectures
- Excellent understanding of micro-services architecture, design patterns, and standard methodologies with an eye towards scale, automation, resiliency, and high availability
- Experience with telemetry tooling and observability systems such as Prometheus, Splunk, DataDog, and Grafana
- Quick learner who is eager to explore and learn new tools and frameworks
- Strong leadership, interpersonal, influencing, collaboration and negotiation skills
- Ability to provide thought leadership, create roadmaps and strategies for practice improvement
- Must be more than proficient at both written and spoken English
- Position may require travel
Bonus:
- Prior roles in Ecommerce or in technology companies
- Advanced Degree
- Experience working across fully automated stacks in a CI/CD ecosystem
- Knowledge of high availability proxy and load balancers (HAProxy)
- Knowledge of Content Delivery Networks (Akamai)
- Experience operating distributed streaming services such as Apache Kafka
If you have a disability under the Americans with Disabilities Act or similar law, and you need an accommodation during the application process or to perform these job requirements, or if you need a religious accommodation, please contact [email protected].
If you have a question regarding your application, please contact [email protected].
Chewy is committed to equal opportunity. We value and embrace diversity and inclusion of all Team Members.
To access Chewy’s Privacy Policy, which describes the information we collect from job applicants and how we use it, please click here: https://www.chewy.com/app/content/privacy.