Director of Site Reliability
We are a full-stack data science company and a wholly owned subsidiary of The Kroger Company. We own 10 petabytes of data and collect 35+ terabytes of new data each week, sourced from 62 million households. As a member of our engineering team, you will use a variety of cutting-edge technologies to develop applications that turn our data into actionable insights used to personalize the customer experience for shoppers at Kroger. We follow an agile development methodology, bringing everyone into the planning process to build scalable enterprise applications.
What you’ll do
As a Director of Site Reliability, you will manage a team of Site Reliability Engineers who solve complex problems in infrastructure, security, automation, and monitoring, with the goal of preventing problems before they occur. You will be responsible for delivering the platform stack to its SLOs (or other defined success criteria), with a focus on system health and availability, security, resiliency, scale, and performance. You will also partner with development teams to define and implement improvements.
Responsibilities
- Manage a team that designs, develops, troubleshoots, and debugs software for databases, applications, tools, networks, etc.
- Apply knowledge of software architecture to manage software development tasks associated with developing, debugging or designing software applications, operating systems and databases.
- Build enhancements within an existing software architecture and suggest improvements to the architecture.
- Ensure appropriate operational planning is effectively executed.
- Demonstrate leadership and people management skills.
- Apply strong communication and analytical skills and a thorough understanding of product development.
- Collaborate on architectural design reviews and changes.
- Own and improve metrics, KPIs, SLOs and visualizations for systems.
- Act as an escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs).
- Build and maintain robust, actionable alerting and monitoring systems and workflows.
- Influence across boundaries and at all levels of the organization.
- Create a fun, fast-paced, motivating, and rewarding environment for your teams and the organization.
- Attract, build, and retain highly engaged and capable development teams that can deliver on our technology and business strategies.
Requirements
Bachelor’s degree, typically in Computer Science, Management Information Systems, Mathematics, Business Analytics, or another STEM field, with 8+ years of progressive experience. Additional requirements include:
- 3+ years of experience building and managing teams
- 2+ years of experience in site reliability engineering, DevOps, or a related operations role
- Proficiency in data collection and display toolsets (e.g., ELK, Prometheus)
- Ability to facilitate service capacity planning and demand forecasting, software performance analysis, and system tuning
- Clear understanding of automation and orchestration principles
- Innate drive to improve existing systems and processes
- A strong focus on, and understanding of, continuous improvement