As an Incident Lead, you will be responsible for managing day-to-day operations, ensuring platform reliability, and overseeing incident management and resolution processes. You will collaborate closely with engineering, product, and infrastructure teams to keep systems running smoothly and provide a high level of operational support to meet business goals.
Key Skills: Docker, Kubernetes, CI/CD, Azure DevOps or AWS CodePipeline
Job responsibilities
- Lead and mentor the L1/L2 operations teams, ensuring a high level of technical support and service quality.
- Lead incident resolution processes for L1 and L2 operations, ensuring timely and effective troubleshooting of technical issues.
- Define and implement procedures for handling escalations and high-priority incidents.
- Ensure root cause analysis is conducted for major incidents and follow up on remediation actions.
- Develop and enforce Service Level Agreements (SLAs) and Key Performance Indicators (KPIs) for platform performance and support operations.
- Monitor adherence to SLAs and manage escalations to maintain customer satisfaction.
- Oversee the platform's operational stability and performance, ensuring high availability and scalability.
- Monitor and manage platform performance metrics, proactively addressing any potential issues.
- Ensure comprehensive documentation of operational procedures, troubleshooting guides, and runbooks for the L1/L2 support teams.
- Create detailed operational reports and dashboards for tracking system health and team performance.
Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 10+ years of experience in IT operations, with at least 3 years in a leadership role managing platform support and L1/L2 teams.
- Strong understanding of IT infrastructure, cloud platforms, and operational best practices.
- Strong experience with Docker, Kubernetes, and Helm, along with experience in at least one programming language (Java preferred), to support platform KLO and monitoring.
- ITIL certification is highly preferred.
- Extensive experience implementing and managing CI/CD pipelines using tools like Jenkins, GitLab CI/CD, GitHub Actions, Azure DevOps, or AWS CodePipeline.
- Tools: Any IDE, Git, Jenkins
- Proven experience with incident management, service management, and driving process improvements.
- Expertise in monitoring tools, automation frameworks, and platform performance optimization.
- Excellent leadership, communication, and problem-solving skills.
What We Do
We empower and transform our customers' businesses through the use of digital technologies.
Our core focus areas are Big Data, Cloud, Analytics (AI, ML), Blockchain, Automation, and Mobility.
We help several Fortune 1000 clients in the USA, Canada, the UK, and India navigate digital transformation.
NucleusTeq is a software services, solutions, and products company that empowers and transforms customers' businesses through digital technologies such as Big Data, Analytics (AI, ML), Cloud, Enterprise Automation, Blockchain, Mobility, CRM, and ERP.