- Lead SRE practices for reliability, scaling, and performance of production systems
- Lead on-call operations and incident response, ensuring fast resolution and minimal customer impact
- Perform deep debugging of production issues across infrastructure, services, and databases
- Design and automate self-healing, scalable infrastructure
- Architect, implement, and mature advanced observability (metrics, logs, distributed tracing, SLIs/SLOs, APM) to detect, debug, and prevent outages
- Support CI/CD and infrastructure automation (Terraform, Kubernetes, pipelines) as part of DevOps responsibilities (20% of the role)
- Mentor junior engineers in incident management and DevOps best practices
- Partner with engineering teams on resilient architecture reviews and reliability improvements
- Drive adoption of new tools and best practices to enhance infrastructure reliability
- Conduct blameless postmortems, improve incident playbooks, and build a strong prevention culture
Skills Required
- 5–8 years of experience in SRE / Production Engineering, with some DevOps exposure
- Strong expertise in incident management, debugging distributed systems, and on-call operations
- Strong background in observability platforms such as Prometheus, Grafana, Datadog, OpenTelemetry, or similar
- Deep knowledge of cloud infrastructure (AWS/GCP) including networking, scaling, and HA/DR setups
- Hands-on experience with Kubernetes, Terraform, and CI/CD pipelines
- Experience with incident frameworks, blameless postmortems, chaos engineering, and resiliency testing
- Ability to balance short-term firefighting with long-term reliability engineering
- Strong scripting skills (Shell, Python, or Go preferred)
What We Do
GoKwik is a data- and technology-led enabler building a full-stack solution suite that helps eCommerce and D2C brands unlock business growth. On a mission to democratise the shopping experience, GoKwik enables eCommerce brands to deliver a superlative customer experience across the shopping funnel, thereby boosting conversion rates and revenue growth. It also addresses other critical industry pain points such as COD RTO (Return to Origin), helping brands manage the RTO problem while still offering COD as a payment channel. With the recent addition of its third product, KwikChat, GoKwik tackles low ROI on marketing campaigns through 30+ WhatsApp use cases such as abandoned cart recovery, click-to-WhatsApp ad campaigns, and headless checkout.

1 in 3 shoppers already shops on the GoKwik network, which has helped 500+ brands scale their businesses with higher GMV realisation and profit margins. The company is led by Chirag Taneja (Co-Founder and Chief Executive Officer), Vivek Bajpai (Co-Founder and Chief Technology Officer), and Ankush Talwar (Co-Founder and Chief Data Scientist), and is backed by investors such as Sequoia Capital, Matrix Partners India, RTP Global, and Think Investments. GoKwik's team has deep eCommerce expertise, with members who previously worked at Flipkart, Razorpay, Swiggy, Myntra, Nykaa, and more.