Site Reliability Engineer II

Reposted 13 Days Ago
Be an Early Applicant
Trivandrum, Thiruvananthapuram, Kerala
In-Office
Senior level
Fintech • Payments • Software
The Role
Lead and manage complex technical issues in Azure cloud, optimize infrastructure, mentor junior engineers, and drive operational excellence initiatives.
Summary Generated by Built In

The world’s top banks use Zafin’s integrated platform to drive transformative customer value. Powered by an innovative AI-powered architecture, Zafin’s platform seamlessly unifies data from across the enterprise to accelerate product and pricing innovation, automate deal management and billing, and create personalized customer offerings that drive expansion and loyalty. 

Zafin empowers banks to drive sustainable growth, strengthen their market position, and define the future of banking centered around customer value. 

Senior Site Reliability Engineer (SRE II) 

Own availability, latency, performance, and efficiency for Zafin’s SaaS on Azure. You’ll define and enforce reliability standards, lead high-impact projects, mentor engineers, and eliminate toil at scale. Reports to the Director of SRE.

What you’ll do

  • SLIs/SLOs & contracts: Define customer-centric SLIs/SLOs for Tier-0/Tier-1 services. Publish, review quarterly, and align teams to them.
  • Error budgeting (policy & tooling):
    • Run the error-budget policy with multi-window, multi-burn-rate alerts; clear runbooks and paging thresholds.
    • Gate changes by budget status (freeze/relax rules) wired into CI/CD.
    • Maintain SLO/EB dashboards (Azure Monitor, Grafana/Prometheus, App Insights). Run weekly SLO reviews with engineering/product.
    • Drive roadmap tradeoffs when budgets are at risk; land reliability epics.
  • Incidents without drama: Lead SEV1/SEV2, own comms, run blameless postmortems, and make corrective actions stick.
  • Engineer reliability in: Multi-AZ/region patterns (active-active/DR), PDBs/Pod Topology Spread, HPA/VPA/KEDA, resilient rollout/rollback.
  • AKS at scale: Harden clusters (network, identity, policy), optimize node/pod density, ingress (AGIC/Nginx); mesh optional.
  • Observability that works: Metrics/traces/logs with Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana, OpenTelemetry. Alert on symptoms, not noise.
  • IaC & policy: Terraform/Bicep modules, GitOps (Flux/Argo), policy-as-code (Azure Policy/OPA Gatekeeper). No snowflakes.
  • CI/CD reliability: Azure DevOps/GitHub Actions with canary/blue-green, progressive delivery, auto-rollback, Key Vault-backed secrets.
  • Capacity & performance: Load testing, right-sizing, autoscaling; partner with FinOps to reduce spend without hurting SLOs.
  • DR you can trust: Define RTO/RPO, test backups/restore, run game days/chaos drills, validate ASR and multi-region failover.
  • Secure by default: Entra ID (Azure AD), managed identities, Key Vault rotation, VNets/NSGs/Private Link, shift-left checks in CI.
  • Reduce toil: Automate recurring ops, build self-service runbooks/chatops, publish golden paths for product teams.
  • Customer escalations: Be the technical owner on calls; communicate tradeoffs and recovery plans with authority.
  • Document to scale: Architectures, runbooks, postmortems, SLIs/SLOs—kept current and discoverable.
  • (If applicable) Streaming/ETL reliability: Apply SRE practices (SLOs, backpressure, idempotency, replay) to NiFi/Flink/Kafka/Redpanda data flows.

How we’ll measure success (first 12 months)

  • 100% of Tier-1 services have published SLOs with error budgets and burn-rate alerts in place.
  • Error-budget depletion before window end ↓ 40%; fast-burn alerts route within 5 min and include a rollback path.
  • MTTR ↓ 30% and incident recurrence ↓ 40% on top offenders.
  • Alert quality: ≥80% actionable; pages per on-call shift stable or declining.
  • Change failure rate <15% with automated rollback always available.
  • Toil ↓ 30% via automation; ≥50% time on engineering vs. reactive ops.
  • DR drills pass agreed RTO/RPO across Tier-1 services.

Minimum qualifications

  • Bachelor’s in CS/Engineering (or equivalent experience).
  • 12+ years in production ops/platform/SRE, including 5+ years on Azure.
  • PostgreSQL (must-have): Deep operational expertise incl. HA/DR, logical/physical replication, performance tuning (indexes/EXPLAIN/ANALYZE, pg_stat_statements), autovacuum strategy, partitioning, backup/restore testing, and connection pooling (pgBouncer). Prefer experience with Azure Database for PostgreSQL – Flexible Server.
  • Azure core: AKS (must-have); Front Door/App Gateway, API Management, VNets/NSGs/Private Link, Storage, Key Vault, Redis, Service Bus/Event Hubs.
  • Observability: Azure Monitor/App Insights, Log Analytics, Prometheus/Grafana; SLO design and error-budget operations.
  • IaC/automation: Terraform and/or Bicep; PowerShell and Python; GitOps (Flux/Argo). Pipelines in Azure DevOps or GitHub Actions.
  • Proven incident leadership at scale, blameless postmortems, and SLO/error-budget governance with change gating.
  • Mentorship and crisp written/verbal communication.

Preferred (nice to have)

  • Apache NiFi, Apache Flink, Apache Kafka or Redpanda (self-managed on AKS or managed equivalents); schema management, exactly-once semantics, backpressure, dead-letter/replay patterns.
  • Azure Solutions Architect Expert, CKA/CKAD.
  • ITSM (ServiceNow), on-call tooling (PagerDuty/Opsgenie).
  • Compliance/SecOps (SOC 2, ISO 27001), policy-as-code, workload identity.
  • OpenTelemetry, eBPF tooling, or service mesh.
  • Multi-tenant SaaS and cost optimization at scale.

How we work

  • Follow-the-sun on-call for Tier-1 services.
  • Data over opinions. Postmortems are blameless; fixes are mandatory.
  • Pragmatic automation: if it’s run twice, script it. If it’s run weekly, productize it.

What’s in it for you

Joining our team means being part of a culture that values diversity, teamwork, and high-quality work. We offer competitive salaries, annual bonus potential, generous paid time off, paid volunteering days, wellness benefits, and robust opportunities for professional growth and career advancement. Want to learn more about what you can look forward to during your career with us? Visit our careers site and our openings: zafin.com/careers

Zafin welcomes and encourages applications from people with disabilities. Accommodations are available on request for candidates taking part in all aspects of the selection process. 

Zafin is committed to protecting the privacy and security of the personal information collected from all applicants throughout the recruitment process. The methods by which Zafin contains uses, stores, handles, retains, or discloses applicant information can be accessed by reviewing Zafin’s privacy policy at https://zafin.com/privacy-notice/. By submitting a job application, you confirm that you agree to the processing of your personal data by Zafin described in the candidate privacy notice.

Top Skills

Aks
Azure Devops
Azure Insights
Grafana
Azure
Openshift
Postgres
Powershell
Python
Am I A Good Fit?
beta
Get Personalized Job Insights.
Our AI-powered fit analysis compares your resume with a job listing so you know if your skills & experience align.

The Company
HQ: Vancouver, British Columbia
450 Employees
Year Founded: 2002

What We Do

Zafin, the global leader in SaaS cloud-native product and pricing solutions, is a trusted partner to the world’s most customer-centric financial institutions. Zafin’s product and pricing platform empowers banks of all sizes to center their customers, grow relationships and drive revenues.

The Zafin platform separates product and pricing from core processing to accelerate progressive modernization, enable digital transformation and deliver personalization at the relationship level.

A typical Zafin installation integrates easily with most back-end systems and customer-facing channels to increase product and pricing efficiency and agility, drive interest and non-interest income, and deliver a positive ROI—often in one year or less. 

Similar Jobs

CrowdStrike Logo CrowdStrike

Manager, Threat Research (Remote, IND)

Cloud • Computer Vision • Information Technology • Sales • Security • Cybersecurity
Remote or Hybrid
19 Locations
10000 Employees
12-12 Annually

CrowdStrike Logo CrowdStrike

Engineer II - Site Reliability, 2PM-11PM IST (Remote, IND)

Cloud • Computer Vision • Information Technology • Sales • Security • Cybersecurity
Remote or Hybrid
16 Locations
10000 Employees

CrowdStrike Logo CrowdStrike

Senior Engineer

Cloud • Computer Vision • Information Technology • Sales • Security • Cybersecurity
Remote or Hybrid
17 Locations
10000 Employees

CrowdStrike Logo CrowdStrike

Engineering Manager

Cloud • Computer Vision • Information Technology • Sales • Security • Cybersecurity
Remote or Hybrid
18 Locations
10000 Employees

Similar Companies Hiring

PRIMA Thumbnail
Travel • Software • Marketing Tech • Hospitality • eCommerce
US
15 Employees
Rain Thumbnail
Web3 • Payments • Infrastructure as a Service (IaaS) • Fintech • Financial Services • Cryptocurrency • Blockchain
New York, NY
40 Employees
Scotch Thumbnail
Software • Retail • Payments • Fintech • eCommerce • Artificial Intelligence • Analytics
US
25 Employees

Sign up now Access later

Create Free Account

Please log in or sign up to report this job.

Create Free Account