Senior Site Reliability Engineer - San Francisco, CA, USA

Posted 10 Days Ago
San Francisco, CA, USA
In-Office
Senior level
Fintech • Payments • Software • Financial Services
The Role
Senior SRE responsible for ensuring platform scalability, reliability, and runtime efficiency on AWS. Own CI/CD and GitHub repo workflows, lead incident response and post-mortems, implement observability/monitoring and logging, and collaborate cross-border using bilingual Mandarin and English.
Summary Generated by Built In

Senior Site Reliability Engineer (Payments Infrastructure)
Kody is seeking a Senior Site Reliability Engineer to ensure the reliability, availability, scalability, and operational excellence of our global payment platform. You will own production observability, incident response, service-level management, and cloud infrastructure reliability across mission-critical payment processing systems operating in Europe, Asia, and North America.
Responsibilities

  • Participate in a follow-the-sun production on-call rotation as a primary incident responder.
  • Diagnose, triage, mitigate, and coordinate resolution of production incidents across payment services, Kubernetes platforms, databases, messaging systems, and cloud infrastructure.
  • Define and maintain SLOs, SLIs, error budgets, alerting standards, and operational readiness processes.
  • Drive reliability improvements through automation, observability, capacity planning, performance optimization, and post-incident reviews.
  • Partner with engineering teams to improve resilience, security, and operational maturity in PCI-DSS-regulated environments.
  • Lead incident management during SEV1/SEV2 events and improve response effectiveness and MTTR.

Requirements
  • 5+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Cloud Infrastructure roles supporting mission-critical production systems.
  • Strong hands-on experience with AWS, Kubernetes (EKS), Terraform, PostgreSQL, Redis, Kafka, Linux, networking, and modern observability platforms.
  • Deep understanding of distributed systems, cloud-native architectures, high availability, disaster recovery, capacity planning, and performance optimization.
  • Proven experience operating payment, banking, fintech, or other highly regulated systems with stringent security, compliance, and uptime requirements.
  • Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, incident management, alert governance, and operational excellence.

Leadership & Operational Excellence

  • Demonstrates strong ownership and accountability, taking end-to-end responsibility for service reliability and customer impact.
  • Possesses a strong sense of urgency during production incidents while maintaining sound judgment and structured decision-making under pressure.
  • Applies a systematic and methodical approach to troubleshooting, root-cause analysis, and incident resolution in complex distributed environments.
  • Brings a data-driven mindset, using metrics, telemetry, trends, and service-level indicators to prioritize reliability investments and operational improvements.
  • Continuously drives engineering excellence through iterative improvement, automation, standardization, and elimination of operational toil.
  • Proven ability to lead cross-functional incident response efforts, coordinate stakeholders, and communicate effectively during high-severity production events.
  • Champions a culture of operational readiness, continuous learning, post-incident improvement, and blameless accountability.
  • Demonstrates strong mentoring and technical leadership skills, influencing engineering teams to build reliable, scalable, and resilient systems by design.

Benefits
  • Competitive packages aligned with California market standards
  • Lead a dynamic and innovative team in a rapidly growing company
  • Collaborative, inclusive environment where your contributions are recognized and valued

Skills Required

  • Deep practical experience managing application deployment and runtime environments on AWS
  • Master-level knowledge of advanced Git workflows and GitHub Actions
  • Ownership of CI/CD pipelines and repository management (GitHub)
  • Strong proficiency with monitoring tools, log management, alerting, and observability
  • Proficient scripting skills for triage and troubleshooting
  • Ability to lead incident management and conduct blameless post-mortems
  • Absolute fluency in Mandarin and English (verbal and written)
  • Based in California (San Jose)
  • High ownership, transparency, and resilience under pressure

The Company
HQ: London
53 Employees
Year Founded: 2018

What We Do

Kody is on a mission to make in-person payment acceptance easy. Today, paying in person presents common problems for businesses, such as high costs, long queues, and a limited choice of payment methods. Kody fully integrates the payment ecosystem so that businesses can offer customers more control over their payment choices, making transactions quicker and simpler. Kody was founded by a small group of final-year high school students and launched in July 2022. Its 24-year-old founder and CEO, Yoyo Chang, studied at the University of Cambridge & York whilst raising US$10M. Today, Kody's platform is growing to connect millions of end-users with venues all over the world.
