Lead Site Reliability Engineer

Pittsburgh, PA, USA
In-Office
146K-162K Annually
Senior level
Financial Services
The Role
The Lead Site Reliability Engineer will establish the SRE operating model, implement AI-enabled reliability use cases, manage reliability metrics, and oversee operational readiness while collaborating with teams and mentoring engineers.

Become an everyday champion — and build a career where your impact fuels financial progress.


What We Do

CardWorks Financial Group is a diversified financial services platform building ethical solutions across credit, lending, and the full customer lifecycle. Through our family of companies, CardWorks Financial Group tackles the complex challenges that larger financial institutions leave behind. We’re embedded throughout the credit card ecosystem as a lender, servicer, and merchant acquirer.


Who We Are

  • Merrick Bank: The bank that builds
  • CardWorks Servicing: One partner, total performance
  • Carson Smithfield: Resolution with respect

With nearly 40 years of operating history, our track record is solid: disciplined in downturns and built to accelerate in recovery. The CardWorks Financial Group companies take a precise approach in complex markets, as a top three non-prime focused general purpose card issuer and a top fifteen U.S. merchant acquirer.


Our team tackles the industry’s most complex credit and payment challenges. And we believe that excellent work starts with a team that feels supported, respected, and empowered to grow.

CardWorks Servicing, LLC provides end-to-end operational servicing functions for credit cards, secured cards, and installment loans. We service consumer and small business loans across the credit spectrum and offer backup servicing and due diligence services to capital providers and trustees.


Founded in 1997, Merrick Bank is an FDIC®-insured financial institution headquartered in South Jordan, Utah, with over $10 billion in assets. A wholly owned subsidiary of CardWorks Financial Group, Merrick Bank serves roughly five million cardmembers and more than 100,000 merchant customers, offering credit cards, recreational loans, deposit accounts, merchant services and bank sponsorships to consumers and businesses.

Carson Smithfield, LLC provides a variety of post-charge-off debt recovery services, including digital self-service, IVR, live agent, and external agency management.

Essential Functions:

  • Establish the SRE operating model (service onboarding, engagement model, governance, reliability reviews, production readiness standards, and quarterly planning) and ensure it is adopted across teams.

  • Identify, pilot, and operationalize AI-enabled reliability use cases (e.g., alert noise reduction, incident summarization, correlation/root-cause hypothesis generation, runbook assistance, and auto-remediation with human approval) with appropriate guardrails.

  • Define, implement, and operationalize reliability metrics by establishing and managing SLIs, SLOs, and error budgets to quantify and continuously improve service reliability, supporting engineering and business decisions.

  • Own the centralized SRE service engagement model by defining service tiers, onboarding criteria, reliability standards, and a transparent intake/prioritization process aligned to business criticality.

  • Define and enforce error budget policies (including escalation paths and release risk decisions) in partnership with Product/Engineering, using SLO attainment to guide trade-offs between feature velocity and reliability.

  • Establish and maintain centralized “paved road” reliability standards and assets (instrumentation conventions, golden signals, alerting standards, runbook templates, SLO dashboards) that product teams can adopt with minimal friction.

  • Design the on-call and escalation model for a centralized SRE team (e.g., SRE overlay for major incidents, defined handoffs with service owners, and clear ownership boundaries) to improve response quality without creating single-team dependency.

  • Design and engineer automation and observability solutions by developing tooling, dashboards, and systems to reduce operational toil (measure, report, and drive toil down over time), enhance system visibility, and accelerate delivery.

  • Participate in incident and problem management by serving as incident coordinator for high-severity events, driving cross-functional responses, conducting blameless root cause analysis, running post-incident reviews (postmortems) with clear owners and due dates, and ensuring remedial actions drive reliability improvements.

  • Oversee operational readiness and performance by managing capacity planning, validating disaster recovery, conducting production readiness reviews, and ensuring systems meet availability, scalability, and recovery expectations.

  • Partner with security, risk, and compliance teams to align reliability goals with governance and compliance requirements, ensuring secure, auditable, and well-documented practices.

  • Collaborate across the organization by working closely with end users, product management, development, architecture, and IT Operational teams to embed reliability principles throughout the software development lifecycle, including service onboarding, reliability reviews, and shared SLO ownership.

  • Champion reliability as a core product feature by promoting reliability throughout all phases of development, advocating for continuous improvement, and communicating key metrics and potential customer impact to stakeholders.

  • Train, mentor, and upskill engineering teams by coaching engineers in SRE practices, supporting junior team members, and fostering a culture of shared ownership and accountability for reliability, including influencing teams without direct authority through standards, data, and executive-aligned priorities.

  • Remain current on the latest SRE trends and best practices, including observability, AI-enabled operations (AIOps), and SLO management, and implement these methodologies to effectively support desired business outcomes.

  • Evaluate AI tools for reliability with security/privacy/compliance guardrails (e.g., data handling, prompt/content controls, auditability) and measure impact.

  • Participate in on-call rotations and operational support for SRE-supported systems and products.

Summary of Qualifications:

  • Experience in Site Reliability Engineering with a track record of delivering measurable improvements in uptime, scalability, release stability, and overall reliability in complex enterprise environments.

  • Demonstrated experience standing up or significantly maturing an SRE practice (operating model, SRE/service engagement, production readiness, incident/postmortem program, and reliability roadmap).

  • Hands-on experience applying AI/ML to operations (AIOps) or GenAI in production support workflows, with a focus on measurable outcomes (MTTD/MTTR, alert fatigue reduction, change failure rate) and responsible use controls.

  • Proven ability to establish Service Level Indicators (SLIs) and SLOs in production environments, including hands-on definition and implementation.

  • Demonstrated background in production incident response, leading resolution efforts, conducting blameless post-incident reviews, and implementing actionable remediation strategies.

  • Strong observability and telemetry expertise in designing instrumentation, building actionable dashboards and alerts, and delivering proactive reliability insights using metrics, logs, and traces.

  • Infrastructure engineering experience with strong Infrastructure as Code skills using tools such as Terraform and Ansible.

  • Thorough understanding and practical experience in CI/CD pipeline design, optimization, and troubleshooting using modern tooling and platforms such as Azure DevOps, GitHub Actions, Jenkins, or GitLab CI, with an emphasis on speed, reliability, and security.

  • Practical knowledge of containerization and platform modernization, including architecting and operating containerized workloads with Docker, VMware, and Kubernetes (or comparable orchestration platforms) to modernize legacy applications and improve fault tolerance.

  • Knowledge of emerging reliability practices, including SLO automation platforms, AIOps, or predictive operations to advance proactive reliability management.

  • Preferred certifications include AWS Professional, Terraform, Ansible, Azure DevOps, Octopus Deploy, or other automation-focused credentials that demonstrate continuous technical development.

Education and Experience:

  • Master’s degree in Computer Science, Engineering, or equivalent practical experience designing and operating production systems at scale.

  • 7+ years of experience in Site Reliability Engineering.

Ideally, the qualified candidate will work at one of the following locations: Woodbury, NY; Pittsburgh, PA; Orlando, FL; South Jordan, UT. A hybrid or fully remote work model can be considered based on the hiring manager's decision and the priorities of the role.

 

The salary range for this position, if located in NY Metro/NY State, is $146,032 to $162,257. However, please note that the salary range will vary for other geographic areas.


Our Employee Value Proposition

  • Competitive Pay, including a Bonus Target or Variable Pay Incentive Program 
  • Benefits Package: Medical, Dental, and Vision (plus much more)
  • 401(k) Plan with Company Match 
  • Short- & Long-Term Disability 
  • Wellness Programs 
  • Group Life and AD&D Insurance 
  • Paid Vacation, Sick Days, and Bank Holidays
  • Employee Engagement Activities including Employee Appreciation Day, DEI Employee Resource Groups, Corporate Social Responsibility, Service Recognition


We offer a total rewards package consisting of a competitive base rate of pay, variable pay incentive programs based on the role, and a comprehensive benefit suite. Offered rates of pay are determined based on job-related knowledge, relevant experience, skills, certifications, and geographic location.


We are proud to be an equal opportunity employer. All qualified applicants will receive consideration without regard to age, race, color, sex, or gender identity/expression (including pregnancy, childbirth, transgender status, or sexual orientation), religion or creed, ancestry, citizenship, national origin, disability, military or veteran status, marital status, genetic information, or any other characteristic protected by applicable law.

 

We do not tolerate discrimination, harassment, or retaliation. Employment decisions are based solely on qualifications, merit, and business needs. Everyone is welcome here, and we hire based on your ability to do the job, not any protected characteristics.

 

If you need help or reasonable accommodation during the application or hiring process, please let your TA Partner know.



The Company
HQ: Woodbury, NY
730 Employees
Year Founded: 1987

What We Do

CardWorks is one of the largest privately held providers of end-to-end operational servicing and support functions for credit card and installment loan products in North America. As a leading consumer firm, we service our consumer and small business loan clients across the credit spectrum, from super-prime to non-prime, and provide comprehensive support to bank and non-bank lenders in the United States and Canada. Our management expertise and customized servicing solutions enable banks and financial institutions to mitigate risk, increase profitability, and support their customers. CardWorks is also the parent of Merrick Bank Corporation, a top-15 issuer of credit cards, a top-15 merchant acquiring bank, and a leader in the recreational vehicle lending industry. As a CardWorks employee, you are at the very heart of all that we do. Our corporate success is based on your contributions. The most valuable resource we have at CardWorks is our employees. Each individual has an impact on how well we execute and on whether we achieve our enterprise objectives.
