Lead, Site Reliability Engineer

CardWorks

Reposted 6 Days Ago

Pittsburgh, PA, USA

In-Office

146K-162K Annually

Senior level

Financial Services

The Role

The Lead Site Reliability Engineer will establish the SRE operating model, implement AI-enabled reliability use cases, manage reliability metrics, and oversee operational readiness while collaborating with teams and mentoring engineers.

Summary Generated by Built In

Become an everyday champion — and build a career where your impact fuels financial progress.

What We Do

CardWorks Financial Group is a diversified financial services platform building ethical solutions across credit, lending, and the full customer lifecycle. Through our family of companies, CardWorks Financial Group tackles the complex challenges that larger financial institutions leave behind. We’re embedded throughout the credit card ecosystem as a lender, servicer, and merchant acquirer.

Who We Are

Merrick Bank: The bank that builds
CardWorks Servicing: One partner, total performance
Carson Smithfield: Resolution with respect

With nearly 40 years of operating history, our track record is solid: disciplined in downturns and built to accelerate in recovery. The CardWorks Financial Group companies take a precise approach in complex markets, as a top-three non-prime-focused general-purpose card issuer and a top-fifteen U.S. merchant acquirer.

Our team tackles the industry’s most complex credit and payment challenges. And we believe that excellent work starts with a team that feels supported, respected, and empowered to grow.

CardWorks Servicing, LLC provides end-to-end operational servicing functions for credit cards, secured cards, and installment loans. We service consumer and small business loans across the credit spectrum and offer backup servicing and due diligence services to capital providers and trustees.

Founded in 1997, Merrick Bank is an FDIC®-insured financial institution headquartered in South Jordan, Utah, with over $10 billion in assets. A wholly owned subsidiary of CardWorks Financial Group, Merrick Bank serves roughly five million cardmembers and more than 100,000 merchant customers, offering credit cards, recreational loans, deposit accounts, merchant services and bank sponsorships to consumers and businesses.

Carson Smithfield, LLC provides a variety of post-charge-off debt recovery services, including digital self-service, IVR, live agent, and external agency management.

Essential Functions:

Establish the SRE operating model (service onboarding, engagement model, governance, reliability reviews, production readiness standards, and quarterly planning) and ensure it is adopted across teams.
Identify, pilot, and operationalize AI-enabled reliability use cases (e.g., alert noise reduction, incident summarization, correlation/root-cause hypothesis generation, runbook assistance, and auto-remediation with human approval) with appropriate guardrails.
Define, implement, and operationalize reliability metrics by establishing and managing SLIs, SLOs, and error budgets to quantify and continuously improve service reliability, supporting engineering and business decisions.
Own the centralized SRE service engagement model by defining service tiers, onboarding criteria, reliability standards, and a transparent intake/prioritization process aligned to business criticality.
Define and enforce error budget policies (including escalation paths and release risk decisions) in partnership with Product/Engineering, using SLO attainment to guide trade-offs between feature velocity and reliability.
Establish and maintain centralized “paved road” reliability standards and assets (instrumentation conventions, golden signals, alerting standards, runbook templates, SLO dashboards) that product teams can adopt with minimal friction.
Design the on-call and escalation model for a centralized SRE team (e.g., SRE overlay for major incidents, defined handoffs with service owners, and clear ownership boundaries) to improve response quality without creating single-team dependency.
Design and engineer automation and observability solutions by developing tooling, dashboards, and systems to reduce operational toil (measure, report, and drive toil down over time), enhance system visibility, and accelerate delivery.
Participate in incident and problem management by serving as incident coordinator for high-severity events, driving cross-functional responses, conducting blameless root cause analysis, running post-incident reviews (postmortems) with clear owners and due dates, and ensuring remedial actions drive reliability improvements.
Oversee operational readiness and performance by managing capacity planning, validating disaster recovery, conducting production readiness reviews, and ensuring systems meet availability, scalability, and recovery expectations.
Partner with security, risk, and compliance teams to align reliability goals with governance and compliance requirements, ensuring secure, auditable, and well-documented practices.
Collaborate across the organization by working closely with end users, product management, development, architecture, and IT Operational teams to embed reliability principles throughout the software development lifecycle, including service onboarding, reliability reviews, and shared SLO ownership.
Champion reliability as a core product feature by promoting reliability throughout all phases of development, advocating for continuous improvement, and communicating key metrics and potential customer impact to stakeholders.
Train, mentor, and upskill engineering teams by coaching engineers in SRE practices, supporting junior team members, and fostering a culture of shared ownership and accountability for reliability, including influencing teams without direct authority through standards, data, and executive-aligned priorities.
Remain current on the latest SRE trends and best practices, including observability, AI-enabled operations (AIOps), and SLO management, and implement these methodologies to effectively support desired business outcomes.
Evaluate AI tools for reliability with security/privacy/compliance guardrails (e.g., data handling, prompt/content controls, auditability) and measure impact.
Participate in on-call rotations and operational support for SRE-supported systems and products.

Summary of Qualifications:

Experience in Site Reliability Engineering with a track record of delivering measurable improvements in uptime, scalability, release stability, and overall reliability in complex enterprise environments.
Demonstrated experience standing up or significantly maturing an SRE practice (operating model, SRE/service engagement, production readiness, incident/postmortem program, and reliability roadmap).
Hands-on experience applying AI/ML to operations (AIOps) or GenAI in production support workflows, with a focus on measurable outcomes (MTTD/MTTR, alert fatigue reduction, change failure rate) and responsible use controls.
Proven ability to establish Service Level Indicators (SLIs) and SLOs in production environments, including hands-on definition and implementation.
Demonstrated background in production incident response, leading resolution efforts, conducting blameless post-incident reviews, and implementing actionable remediation strategies.
Strong observability and telemetry expertise in designing instrumentation, building actionable dashboards and alerts, and delivering proactive reliability insights using metrics, logs, and traces.
Infrastructure engineering experience with strong Infrastructure as Code skills using tools such as Terraform and Ansible.
Thorough understanding and practical experience in CI/CD pipeline design, optimization, and troubleshooting using modern tooling and platforms such as Azure DevOps, GitHub Actions, Jenkins, or GitLab CI, with an emphasis on speed, reliability, and security.
Practical knowledge of containerization and platform modernization, including architecting and operating containerized workloads with Docker, VMware, and Kubernetes (or comparable orchestration platforms) to modernize legacy applications and improve fault tolerance.
Knowledge of emerging reliability practices, including SLO automation platforms, AIOps, or predictive operations to advance proactive reliability management.
Preferred certifications include AWS Professional, Terraform, Ansible, Azure DevOps, Octopus Deploy or other automation-focused credentials that demonstrate continuous technical development.

Education and Experience:

Master’s degree in Computer Science, Engineering, or equivalent practical experience designing and operating production systems at scale.
7+ years of experience in Site Reliability Engineering.

Ideally, the qualified candidate will work at one of the following locations: Woodbury, NY; Pittsburgh, PA; Orlando, FL; South Jordan, UT. A hybrid or fully remote work model can be considered based on the hiring manager's decision and the priorities of the role.

The salary range for this position, if located in NY Metro/NY State, is $146,032 to $162,257. However, please note that the salary range will vary for other geographic areas.

Our Employee Value Proposition

Competitive Pay, including a Bonus Target or Variable Pay Incentive Program
Benefits Package: Medical, Dental, and Vision (plus much more)
401(k) Plan with Company Match
Short- & Long-Term Disability
Wellness Programs
Group Life and AD&D Insurance
Paid Vacation, Sick Days, and Bank Holidays
Employee Engagement Activities including Employee Appreciation Day, DEI Employee Resource Groups, Corporate Social Responsibility, Service Recognition

We offer a total rewards package comprised of a competitive base rate of pay, variable pay incentive programs based on the role, and a comprehensive benefit suite. Offered rates of pay are determined based on job-related knowledge, relevant experience, skills, certifications, and geographic location.

We are proud to be an equal opportunity employer. All qualified applicants will receive consideration without regard to age, race, color, sex, or gender identity/expression (including pregnancy, childbirth, transgender status, or sexual orientation), religion or creed, ancestry, citizenship, national origin, disability, military or veteran status, marital status, genetic information, or any other characteristic protected by applicable law.

We do not tolerate discrimination, harassment, or retaliation. Employment decisions are based solely on qualifications, merit, and business needs. Everyone is welcome here, and we hire based on your ability to do the job, not any protected characteristics.

If you need help or reasonable accommodation during the application or hiring process, please let your TA Partner know.

Skills Required

Experience in Site Reliability Engineering with measurable improvements in uptime and reliability
Experience in establishing SLIs and SLOs in production environments
Hands-on experience applying AI/ML in production support workflows
Strong Infrastructure as Code skills using Terraform and Ansible
Education: Master’s degree in Computer Science, Engineering, or equivalent experience
7+ years of experience in Site Reliability Engineering
Proficiency in CI/CD pipeline design and optimization
Experience with modern containerization technologies like Docker and Kubernetes
Knowledge of emerging reliability practices and AIOps

CardWorks Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about CardWorks and has not been reviewed or approved by CardWorks.

Healthcare Strength — The company provides multiple medical, dental, and vision options with preventive care supported in-network and access to mental-health resources via an EAP. Feedback suggests coverage breadth is solid and comparable to mainstream employers.
Retirement Support — A 401(k) with a company match, Roth availability, and defined eligibility/vesting supports long-term savings. This structure adds tangible value to total compensation beyond base pay.
Wellbeing & Lifestyle Benefits — Programs such as EAP counseling, wellness platforms, and specialty care services contribute to non-cash value. Employee resource groups and events further round out the perks experience for those who value community and programming.


The Company

HQ: Woodbury, NY

730 Employees

Year Founded: 1987

What We Do

CardWorks is one of the largest privately held providers of end-to-end operational servicing and support functions for credit card and installment loan products in North America. As a leading consumer firm, we service our consumer and small business loan clients across the credit spectrum, from super-prime to non-prime, and provide comprehensive support to bank and non-bank lenders in the United States and Canada. Our management expertise and customized servicing solutions enable banks and financial institutions to mitigate risk, increase profitability, and support their customers. CardWorks is also the parent of Merrick Bank Corporation, a top-15 issuer of credit cards, a top-15 merchant acquiring bank, and a leader in the recreational vehicle lending industry. As a CardWorks employee, you are at the very heart of all that we do. Our corporate success is based on your contributions. The most valuable resource we have at CardWorks is our employees. Each individual has an impact on how well we execute and on whether we achieve our enterprise objectives.