About the Role
This position participates in an engineering on-call rotation and provides after-hours support for production issue escalations on a rotational basis.
This position is based in the Philadelphia area with a hybrid schedule. Remote arrangements may be considered for exceptional candidates, with occasional travel to Philadelphia required.
You’ll join a team of SREs who work closely with other teams of world-class engineers to tenaciously and creatively solve problems and reduce manual toil wherever possible. We expect AI and automation to be a force multiplier in everything you do — from accelerating root-cause analysis and enriching alerts, to generating runbooks and codifying remediation so that the platform increasingly heals itself.
Successful candidates are heavily results-driven, bring well-established expertise across both traditional and bleeding-edge technology, and have a strong desire to continuously grow and improve themselves and our platform. This is a global operation spanning multiple regions and time zones, and the role demands the flexibility and commitment that a 24/7 payment platform requires.
Primary Responsibilities
- Build and maintain a comprehensive understanding of the platform and custom application stack.
- Implement, maintain, and continuously improve observability strategies and metrics that ensure complete system health for numerous complex products throughout all stages of the development lifecycle, up to and including production.
- Continuously identify automation opportunities and follow through to successful implementation, applying AI-assisted tooling to accelerate development and reduce manual effort.
- Design, build, and maintain automated remediation and self-healing workflows that detect, triage, and resolve common failure modes with minimal human intervention.
- Leverage AI/ML-driven observability — anomaly detection, alert correlation, and intelligent noise reduction — to surface issues earlier and shorten time to detection.
- Use AI-assisted analysis to accelerate root-cause investigation, enrich incident context, and generate first-draft postmortems and runbooks for human review.
- Handle escalations and collaborate effectively with other team members to quickly determine the root cause of any type of service degradation.
- Implement, maintain, and continuously improve incident response procedures and other operational documentation, automating documentation generation and upkeep wherever practical.
- Assist with troubleshooting and remediation of failed scheduled jobs and data-related concerns.
- Champion responsible, secure adoption of AI tooling across the SRE function by sharing patterns, prompts, and automations that raise the productivity of the whole team.
AI Enablement & Automation
- Apply AI-assisted development and operations tools, including Anthropic (Claude), OpenAI (Codex), Azure AI services (Foundry, Azure SRE Agent), and the agentic workflows built on them, to write, review, and accelerate automation and infrastructure code.
- Build and integrate automation that turns repetitive operational work into codified, repeatable, and self-service workflows.
- Use AIOps and ML-driven observability capabilities within the APM stack for anomaly detection, predictive alerting, and alert correlation.
- Develop and refine prompts, agents, and integrations that connect monitoring, ticketing, and remediation systems into faster end-to-end response loops.
- Evaluate emerging AI tooling for reliability and operations use cases, and advocate for adoption where it delivers measurable improvements in toil reduction, MTTR, or availability.
- Ensure all AI and automation usage adheres to FreedomPay’s security, privacy, and PCI obligations — keeping sensitive data appropriately protected and human review in place for high-impact actions.
Required Background and Experience
- BS degree in Computer Science or an equivalent field, or equivalent years of relevant experience.
- Minimum of 5 years of hands-on technical experience in highly available, high-throughput, web-based technology environments.
- Demonstrated history of self-directed learning — someone who independently seeks out knowledge, builds new skills without being told to, and doesn’t wait for formal training to close gaps.
- Next-level problem-solving abilities and a strong bias toward practical, proven solutions.
- A track record of identifying and eliminating manual toil through automation.
- Excellent communication and organizational skills, with a strong sense of ownership and service.
Required Technical Skills
- Expert-level proficiency in an enterprise APM platform and its AI/ML-driven (AIOps) capabilities; Dynatrace experience is strongly preferred, though deep expertise in comparable tools such as Datadog or New Relic is readily transferable.
- Hands-on experience with AI-assisted development and automation tools — such as Anthropic (Claude), OpenAI (Codex), and Azure AI services (Foundry, Azure SRE Agent) — and a demonstrated ability to apply them to real operational and engineering work.
- Proficiency in scripting and automation — PowerShell and/or Python — to build tooling and remediation workflows.
- Strong SQL / T-SQL skills.
- Solid understanding of core networking concepts: DNS, HTTP/HTTPS, load balancing, and TCP/IP routing and switching.
- Working knowledge of modern technology infrastructure including container orchestration, IaaS/PaaS cloud services, Azure, and VMware.
- Working knowledge of application development processes.
Preferred Technical Skills and Experience
- Proven track record of successfully implementing SLIs and SLOs and fostering their adoption across an organization.
- Experience implementing enterprise incident management practices.
- Experience building AIOps or ML-driven automation into production observability and incident response.
- Azure Kubernetes Service (AKS) and broader container orchestration experience.
- Windows Server (IIS) administration.
- PagerDuty Process Automation (formerly Rundeck) or comparable runbook automation platforms.
- Comprehensive experience supporting real-time transaction processing applications.
- Familiarity with PCI policies and best practices.
Additional Experience, a Plus
- AI/ML model deployment, evaluation, or operations (MLOps).
- Documentation automation and self-service tooling / service catalog implementation.
- Experience integrating QA test automation into CI/CD pipelines.
Skills Required
- BS in Computer Science or equivalent experience
- Minimum 5 years hands-on experience in highly available, high-throughput web-based environments
- Expert-level proficiency in an enterprise APM platform and AIOps capabilities
- Dynatrace experience
- Hands-on experience with AI-assisted development and automation tools (Anthropic Claude, OpenAI Codex, Azure AI services)
- Proficiency in scripting and automation (PowerShell and/or Python)
- Strong SQL / T-SQL skills
- Solid understanding of networking concepts: DNS, HTTP/HTTPS, load balancing, TCP/IP routing and switching
- Working knowledge of container orchestration, IaaS/PaaS cloud services, Azure, and VMware
- Working knowledge of application development processes
- Track record of identifying and eliminating manual toil through automation
- Excellent communication, organizational skills, and strong ownership/service orientation
- SLI/SLO implementation experience
- Enterprise incident management practices experience
- Experience building AIOps or ML-driven automation into observability and incident response
- Azure Kubernetes Service (AKS) and broader container orchestration experience
- Windows Server (IIS) administration
- PagerDuty Process Automation (formerly Rundeck) or comparable runbook automation platforms
- Experience supporting real-time transaction processing applications
- Familiarity with PCI policies and best practices
- AI/ML model deployment or MLOps experience
- Documentation automation and self-service tooling or service catalog implementation
- Experience integrating QA test automation into CI/CD pipelines
What We Do
The FreedomPay Commerce Platform is the best way for merchants to simplify complex payment environments. The platform is validated by the PCI Security Standards Council for Point-to-Point Encryption (P2PE), along with EMV, tokenization, contactless, and DCC capabilities. Global leaders in retail, hospitality, gaming, education, healthcare, and financial services trust FreedomPay to deliver unmatched security and advanced value-added services.







