Radiant Digital

Site Reliability Engineering (SRE) Architect

Posted Yesterday

Dallas, TX, USA

In-Office

Expert/Leader

Information Technology

The Role

Design and architect highly available OSS/BSS and mainframe systems using SRE principles. Lead reliability, observability, automation, disaster recovery, incident management, and cross-functional transformations across hybrid cloud and on‑prem environments for telecom operations.

Summary Generated by Built In

Position Title: Site Reliability Engineering (SRE) Architect (Telecom OSS/BSS & Mainframe)

Location: Dallas, TX; Basking Ridge, NJ; NC; and Tampa, FL.

Work Arrangement: Hybrid/Onsite

Interview Type: Video

Must have:

15+ years of progressive experience in enterprise IT and telecommunications environments, with extensive expertise in designing, implementing, and supporting complex OSS/BSS ecosystems that enable large-scale business and network operations.
8+ years of hands-on architecture experience across IBM Mainframe z/OS and midrange platforms (Linux/Solaris), delivering scalable, secure, and highly available enterprise solutions.
Demonstrated expertise in Site Reliability Engineering (SRE) principles, including defining and managing Service Level Objectives (SLOs), Service Level Indicators (SLIs), Error Budgets, reliability governance, and continuous service improvement.
Deep functional and technical knowledge of Telcordia OSS applications, including SWITCH, TIRKS, FACS, WFA, and SOAC, with experience integrating and optimizing telecom operational support systems.
Proven ability to design and implement high-availability, fault-tolerant, resilient, and disaster recovery architectures, ensuring business continuity and mission-critical system reliability.
Strong hands-on expertise with IBM Mainframe technologies, including z/OS internals, JCL, IMS, VSAM, DB2, CICS, system utilities, workload management, performance tuning, and production diagnostics.
Extensive experience implementing observability and monitoring solutions using industry-leading tools such as Splunk, Dynatrace, Instana, IBM NetCool, Grafana, and AppDynamics to improve operational visibility and proactive incident detection.
Proven success in driving automation, self-healing capabilities, infrastructure as code, CI/CD reliability practices, and DevOps/SRE transformation across hybrid cloud and on-premises enterprise environments.
Strong understanding of end-to-end telecommunications business processes, including service provisioning, inventory management, order management, activation, network fulfillment, service assurance, and lifecycle management.
Extensive experience leading major incident management, conducting Root Cause Analysis (RCA), problem management, and implementing preventive measures to significantly improve MTTD (Mean Time to Detect), MTTR (Mean Time to Resolve), system stability, and operational excellence.
Proven ability to collaborate with cross-functional teams including Enterprise Architecture, Infrastructure, Development, Operations, Network Engineering, and business stakeholders to deliver highly reliable, business-critical technology solutions.
Excellent leadership, stakeholder management, and communication skills, with a strong track record of mentoring technical teams, driving reliability engineering best practices, and supporting large-scale enterprise transformation initiatives.

About Us

At Radiant Digital, we provide IT solutions and consulting services to government agencies and businesses in the USA, Canada, the Middle East, and Southeast Asia. On the federal side, we support agencies like NASA, the Department of State (DOS), the IRS, ACL, ACF, USDA, and many others, along with numerous state and local government agencies.

We work with industries such as telecom, healthcare, entertainment, and oil and gas, offering solutions designed to meet their specific needs. We focus on improving systems, making better use of data, and updating applications to keep up with changing markets.

Skills Required

15+ years enterprise IT and telecommunications experience designing, implementing, and supporting complex OSS/BSS ecosystems
8+ years architecture experience across IBM Mainframe z/OS and midrange platforms (Linux/Solaris)
Expertise in Site Reliability Engineering principles including SLOs, SLIs, Error Budgets, and reliability governance
Deep functional and technical knowledge of Telcordia OSS applications: SWITCH, TIRKS, FACS, WFA, SOAC
Design and implement high-availability, fault-tolerant, resilient, and disaster recovery architectures
Strong hands-on expertise with IBM Mainframe technologies: z/OS internals, JCL, IMS, VSAM, DB2, CICS, system utilities, workload management, performance tuning, production diagnostics
Experience implementing observability and monitoring with Splunk, Dynatrace, Instana, IBM NetCool, Grafana, AppDynamics
Proven experience driving automation, self-healing, Infrastructure as Code, CI/CD reliability practices, and DevOps/SRE transformations
Strong understanding of telecom business processes: service provisioning, inventory, order management, activation, fulfillment, assurance, lifecycle management
Extensive experience leading major incident management, RCA, and problem management to improve MTTD/MTTR and system stability
Ability to collaborate with Enterprise Architecture, Infrastructure, Development, Operations, Network Engineering, and business stakeholders
Excellent leadership, stakeholder management, mentoring, and communication skills to drive reliability engineering best practices