Workplace Platforms - Site Reliability Engineer (SRE) Lead - Dallas

Goldman Sachs

Posted 2 Days Ago

Dallas, TX, USA

In-Office

Senior level

Fintech • Financial Services

The Role

Lead reliability engineering for endpoint compute platforms (physical, virtual, cloud desktops) and supporting services. Define SLOs/SLIs, observability, failure models, runbooks, and automation. Drive incident response, post-incident remediation, and resilience improvements. Partner with security, identity, and platform teams to align risk and governance. Mentor engineers and communicate reliability posture to leadership, improving operability and reducing incident frequency and impact.

Summary Generated by Built In

Team Overview

The Workplace Engineering organization is responsible for the reliability, resilience, and operational integrity of the firm’s endpoint compute platforms and services, including:

Corporate‑owned physical devices
Virtual and cloud‑hosted desktops
Core endpoint services such as device lifecycle management, access and identity integration, profile and session services, and application delivery frameworks

The Endpoint Compute SRE function applies Site Reliability Engineering (SRE) principles to ensure these platforms and services are highly available, observable, scalable, and recoverable, while meeting operational and regulatory expectations.

Role Summary

We are seeking an Endpoint Compute SRE Lead to own reliability engineering and operational excellence across endpoint compute platforms and their foundational services.

This role is focused on systems and services, not applications, and covers the reliability of:

Endpoint compute platforms (physical, virtual, cloud desktops)
Device and desktop lifecycle services
Access and sign‑in dependency platforms
Profile, policy, and session services
Application delivery and execution frameworks (packaging, deployment, availability—not app functionality)

The successful candidate will define service-level objectives, observability strategies, failure models, and operational practices that ensure a predictable and resilient end‑user compute experience at enterprise scale.

Job Responsibilities

Reliability Engineering Across Endpoint Services

Own end-to-end reliability of endpoint compute platforms and supporting services
Define service boundaries, dependencies, and critical paths from user sign‑in through productive desktop use
Model failure modes and blast radius across lifecycle, access, and delivery services
Drive designs that support graceful degradation and fast recovery

Observability & Telemetry

Establish observability standards across endpoint compute services, including:
- Enrollment and provisioning success rates
- Access and session establishment health
- Policy and profile delivery latency/failures
- Application delivery availability
Ensure telemetry enables:
- Fast incident detection
- Root cause analysis
- Proactive trend identification

SLOs, SLIs & Error Budgets

Define SLOs and SLIs for key endpoint services (e.g., sign‑in success, provisioning time, policy convergence)
Implement error budget frameworks to guide change, security control rollout, and platform evolution
Use reliability signals to influence platform design and operational priorities

Incident, Problem & Resilience Management

Lead reliability aspects of incident response involving endpoint compute or services
Drive post‑incident reviews focused on systemic corrections
Identify recurring failure patterns in:
- Lifecycle flows
- Access paths
- Policy or profile delivery
Sponsor and track permanent fixes, not workarounds

Operational Excellence & Automation

Define and maintain runbooks, playbooks, and escalation models for endpoint services
Drive automation to reduce:
- Manual remediation
- Repeat incidents
- Operational toil
Influence engineering designs to improve operability and debuggability

Risk & Governance Alignment

Partner with Technology Risk and Security teams to:
- Demonstrate reliability and recoverability controls
- Support operational risk and resilience assessments
- Provide audit‑ready evidence for availability and incident management
Ensure reliability metrics support control effectiveness narratives

Leadership & Collaboration

Act as the reliability authority for endpoint compute and services
Partner closely with:
- Endpoint platform engineers
- Device management teams
- Security engineering and identity teams
Mentor engineers in applying SRE principles to workplace platforms
Communicate reliability posture clearly to leadership

Basic Qualifications

8+ years in SRE, platform operations, reliability engineering, or workplace infrastructure roles
Strong experience operating endpoint compute platforms and core supporting services at enterprise scale
Proven ability to define and implement:
- Observability frameworks
- SLOs / SLIs
- Incident and problem management models
Strong systems thinking across lifecycle, access, and service dependencies
Excellent documentation and communication skills

Preferred Qualifications

Experience applying SRE concepts to end‑user computing or digital workplace platforms
Deep understanding of:
- Device lifecycle and provisioning services
- Identity and access dependencies (availability-focused)
- Profile, policy, and session orchestration
Experience in regulated or high‑assurance environments
Strong ability to influence architecture using data‑driven reliability insights

What Success Looks Like

Endpoint compute and services have clear reliability targets
Lifecycle, access, and delivery failures are predictable, observable, and fast to remediate
Incidents are less frequent, shorter, and less impactful
Platforms are designed with operability and resilience built in
Leadership has confidence in desktop stability as a service

Skills Required

8+ years in SRE, platform operations, reliability engineering, or workplace infrastructure roles
Strong experience operating endpoint compute platforms and core supporting services at enterprise scale
Proven ability to define and implement observability frameworks
Proven ability to define and implement SLOs / SLIs and error budget frameworks
Proven ability to define and implement incident and problem management models
Strong systems thinking across lifecycle, access, and service dependencies
Excellent documentation and communication skills
Experience applying SRE concepts to end-user computing or digital workplace platforms
Deep understanding of device lifecycle and provisioning services
Understanding of identity and access availability dependencies
Experience with profile, policy, and session orchestration
Experience in regulated or high-assurance environments
Ability to influence architecture using data-driven reliability insights

Goldman Sachs Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes drawn from responses that popular LLMs generated to common candidate questions about Goldman Sachs. It has not been reviewed or approved by Goldman Sachs.

Healthcare Strength — Coverage includes medical, dental, vision, disability, life and accident insurance, with multiple plan options and most premiums subsidized; coverage often starts on day one. Wellness resources, on-site health centers in some locations, and EAP access reinforce the depth of health support.
Parental & Family Support — Family care includes on-site childcare in some offices, expectant parent resources, and transitional programs for returning parents. Feedback suggests parental leave is very generous, with reports of around 20 weeks paid leave and stipends for adoption, surrogacy, and fertility-related services.
Retirement Support — The firm provides a 401(k) plan with employer matching contributions and broad financial education to help employees plan for retirement. Resources also support saving for education and preparing for unexpected events.

The Company

HQ: New York, NY

67,118 Employees

What We Do

At Goldman Sachs, we believe progress is everyone’s business. That’s why we commit our people, capital and ideas to help our clients, shareholders and the communities we serve to grow. Founded in 1869, Goldman Sachs is a leading global investment banking, securities and investment management firm. Headquartered in New York, we maintain offices in all major financial centers around the world. More about our company can be found at www.goldmansachs.com