Engineering - SRE Platforms - Site Reliability Engineer - Vice President - Dallas

Posted 5 Days Ago
Dallas, TX, USA
In-Office
Senior level
Fintech • Financial Services
The Role
Lead SRE technical strategy and architecture for highly available, scalable enterprise platforms. Build automation, observability, and incident response practices; mentor senior engineers; drive capacity planning, production reliability, and adoption of SRE best practices across cloud and on-prem environments.
Summary Generated by Built In

Site Reliability Engineer - Vice President

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run scalable, massively distributed, fault-tolerant systems. At Goldman Sachs, SRE is responsible for improving the availability and reliability of the firm’s most critical platform services and for ensuring they meet the requirements of our internal and external users. It is also responsible for firmwide policies and standards focused on the firm’s digital resilience. We are looking for engineers who are motivated to collaborate with our businesses to build and run sustainable production systems that can evolve and adapt to changes in our fast-paced, global business environment.

The SRE team develops and maintains platforms and tools that help other Engineering teams at Goldman Sachs build and operate reliable and resilient systems. These systems span on-premises datacenters and multiple public cloud environments. The platforms we offer include central logging, monitoring, agents, and alerting. We also provide tools that drive adoption of, and improvements to, capacity planning, operational readiness assessments, production incident postmortems, SLIs / SLOs, and deployment automation, including canary releases.

The products and services we provide to our internal customers are used by thousands of engineers every day. We believe that reliability is the most important feature of any system, and we are devoted to giving our engineers the platforms and tools they need to build and operate reliable products.

Role Overview

As a Site Reliability Engineer (SRE) at Goldman Sachs, you will be a pivotal leader in ensuring the availability, reliability, and scalability of the firm's most critical platform applications and services. You will combine deep software and systems engineering expertise to architect, build, and run large-scale, massively distributed, fault-tolerant systems. This role involves providing technical leadership, mentoring senior engineers, and collaborating closely with internal teams and executive stakeholders to build and operate sustainable production systems that can adapt to our dynamic global business environment. You will drive a culture of continuous improvement, championing the adoption of advanced SRE principles and best practices across the organization.

Responsibilities

  • Strategic Reliability & Performance: Drive the strategic direction for availability, scalability, and performance of mission-critical applications and platform services, ensuring alignment with firmwide objectives.
  • Architectural Leadership: Lead the design, build, and implementation of highly available, resilient, and scalable infrastructure and application architectures.
  • Advanced Automation & Tooling: Architect and develop sophisticated platforms, tools, and automation solutions to eliminate toil, optimize operational workflows, and enhance deployment processes across the enterprise.
  • Complex Incident Management & Post-Mortem Analysis: Lead critical incident response, conduct in-depth root cause analysis for systemic issues, and implement long-term preventative measures to significantly enhance system stability and resilience.
  • System Design & Capacity Planning: Partner with development teams to embed reliability into application design from inception, provide expert system design consulting, and lead comprehensive capacity planning initiatives for future growth.
  • Observability & Insights: Define and implement advanced monitoring, high-volume logging with multi-user query capabilities, and tracing strategies to provide deep, actionable insights into application performance, infrastructure health, and user experience.
  • Technical Vision & Mentorship: Provide technical vision, lead complex technical projects, conduct rigorous code reviews, enforce SDLC best practices, and actively mentor and develop senior and staff-level engineers.
  • Technology Evaluation & Adoption: Stay at the forefront of industry trends and advancements, evaluating and integrating cutting-edge tools and frameworks to significantly improve operational efficiency and reliability.
  • On-Call Leadership: Participate in and lead on-call rotations, providing expert guidance and hands-on support for critical system incidents.

Qualifications

  • Experience: Minimum of 6 years of hands-on experience in Site Reliability Engineering, with a proven track record in architecting, designing, building, and maintaining highly available, scalable, and fault-tolerant systems at an enterprise level.
  • Technical Proficiency:
    • Exceptional programming skills in one or more major languages such as Java, Python, or Go, with a focus on building robust, scalable software.
    • Extensive hands-on experience with cloud platforms (e.g., AWS, GCP) and deep expertise in containerization and orchestration technologies (e.g., Docker, Kubernetes).
    • Mastery of Infrastructure as Code (IaC) tools (e.g., Terraform, CloudFormation) and configuration management tools (e.g., Puppet, Chef, Ansible).
    • Advanced proficiency in Prompt Engineering and Retrieval-Augmented Generation (RAG) architectures to automate complex SRE workflows, such as the generation of Infrastructure as Code (IaC), dynamic runbooks, and incident response summaries.
    • Profound understanding of Linux internals, networking, distributed systems, and advanced system performance tuning.
    • Expertise in designing and implementing comprehensive monitoring, alerting, logging and tracing solutions (e.g., Prometheus, Grafana, ELK stack, Datadog, PagerDuty).
    • Deep experience with CI/CD tools and practices (e.g., Jenkins, GitLab, Maven).
    • Strong foundation in databases and distributed systems.
    • Exceptional problem-solving abilities and analytical skills, with a track record of resolving complex technical challenges.
  • Preferred Experience:
    • Experience with distributed databases like Elasticsearch
    • Experience working with GCP BigQuery
    • Experience with messaging systems like Kafka
  • Education: Advanced degree (Bachelor’s, Master’s, or PhD) in Computer Science or a related technical field involving coding and/or systems engineering, or equivalent practical experience.
  • Soft Skills: Superior communication, collaboration, and interpersonal skills, with the ability to influence technical direction, lead cross-functional initiatives, and effectively engage with global teams and executive leadership. Proven ability to work independently, manage multiple complex stakeholders, and drive significant organizational change.

Skills Required

  • 6+ years hands-on Site Reliability Engineering experience
  • Programming in Java, Python, or Go
  • Experience with public cloud platforms (AWS, GCP)
  • Containerization and orchestration (Docker, Kubernetes)
  • Infrastructure as Code (Terraform, CloudFormation)
  • Configuration management (Puppet, Chef, Ansible)
  • Advanced proficiency in Prompt Engineering and RAG architectures
  • Deep knowledge of Linux internals, networking, distributed systems, and performance tuning
  • Designing and implementing monitoring, logging and tracing (Prometheus, Grafana, ELK, Datadog, PagerDuty)
  • Experience with CI/CD tools and practices (Jenkins, GitLab, Maven)
  • Strong foundation in databases and distributed systems
  • Advanced degree in Computer Science or related field, or equivalent practical experience
  • Leadership, mentoring, stakeholder management, and strong communication skills
  • On-call leadership and incident management experience
  • Experience with distributed databases like Elasticsearch
  • Experience with GCP BigQuery
  • Experience with messaging systems like Kafka

Goldman Sachs Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Goldman Sachs and has not been reviewed or approved by Goldman Sachs.

  • Healthcare Strength: Coverage includes medical, dental, vision, disability, life and accident insurance, with multiple plan options and most premiums subsidized; coverage often starts on day one. Wellness resources, on-site health centers in some locations, and EAP access reinforce the depth of health support.
  • Parental & Family Support: Family care includes on-site childcare in some offices, expectant parent resources, and transitional programs for returning parents. Feedback suggests parental leave is very generous, with reports of around 20 weeks paid leave and stipends for adoption, surrogacy, and fertility-related services.
  • Retirement Support: The firm provides a 401(k) plan with employer matching contributions and broad financial education to help employees plan for retirement. Resources also support saving for education and preparing for unexpected events.

The Company
HQ: New York, NY
67,118 Employees

What We Do

At Goldman Sachs, we believe progress is everyone’s business. That’s why we commit our people, capital and ideas to help our clients, shareholders and the communities we serve to grow. Founded in 1869, Goldman Sachs is a leading global investment banking, securities and investment management firm. Headquartered in New York, we maintain offices in all major financial centers around the world. More about our company can be found at www.goldmansachs.com
