Site Reliability Engineer (Java focused) Sr or Lead

Taylor, TX
In-Office
99K-169K Annually
Senior level
Utilities
The Role
The Site Reliability Engineer will ensure the reliability of Java-based systems, manage monitoring tools, and lead incident responses while mentoring engineers.

At ERCOT, our diverse and dynamic work environment provides a platform on which employees can work together to build the future of the Texas power grid and wholesale market utilizing the latest technologies and resources. We encourage you to join our talented, dedicated workforce to develop world-class solutions for today's and tomorrow's energy challenges while learning new skills and growing your career.

ERCOT is committed to fostering inclusion at all levels of our company. It is the cornerstone of our corporate values of accountability, leadership, innovation, trust, and expertise. We know that individuals with a wide variety of talents, ideas, and experiences propel the innovation that drives our success. An inclusive and diverse workforce strengthens us and allows for a collaborative environment to solve the challenges that face our industry today and in the future.

JOB SUMMARY

ERCOT is seeking a Senior or Lead Site Reliability Engineer (SRE) with strong Java application expertise to ensure the availability, performance, and reliability of mission-critical systems. This role will follow ERCOT-specific SRE processes and principles, which include managing site failover between two datacenters and, in the future, treating Azure as an extended datacenter. You will work deeply with Java codebases while owning production health and operational excellence.
 
JOB DUTIES INCLUDE:

Core Responsibilities

- Own reliability, availability, latency, and scalability of Java-based systems

- Define and track SLIs, SLOs, and error budgets

- Design and maintain monitoring, alerting, logging, and dashboards

- Lead incident response and conduct blameless postmortems

- Reduce operational toil through automation and tooling

- Review system designs for reliability and failure modes

- (Lead level) Establish reliability standards and mentor engineers

Java & Application Responsibilities

- Debug and improve Java applications (Spring Boot preferred)

- Perform JVM tuning and performance analysis

- Diagnose failures across databases, messaging, and APIs

- Partner with development teams to improve resilience and recovery

On-Call & Incident Response

- Participate in an on-call rotation for supported services

- Focus on engineering solutions rather than repetitive manual work

- Emphasize post-incident learning and automation

- Track and actively reduce toil

EXPERIENCE:

- 5+ years (Senior) or 10+ years (Lead) in SRE, DevOps, or Production Engineering

- Strong Java experience (Spring-based systems)

- Experience with distributed, high-availability systems

- Expertise in observability tools (metrics, logs, traces)

- CI/CD experience (Git, Maven, Jenkins)

- Strong cross-layer debugging skills

- CS or related degree required

PREFERRED

- Python
- Kubernetes or OpenShift
- Microsoft Azure
- Kafka or ActiveMQ
- Infrastructure automation (Terraform, Azure Resource Manager, Ansible, Liquibase)
- Chaos or load testing experience

Observability & Production Tooling

- Strong hands-on experience with observability and APM platforms such as Splunk, Dynatrace, and Datadog
- Expertise in using Metrics, Logs, Traces, and Profiling (MLTP) to troubleshoot complex production incidents
- Experience with the Grafana LGTM stack for observability (Loki for logs, Grafana for dashboards and visualization, Tempo for traces, and Mimir for metrics)
- Experience correlating application performance data with system behavior to identify root causes and prevent recurrence

WORK LOCATION – Taylor, TX:

  • Employees are required to be on-site in Taylor, TX at least 2 days per week, or more often as business needs require, as determined by management.
  • On-site schedules are flexible or may be rotated based on business needs as determined by the Manager.
  • Remote work must be performed from your Texas residence.
  • Employees may opt to work on-site more than required or 100% of the time.

The foregoing description reflects the minimum qualifications and the essential functions of the position that must be performed proficiently with or without reasonable accommodation for individuals with disabilities.  It is not an exhaustive list of the duties expected to be performed, and management may, at its discretion, revise or require that other or different tasks be performed as assigned.  This job description is not intended to create a contract of employment with ERCOT.  Both ERCOT and the employee may exercise their employment-at-will rights at any time.

ERCOT is firmly committed to equal employment for all qualified persons without regard to race, sex, medical condition, religion, age, creed, national origin, citizenship status, marital status, sexual orientation, physical or mental disability, ancestry, veteran status, genetic information or any other protected category under federal, state or local law.

Expected Salary Range:

$99,230 - $168,715

Top Skills

ActiveMQ
Ansible
Azure
Datadog
Dynatrace
Git
Grafana
Java
Jenkins
Kafka
Kubernetes
Liquibase
Maven
OpenShift
Python
Splunk
Spring Boot
Terraform

The Company
Taylor, TX
995 Employees

What We Do

The Electric Reliability Council of Texas (ERCOT) manages the flow of electric power to 27 million Texas customers - representing about 90 percent of the state's electric load. As the independent system operator (ISO) for the region, ERCOT schedules power on an electric grid that connects more than 54,100 miles of transmission lines and 1,250+ generation units, including Private Use Networks. ERCOT also performs financial settlement for the competitive wholesale bulk-power market and administers retail switching for 8 million premises in competitive choice areas. ERCOT is a membership-based 501(c)(4) nonprofit corporation, governed by a board of directors and subject to oversight by the Public Utility Commission of Texas and the Texas Legislature.

ERCOT's members include consumers, cooperatives, generators, power marketers, retail electric providers, investor-owned electric utilities (transmission and distribution providers), and municipal-owned electric utilities.
