Site Reliability Engineer

Posted 8 Days Ago
Houston, TX, USA
Hybrid
Senior level
Hardware • Other • Energy
The Role
Maintain and monitor production systems for availability and performance; lead incident response and postmortems; implement observability, alerting, and automated remediation; optimize distributed systems (AKKA.NET) and PostgreSQL; build CI/CD pipelines and infrastructure-as-code.

As a Site Reliability Engineer, you will be responsible for:

Operational Excellence & Incident Management

- Maintain and monitor production systems for availability, latency, and performance.

- Lead incident response efforts, including communication, resolution, and postmortem documentation.

- Design and implement health checks, alerting systems, and automated remediation workflows.

- Drive root cause analysis and implement permanent resolutions for recurring issues.

Observability & Insights

- Set up and maintain full observability stacks (logging, metrics, tracing) using tools like Prometheus, Grafana, Datadog, OpenTelemetry, or ELK.

- Analyze telemetry and logs to identify trends, anomalies, and opportunities for improvement.

- Conduct post-incident reviews and use insights to inform future engineering investments.

Performance & Systems Optimization

- Tune and optimize distributed systems, including AKKA.NET actors, for performance and resource efficiency.

- Work with developers to evolve architecture and improve system throughput, latency, and stability.

- Optimize PostgreSQL performance, queries, and maintenance strategies.

CI/CD & Automation

- Design and maintain modern CI/CD pipelines using GitHub Actions, Azure Pipelines, or GitLab CI.

- Automate deployment, testing, and rollback processes to reduce friction and increase deployment frequency.

- Standardize infrastructure as code practices across environments.

We’d love to talk to you if you have:

- 5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.

- Expertise in Kubernetes and container orchestration at scale.

- Strong experience with AKKA.NET or similar actor-based frameworks.

- Proficiency with scripting and automation (Bash, PowerShell, Python).

- Experience with observability tools (Phobos, Datadog, Prometheus, Grafana, OpenTelemetry, ELK).

- Hands-on experience with cloud platforms (AWS, Azure, or GCP).

- Strong PostgreSQL knowledge—performance tuning, query optimization, maintenance.

- Proven ability to lead incident management and drive postmortem processes.

- A builder’s mindset with high standards for operational excellence and technical ownership.

Preferred Tools & Ecosystem Experience

- CI/CD: GitHub Actions, Azure Pipelines, GitLab CI

- Infrastructure: Kubernetes, Docker, Terraform

- Monitoring: Phobos (AKKA.NET), Datadog, Prometheus

- Source Control: GitHub, GitLab, Azure DevOps

- Programming: C#, Python, Bash, PowerShell

About Us
Every day, the oil and gas industry’s best minds put more than 150 years of experience to work to help our customers achieve lasting success.
We Power the Industry that Powers the World
Throughout every region in the world and across every area of drilling and production, our family of companies has provided the technical expertise, advanced equipment, and operational support necessary for success—now and in the future.
Global Family
We are a global family of thousands of individuals, working as one team to create a lasting impact for ourselves, our customers, and the communities where we live and work.
Purposeful Innovation
Through purposeful business innovation, product creation, and service delivery, we are driven to power the industry that powers the world better.
Service Above All
This drives us to anticipate our customers’ needs and work with them to deliver the finest products and services on time and on budget.
About the Team
Corporate
Our family of companies is supported by our global Corporate teams, providing expert knowledge from functions including Human Resources, Information Technology, Compliance, Finance, QHSE, Marketing and Legal centers of expertise. We are structured to provide guidance and service above all to all our business operations.

Skills Required

  • 5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles
  • Expertise in Kubernetes and container orchestration at scale
  • Strong experience with AKKA.NET or similar actor-based frameworks
  • Proficiency with scripting and automation (Bash, PowerShell, Python)
  • Experience with observability tools (Phobos, Datadog, Prometheus, Grafana, OpenTelemetry, ELK)
  • Hands-on experience with cloud platforms (AWS, Azure, or GCP)
  • Strong PostgreSQL knowledge including performance tuning and query optimization
  • Proven ability to lead incident management and drive postmortem processes
  • Experience designing and maintaining CI/CD pipelines (GitHub Actions, Azure Pipelines, GitLab CI)
  • Experience with Docker
  • Experience with Terraform and infrastructure-as-code practices
  • Familiarity with source control platforms (GitHub, GitLab, Azure DevOps)
  • Experience programming in C#

The Company
HQ: Houston, TX
26,270 Employees

What We Do

NOV delivers technology-driven solutions to empower the global energy industry. For more than 150 years, NOV has pioneered innovations that enable its customers to safely produce abundant energy while minimizing environmental impact. The energy industry depends on NOV’s deep expertise and technology to continually improve oilfield operations and assist in efforts to advance the energy transition towards a more sustainable future. NOV powers the industry that powers the world.
