SRE Lead

Dallas, TX
Senior level
Information Technology • Consulting
The Role
The SRE Lead will oversee the reliability and performance of cloud-based and on-premises platforms, lead a team of engineers, implement observability solutions, and drive operational excellence. Responsibilities include managing incidents, developing automation tools, conducting POCs for generative AI applications, mentoring team members, and staying updated on AI advancements.
Summary Generated by Built In

Our Company

We’re Hitachi Digital Services, a global digital solutions and transformation business with a bold vision of our world’s potential. We’re people-centric and here to power good. Every day, we future-proof urban spaces, conserve natural resources, protect rainforests, and save lives. This is a world where innovation, technology, and deep expertise come together to take our company and customers from what’s now to what’s next. We make it happen through the power of acceleration.

Imagine the sheer breadth of talent it takes to bring a better tomorrow closer to today. We don’t expect you to ‘fit’ every requirement – your life experience, character, perspective, and passion for achieving great things in the world are equally important to us.

The team

At Hitachi Digital Services, our team is driven by a shared passion for innovation, collaboration, and creating transformative solutions that impact the world. As part of our Dallas-based SRE team, you will join a diverse, inclusive, and supportive environment that values continuous learning, cutting-edge technologies, and empowering individuals to lead impactful change. Together, we engineer reliability and performance for critical systems, ensuring our solutions are robust, efficient, and scalable.

This hybrid role requires you to be on-site three days a week. We prefer local candidates as relocation assistance is not available.

The role

As an SRE Lead, you will play a pivotal role in ensuring the availability, reliability, and performance of our cloud-based and on-premises platforms. You'll lead a talented team of engineers to troubleshoot, optimize, and drive operational excellence while championing automation and SRE best practices. In this role, you'll define and manage incident processes, lead generative AI platform initiatives, and mentor team members to align with the highest standards of operational excellence. You will also have the unique opportunity to drive innovation in generative AI applications, working with cutting-edge technologies to shape the future of AI-driven systems.

This position is ideal for individuals who thrive in a dynamic environment and are eager to lead with creativity, problem-solving skills, and a commitment to continuous improvement.  

What You’ll Be Doing

  • Leading a team of platform, application, and incident SREs to manage and resolve complex production issues.
  • Improving application performance, availability, and reliability.
  • Implementing observability solutions for proactive issue identification and optimization.
  • Managing processes for incidents, changes, releases, and deployments.
  • Developing automation tools (IaC, alert as code, dashboard as code) to enhance efficiency.
  • Conducting POCs to implement tools supporting generative AI platforms.
  • Analyzing trends in incidents, problems, and alerts to drive operational improvements.
  • Documenting SOPs, critical systems information, and best practices for current and future use.
  • Providing technical guidance and mentorship to junior SRE team members.
  • Staying updated on advancements in generative AI technologies and responsible AI practices.

What you’ll bring

  • Proven experience with SRE principles and practices in managing on-premises and cloud applications.
  • Knowledge of generative AI applications and related technologies.
  • Strong leadership skills, with the ability to drive team performance and continuous improvement.
  • Analytical skills for resolving complex technical issues, ensuring system reliability, and minimizing downtime.
  • Excellent communication and collaboration skills to work effectively with cross-functional teams.

Mandatory Skills

  • Expertise in SRE principles: anomaly detection, root cause analysis, and predictive maintenance.
  • Proficiency in defining SLIs, SLOs, and error budgets.
  • Experience leading an operations team in application production environments.
  • Knowledge of programming and scripting languages (e.g., Java, Python, PowerShell).
  • Hands-on experience with Kubernetes and OpenTelemetry.
  • Understanding of generative AI, large language models (LLMs), and responsible AI.
  • Familiarity with DevOps methodologies, tools, and automation (e.g., CI/CD pipelines, Terraform, Helm).
  • Experience with public/private cloud platforms (e.g., AWS, Azure, GCP).

Preferred Skills

  • Knowledge of fine-tuning models, prompt engineering, retrieval-augmented generation (RAG), and cost optimization techniques.

About us

We’re a global team of innovators. Together, we harness engineering excellence and passion to co-create meaningful solutions to complex challenges. We turn organizations into data-driven leaders that can make a positive impact on their industries and society. If you believe that innovation can bring a better tomorrow closer to today, this is the place for you.

Championing diversity, equity, and inclusion

Diversity, equity, and inclusion (DEI) are integral to our culture and identity. Diverse thinking, a commitment to allyship, and a culture of empowerment help us achieve powerful results. We want you to be you, with all the ideas, lived experience, and fresh perspective that brings. We support your uniqueness and encourage people from all backgrounds to apply and realize their full potential as part of our team.

How we look after you

We help take care of your today and tomorrow with industry-leading benefits, support, and services that look after your holistic health and wellbeing. We’re also champions of life balance and offer flexible arrangements that work for you (role and location dependent). We’re always looking for new ways of working that bring out our best, which leads to unexpected ideas. So here, you’ll experience a sense of belonging, and discover autonomy, freedom, and ownership as you work alongside talented people you enjoy sharing knowledge with.

We’re proud to say we’re an equal opportunity employer and welcome all applicants for employment without attention to race, colour, religion, sex, sexual orientation, gender identity, national origin, veteran status, age, disability status, or any other protected characteristic. Should you need reasonable accommodations during the recruitment process, please let us know so that we can do our best to set you up for success.


The Company
Hyderabad
1,644 Employees
On-site Workplace

What We Do

Hitachi Digital Services is an independent services business that focuses on delivering a unified operating model for cloud, data, IoT and managed services.

Playing a pivotal role in Hitachi's digital transformation strategy, Hitachi Digital Services places a strong emphasis on Generative AI to deliver an integrated end-to-end digital transformation for enterprises. The company is strategically positioned within the Hitachi Digital portfolio of companies to leverage the synergies between operational technology (OT), information technology (IT), and product and service offerings.

Such positioning allows Hitachi Digital Services to work closely with Hitachi Digital, the new Hitachi Vantara and Hitachi group businesses, including GlobalLogic, to create an integrated end-to-end digital transformation solution for enterprises.
