This role has been designed as ‘Hybrid’ with an expectation that you will work on average 2 days per week from an HPE office.
Who We Are:
Hewlett Packard Enterprise is the global edge-to-cloud company advancing the way people live and work. We help companies connect, protect, analyze, and act on their data and applications wherever they live, from edge to cloud, so they can turn insights into outcomes at the speed required to thrive in today’s complex world. Our culture thrives on finding new and better ways to accelerate what’s next. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good. If you are looking to stretch and grow your career, our culture will embrace you. Open up opportunities with HPE.
Job Description:
Compute at HPE helps organizations power their edge-to-cloud platform with proven, workload-optimized products, solutions, and services. Our leading supercomputing technologies enable customers to transform and modernize their IT infrastructure, solve complex problems, and support new business opportunities with purpose-built infrastructure and software. Join us to redefine what’s next for you.
The Senior Cloud Reliability Engineer is responsible for ensuring that cloud-based infrastructure and applications are observable, resilient, and well-supported. This role focuses on enabling partner development teams to operate effectively by enhancing observability, establishing clear incident response protocols, maintaining comprehensive runbooks and documentation, conducting readiness exercises, and integrating application security best practices. The aim is to foster a reliable, secure, and efficient cloud environment that supports continuous delivery and operational excellence. In this role you would join the Compute Reliability Engineering team, which acts as a catalyst to create a culture of operationally excellent engineers.
Management Level Definition:
This is a senior technical position requiring extensive experience in cloud operations, reliability, and developer enablement. The individual works autonomously to improve operational readiness, guides others in best practices, and leads initiatives to enhance system reliability. Leadership in planning, problem resolution, and cross-team collaboration is essential.
Responsibilities:
- Develop and maintain comprehensive observability solutions (metrics, logs, traces) to enable rapid issue detection and resolution.
- Define, document, and contribute to incident response protocols and escalation procedures.
- Create, review, and improve runbooks, documentation, and onboarding materials for support and engineering teams.
- Organize and facilitate gamedays or simulated incident response exercises to test and improve team readiness.
- Collaborate with development teams to embed Application Security (AppSec) best practices into cloud solutions and operational processes.
- Monitor system health and performance, and implement automation to reduce manual intervention.
- Provide guidance and training to engineering teams to support operational excellence and reliability.
- Stay current with industry best practices in cloud reliability, observability, security, and incident management.
Education and Experience Required:
- Bachelor’s degree in Computer Science, Information Technology, or a related field; Master’s preferred.
- 10+ years of experience in cloud operations, site reliability, or cloud development roles.
- Proven experience with cloud platforms such as AWS, Azure, or Google Cloud.
- Experience in designing or maintaining observability, incident response, and operational documentation.
- Familiarity with automation, monitoring tools, and incident management frameworks.
Knowledge and Skills:
- Strong understanding of cloud architecture, services, and security considerations.
- Expertise in observability tools (e.g., Prometheus, Grafana, CloudWatch).
- Knowledge of incident response processes and tooling.
- Ability to create clear, actionable runbooks and documentation.
- Experience with automation and scripting (Python, Bash, etc.).
- Familiarity with DevOps practices, CI/CD pipelines, and infrastructure as code.
- Excellent communication, facilitation, and cross-team collaboration skills.
- Strong problem-solving skills and a proactive approach to operational challenges.
What We Can Offer You:
Health & Wellbeing
We strive to provide our team members and their loved ones with a comprehensive suite of benefits that supports their physical, financial and emotional wellbeing.
Personal & Professional Development
We also invest in your career because the better you are, the better we all are. We have specific programs catered to helping you reach any career goals you have — whether you want to become a knowledge expert in your field or apply your skills to another division.
Unconditional Inclusion
We are unconditionally inclusive in the way we work and celebrate individual uniqueness. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good.
Let's Stay Connected:
Follow @HPECareers on Instagram to see the latest on people, culture and tech at HPE.
Job:
Engineering
Job Level:
TCP_04
HPE is an Equal Employment Opportunity/Veterans/Disabled/LGBT employer. We do not discriminate on the basis of race, gender, or any other protected category, and all decisions we make are made on the basis of qualifications, merit, and business need. Our goal is to be one global team that is representative of our customers, in an inclusive environment where we can continue to innovate and grow together. Please click here: Equal Employment Opportunity.
Hewlett Packard Enterprise is EEO Protected Veteran/Individual with Disabilities.
HPE will comply with all applicable laws related to employer use of arrest and conviction records, including laws requiring employers to consider for employment qualified applicants with criminal histories.
Recruitment Fraud Alert
We have become aware of an increase in fraudulent recruitment activities in which individuals impersonate our company or authorized recruitment agencies to offer fake employment opportunities. These scams may occur through false websites, emails, social media, or chat-based applications and often aim to obtain personal information or money. Please note that Hewlett Packard Enterprise (HPE), its direct and indirect subsidiaries and affiliated companies, and its authorized recruitment agencies/vendors will never charge a candidate a registration fee, hiring fee, or any other fee in connection with its recruitment and hiring process. We also never request personal information such as bank account details, Social Security numbers, or national IDs via social media or chat applications.
All legitimate job opportunities will come through official company channels, and candidates are responsible for verifying the credentials of any third party claiming to represent the company. Any reliance on fraudulent communication is at the individual’s own risk, and HPE disclaims legal liability for any resulting damages. If you suspect recruitment fraud, do not share personal information or make any payments and report the incident to your local authorities immediately.
Skills Required
- Bachelor's degree in Computer Science, Information Technology, or related field
- Master's degree (preferred)
- 10+ years of experience in cloud operations, site reliability, or cloud development
- Proven experience with cloud platforms (AWS, Azure, or Google Cloud)
- Experience designing or maintaining observability, incident response, and operational documentation
- Expertise with observability tools (Prometheus, Grafana, CloudWatch)
- Experience with automation and scripting (Python, Bash)
- Familiarity with automation, monitoring tools, and incident management frameworks
- Familiarity with DevOps practices, CI/CD pipelines, and Infrastructure as Code
- Excellent communication, facilitation, and cross-team collaboration skills
Hewlett Packard Enterprise Compensation & Benefits Highlights
The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Hewlett Packard Enterprise and has not been reviewed or approved by Hewlett Packard Enterprise.
- Parental & Family Support: Policies include up to 26 weeks of fully paid parental leave with options for phased return to work, supplemented by backup care resources. Program materials emphasize broad availability, with specifics varying by country and role.
- Wellbeing & Lifestyle Benefits: Wellbeing initiatives such as Wellness Fridays and 60 hours of paid volunteer time provide additional paid time for rest, community, and flexibility. Hybrid/flexible work and wellbeing resources further reinforce a lifestyle-oriented package.
- Retirement Support: Offerings include a company 401(k) match, an Employee Stock Purchase Plan, and HSA seeding under certain medical plans. These financial benefits are positioned as solid components of the total rewards mix.
Hewlett Packard Enterprise Insights
What We Do
In 1939, Bill Hewlett and Dave Packard, college friends turned business partners, started the original Silicon Valley startup in the space of a rented Palo Alto garage. Starting with audio oscillators, the friends built the foundation for a company that would grow to become a global leader in enterprise technology. More than 75 years later, our success is exemplified through our employees’ drive to advance ideas that bring meaningful innovations to life for our customers and partners around the globe. We are guided by our mission to help customers use technology to turn ideas into value, and empower them to transform industries, markets and lives. We simplify Hybrid IT, power the Intelligent Edge and provide the expertise to make it all happen.







