At Roche you can show up as yourself, embraced for the unique qualities you bring. Our culture encourages personal expression, open dialogue, and genuine connections, where you are valued, accepted and respected for who you are, allowing you to thrive both personally and professionally. This is how we aim to prevent, stop and cure diseases and ensure everyone has access to healthcare today and for generations to come. Join Roche, where every voice matters.
The Position
Job description
As an Infrastructure Provisioning and Management Engineer within the Accelerated Compute Engineering (ACE) team, you will be responsible for overseeing and advancing our core infrastructure management and provisioning tech stack. This role has a strong focus on driving configuration-as-code, infrastructure-as-code (IaC), and modern automated provisioning best practices across our high-performance compute (HPC) and industry-leading AI Factory.
You will own the lifecycle, deployment, and optimization of bare-metal and virtualized compute environments that power Roche's advanced computing initiatives. By treating infrastructure strictly as code and eliminating manual configurations, you will ensure our advanced clusters are highly reproducible, securely patched, and rapidly scalable to meet the evolving demands of computational science and large-scale AI workloads.
Description of the area
Hosting and Infrastructure (HI) provides mission-critical on-premise infrastructure, cloud hosting, connectivity, and technology products that enable all functions at every Roche site to develop, innovate, connect, and deliver compliant digital products across the Roche Enterprise.
The Value Streams - Accelerated Compute Engineering (ACE) Team is focused on driving both customer success and platform success by acting as a center of excellence and delivery for the High Performance Compute and AI Infrastructure supporting AI and HPC use cases across Roche. The team facilitates seamless onboarding and adoption for business vertical customers who need accelerated compute. It supports infrastructure consumers whose requirements center on high availability, seamless data transfer, flexibility, speed, and the rapidly changing needs of AI, and it helps them achieve rapid time-to-value.
Job Responsibilities
Automated Provisioning & Cluster Orchestration
Design, deploy, and manage large-scale automated provisioning systems for multi-node HPC and AI Factory environments.
Own and maintain the infrastructure management and provisioning tech stack underpinning the orchestration, monitoring, and provisioning of complex GPU and CPU workloads.
Streamline bare-metal provisioning and node imaging pipelines to ensure minimal downtime and rapid expansion capabilities.
Infrastructure-as-Code (IaC) & Configuration Governance
Enforce a strict configuration-as-code and infrastructure-as-code mindset, replacing manual interventions with repeatable automation scripts.
Author, review, and maintain complex Ansible playbooks and roles for configuration management, patch deployment, and compliance drift remediation.
Establish robust CI/CD pipelines using GitLab to test, validate, and deploy infrastructure changes safely across development, staging, and production clusters.
Operating System Engineering & Lifecycle Management
In partnership with Enterprise OS teams, standardize and manage operating system builds, with dual proficiency across HPC and AI Factory platforms.
Utilize solutions such as Red Hat Image Builder and NVIDIA Base Command Manager to create optimized, compliant, and secure custom golden images tailored for AI and high-performance computing workloads.
Manage OS lifecycles, including kernel tuning, automated package updates, and vulnerability management, ensuring alignment with global security standards.
Platform Reliability & Collaboration
Implement proactive monitoring and alerting for infrastructure provisioning health, node availability, and configuration drift.
Address and help resolve complex, systemic infrastructure failures, contributing to post-mortem analyses to continuously improve platform resilience.
Qualifications
Education / Experience
Bachelor’s or an advanced degree in Computer Science, Computer Engineering, or a similar technical discipline.
5+ years of experience in systems engineering, DevOps, or platform infrastructure roles, with a proven track record of managing enterprise Linux environments at scale.
Deep, practical knowledge of operating system internals for both RHEL and Ubuntu OS.
Technical & Business Skills:
Automation & Orchestration: Advanced capability with Ansible on the command line and experience building scalable infrastructure pipelines using GitLab CI/CD.
Provisioning Tooling: Experience using NVIDIA Base Command Manager (formerly Bright Cluster Manager) and Red Hat Image Builder (or related tools like Kickstart/Satellite).
Modern Engineering Mindset: Strong adherence to git-based workflows, code-review methodologies, and infrastructure-as-code principles.
Troubleshooting Depth: Ability to isolate complex, multi-layered faults bridging hardware, kernel configurations, and automation scripts.
Leadership & Mindset:
Lean & Agile Mindset: Passionate about continuous improvement, eliminating technical debt, and automating repetitive tasks to achieve scale.
Collaboration & Communication: Strong collaborative skills with an enterprise mindset, capable of working fluidly across team boundaries to drive platform success.
Intellectual Curiosity: Highly self-motivated to explore and adopt emerging technologies in the fast-evolving landscape of HPC and AI infrastructure engineering.
Who we are
A healthier future drives us to innovate. Together, more than 100’000 employees across the globe are dedicated to advancing science, ensuring everyone has access to healthcare today and for generations to come. Our efforts result in more than 26 million people treated with our medicines and over 30 billion tests conducted using our Diagnostics products. We empower each other to explore new possibilities, foster creativity, and keep our ambitions high, so we can deliver life-changing healthcare solutions that make a global impact.
Let’s build a healthier future, together.
Roche is an Equal Opportunity Employer.
Skills Required
- Bachelor's degree in Computer Science, Computer Engineering, or similar technical discipline
- 5+ years experience in systems engineering, DevOps, or platform infrastructure roles
- Proven track record managing enterprise Linux environments at scale
- Deep, practical knowledge of RHEL and Ubuntu operating system internals
- Advanced capability with Ansible (playbooks and roles) for configuration management
- Experience building CI/CD pipelines for infrastructure changes using GitLab CI/CD
- Experience with NVIDIA Base Command Manager and/or Bright Cluster Manager for cluster provisioning
- Experience with Red Hat Image Builder or related tools (Kickstart, Satellite) for OS image creation
- Experience with bare-metal provisioning and node imaging pipelines for HPC/AI clusters
- OS lifecycle management skills including kernel tuning, automated package updates, and vulnerability management
- Strong troubleshooting ability across hardware, kernel configurations, and automation scripts
- Adherence to infrastructure-as-code and configuration-as-code principles and git-based workflows
- Experience implementing monitoring and alerting for provisioning health and node availability
- Ability to collaborate across teams and apply Agile/Lean practices to platform reliability
Roche Compensation & Benefits Highlights
The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Roche and has not been reviewed or approved by Roche.
- Retirement Support: U.S. materials describe a 401(k) with both matching and an additional company contribution, supported by formal plan documents and true‑up features. This structure is positioned as a standout element of the total package, particularly at Genentech.
- Leave & Time Off Breadth: Time‑off provisions include substantial vacation, a year‑end shutdown, and a paid six‑week sabbatical after six years. These elements indicate a recharge‑oriented approach within the U.S. offering.
- Healthcare Strength: Company materials emphasize comprehensive medical, dental, vision, and mental‑health resources alongside well‑being programs. Benefits pages consistently highlight breadth across core health coverage elements.
Roche Insights
What We Do
Roche is a global pioneer in pharmaceuticals and diagnostics focused on advancing science to improve people’s lives. The combined strengths of pharmaceuticals and diagnostics under one roof have made Roche the leader in personalised healthcare – a strategy that aims to fit the right treatment to each patient in the best way possible. Roche is the world’s largest biotech company, with truly differentiated medicines in oncology, immunology, infectious diseases, ophthalmology and diseases of the central nervous system. Roche is also the world leader in in vitro diagnostics and tissue-based cancer diagnostics, and a frontrunner in diabetes management. Founded in 1896, Roche continues to search for better ways to prevent, diagnose and treat diseases and make a sustainable contribution to society. The company also aims to improve patient access to medical innovations by working with all relevant stakeholders. Thirty medicines developed by Roche are included in the World Health Organization Model Lists of Essential Medicines, among them life-saving antibiotics, antimalarials and cancer medicines. Roche has been recognised as the Group Leader in sustainability within the Pharmaceuticals, Biotechnology & Life Sciences Industry ten years in a row by the Dow Jones Sustainability Indices (DJSI).








