At Roche you can show up as yourself, embraced for the unique qualities you bring. Our culture encourages personal expression, open dialogue, and genuine connections, where you are valued, accepted and respected for who you are, allowing you to thrive both personally and professionally. This is how we aim to prevent, stop and cure diseases and ensure everyone has access to healthcare today and for generations to come. Join Roche, where every voice matters.
The Position
Job description
As a Workload Orchestration Engineer within the Accelerated Compute Engineering (ACE) team, you will be responsible for overseeing and advancing our workload orchestration tech stack across both our High-Performance Computing (HPC) and industry-leading AI Factory platforms. With the rapid expansion of our compute infrastructure, efficiently scheduling, managing, and maximizing the utilization of our CPU and GPU environments is paramount.
You will own the deployment, configuration, and fine-tuning of orchestration platforms that schedule massive, parallel computational workloads. By implementing robust scheduling policies for traditional scientific workflows and modern containerized AI workloads, you will bridge the gap between heavy compute capacity and efficient execution. Your work will directly ensure that Roche’s researchers, data scientists, and engineers can seamlessly run large-scale AI model training and computational science simulations.
Description of the area
Hosting and Infrastructure (HI) provides mission-critical on-premise infrastructure, cloud hosting, connectivity, and technology products that enable all functions at every Roche site to develop, innovate, connect, and deliver compliant digital products across the Roche Enterprise.
The Value Streams - Accelerated Compute Engineering (ACE) Team drives both customer success and platform success by acting as a center of excellence and delivery for the High-Performance Computing and AI infrastructure that supports AI and HPC use cases across Roche. The team facilitates seamless onboarding and adoption for business vertical customers who need accelerated compute, supporting their requirements for high availability, seamless data transfer, flexibility, speed, and the rapidly changing demands of AI, and helping them achieve rapid time-to-value.
Job Responsibilities
Orchestration Stack Deployment & Governance
Design, implement, and maintain the SLURM Workload Manager ecosystem across our HPC cluster architectures, ensuring high availability and optimal resource distribution.
Deploy and manage Run:ai as the core orchestration and virtualization layer for the AI Factory, enabling fractional GPU allocation and dynamic resource allocation.
Evaluate, architect, and implement SLURM Slinky integrations where required to seamlessly bridge Kubernetes-based AI orchestration with traditional HPC cluster resources.
Containerization & Workload Optimization
Define best practices and frameworks for containerized scientific execution, utilizing Singularity/Apptainer and/or Enroot to provide secure, reproducible performance environments for HPC.
Translate user and workload requirements into optimized scheduling parameters (e.g., topology-aware scheduling, multi-node scaling).
Actively profile and tune scheduling queues, quality-of-service (QoS) parameters, and fair-share policies to maximize multi-tenant efficiency.
Platform Reliability & Telemetry
Partner with Observability Engineers to implement continuous monitoring, telemetry, and reporting dashboards to track scheduler efficiency, queue wait times, and hardware utilization rates.
Troubleshoot complex workload failures, including distributed training synchronization issues, MPI communication bottlenecks, and driver incompatibilities.
Maintain configuration-as-code models for the scheduling tier, leveraging automation to deploy cluster policies uniformly.
Qualifications
Education / Experience
Bachelor’s or an advanced degree in Computer Science, Applied Mathematics, Computational Engineering, or a similar technical discipline.
5+ years of systems engineering experience, with a heavy emphasis on workload scheduling, resource management, and cluster optimization for multi-tenant environments.
Deep technical familiarity with Enterprise Linux operating systems and distributed systems architecture.
HPC Scheduling & Tooling: Expert-level proficiency in administering SLURM, including complex partition designs, accounting, and plug-in management. Highly proficient with Singularity for container runtime execution.
AI Orchestration: Hands-on experience or deep architectural understanding of Run:ai, Kubernetes, and containerized GPU scheduling paradigms.
Infrastructure Literacy: Solid understanding of high-speed interconnects (InfiniBand, RoCE) and multi-node communication architectures (MPI, NCCL) as they relate to job placement.
Automation: Proficiency in automating scheduler configurations and telemetry gathering, or in using infrastructure automation tooling.
Leadership & Mindset:
Lean & Agile Mindset: Highly focused on driving efficiency, reducing idle compute time, and creating frictionless pathways for user workload submissions.
Collaboration & Advocacy: Outstanding capability to translate scientific and AI model workflow challenges into scalable scheduler configurations.
Intellectual Curiosity: A strong passion for remaining ahead of industry trends regarding GPU slicing, fractionalization, and the convergence of AI workloads with traditional HPC schedulers.
Who we are
A healthier future drives us to innovate. Together, more than 100,000 employees across the globe are dedicated to advancing science, ensuring everyone has access to healthcare today and for generations to come. Our efforts result in more than 26 million people treated with our medicines and over 30 billion tests conducted using our Diagnostics products. We empower each other to explore new possibilities, foster creativity, and keep our ambitions high, so we can deliver life-changing healthcare solutions that make a global impact.
Let’s build a healthier future, together.
Roche is an Equal Opportunity Employer.
Skills Required
- Bachelor's or advanced degree in Computer Science, Applied Mathematics, Computational Engineering, or similar
- 5+ years systems engineering experience focused on workload scheduling, resource management, and cluster optimization
- Expert-level proficiency administering SLURM including partition design, accounting, and plug-in management
- Highly proficient with Singularity (Apptainer) for container runtime execution
- Hands-on experience or deep architectural understanding of Run:ai, Kubernetes, and containerized GPU scheduling
- Deep technical familiarity with Enterprise Linux and distributed systems architecture
- Solid understanding of high-speed interconnects (InfiniBand, RoCE) and multi-node communication architectures (MPI, NCCL)
- Proficiency in automating scheduler configurations, telemetry gathering, or using infrastructure automation tooling
- Ability to profile and tune scheduling queues, QoS, fair-share policies, and troubleshoot distributed workload failures
- Collaboration skills to partner with observability teams and translate scientific/AI workflow needs into scheduler configurations
Roche Compensation & Benefits Highlights
The following summarizes recurring compensation and benefits themes identified from responses that popular LLMs generated to common candidate questions about Roche. It has not been reviewed or approved by Roche.
- Retirement Support: U.S. materials describe a 401(k) with both matching and an additional company contribution, supported by formal plan documents and true-up features. This structure is positioned as a standout element of the total package, particularly at Genentech.
- Leave & Time Off Breadth: Time-off provisions include substantial vacation, a year-end shutdown, and a paid six-week sabbatical after six years. These elements indicate a recharge-oriented approach within the U.S. offering.
- Healthcare Strength: Company materials emphasize comprehensive medical, dental, vision, and mental-health resources alongside well-being programs. Benefits pages consistently highlight breadth across core health coverage elements.
Roche Insights
What We Do
Roche is a global pioneer in pharmaceuticals and diagnostics focused on advancing science to improve people’s lives. The combined strengths of pharmaceuticals and diagnostics under one roof have made Roche the leader in personalised healthcare – a strategy that aims to fit the right treatment to each patient in the best way possible. Roche is the world’s largest biotech company, with truly differentiated medicines in oncology, immunology, infectious diseases, ophthalmology and diseases of the central nervous system. Roche is also the world leader in in vitro diagnostics and tissue-based cancer diagnostics, and a frontrunner in diabetes management. Founded in 1896, Roche continues to search for better ways to prevent, diagnose and treat diseases and make a sustainable contribution to society. The company also aims to improve patient access to medical innovations by working with all relevant stakeholders. Thirty medicines developed by Roche are included in the World Health Organization Model Lists of Essential Medicines, among them life-saving antibiotics, antimalarials and cancer medicines. Roche has been recognised as the Group Leader in sustainability within the Pharmaceuticals, Biotechnology & Life Sciences Industry ten years in a row by the Dow Jones Sustainability Indices (DJSI).