Workload Orchestration Engineer

Roche

Posted Yesterday

Madrid, Comunidad de Madrid, ESP

In-Office

Senior level

Healthtech • Biotech • Pharmaceutical

The Role

Deploy, configure, and tune SLURM and Run:ai orchestration across HPC and AI platforms. Integrate SLURM Slinky with Kubernetes, define containerization best practices (Singularity/Apptainer/Enroot), optimize scheduling/topology-aware policies, profile queues and QoS, implement telemetry and config-as-code, and troubleshoot distributed training, MPI/NCCL, and driver issues to maximize multi-tenant resource utilization.

Summary Generated by Built In

At Roche you can show up as yourself, embraced for the unique qualities you bring. Our culture encourages personal expression, open dialogue, and genuine connections, where you are valued, accepted and respected for who you are, allowing you to thrive both personally and professionally. This is how we aim to prevent, stop and cure diseases and ensure everyone has access to healthcare today and for generations to come. Join Roche, where every voice matters.

The Position

Job description

As a Workload Orchestration Engineer within the Accelerated Compute Engineering (ACE) team, you will be responsible for overseeing and advancing our workload orchestration tech stack across both our High-Performance Computing (HPC) and industry-leading AI Factory platforms. With the rapid expansion of our compute infrastructure, efficiently scheduling, managing, and maximizing the utilization of our CPU and GPU environments is paramount.

You will own the deployment, configuration, and fine-tuning of orchestration platforms that schedule massive, parallel computational workloads. By implementing robust scheduling policies for traditional scientific workflows and modern containerized AI workloads, you will bridge the gap between heavy compute capacity and efficient execution. Your work will directly ensure that Roche’s researchers, data scientists, and engineers can seamlessly run large-scale AI model training and computational science simulations.

Description of the area

Hosting and Infrastructure (HI) provides mission-critical on-premises infrastructure, cloud hosting, connectivity, and technology products that enable all functions at every Roche site to develop, innovate, connect, and deliver compliant digital products across the Roche Enterprise.

The Value Streams - Accelerated Compute Engineering (ACE) Team is focused on driving both customer success and platform success by acting as a center of excellence and delivery for the High Performance Compute and AI Infrastructure supporting AI and HPC use cases across Roche. The team facilitates seamless onboarding and adoption for business vertical customers who need accelerated compute. It supports these infrastructure consumers with platforms optimized for high availability, seamless data transfer, flexibility, speed, and the rapidly changing needs of AI, helping them achieve rapid time-to-value.

Job Responsibilities

Orchestration Stack Deployment & Governance

Design, implement, and maintain the SLURM Workload Manager ecosystem across our HPC cluster architectures, ensuring high availability and optimal resource distribution.
Deploy and manage Run:ai as the core orchestration and virtualization layer for the AI Factory, enabling fractional GPU allocation and dynamic resource allocation.
Evaluate, architect, and implement SLURM Slinky integrations where required to seamlessly bridge Kubernetes-based AI orchestration with traditional HPC cluster resources.

Containerization & Workload Optimization

Define best practices and frameworks for containerized scientific execution, utilizing Singularity/Apptainer and/or Enroot to provide secure, reproducible performance environments for HPC.
Translate user and workload requirements into optimized scheduling parameters (e.g., topology-aware scheduling, multi-node scaling).
Actively profile and tune scheduling queues, quality-of-service (QoS) parameters, and fair-share policies to maximize multi-tenant efficiency.

Platform Reliability & Telemetry

Partner with Observability Engineers to implement continuous monitoring, telemetry, and reporting dashboards to track scheduler efficiency, queue wait times, and hardware utilization rates.
Troubleshoot complex workload failures, including distributed training synchronization issues, MPI communication bottlenecks, and driver incompatibilities.
Maintain configuration-as-code models for the scheduling tier, leveraging automation to deploy cluster policies uniformly.

Qualifications

Education / Experience

Bachelor’s or an advanced degree in Computer Science, Applied Mathematics, Computational Engineering, or a similar technical discipline.
5+ years of systems engineering experience, with a heavy emphasis on workload scheduling, resource management, and cluster optimization for multi-tenant environments.
Deep technical familiarity with Enterprise Linux operating systems and distributed systems architecture.

HPC Scheduling & Tooling: Expert-level proficiency in administering SLURM, including complex partition designs, accounting, and plug-in management. Highly proficient with Singularity for container runtime execution.
AI Orchestration: Hands-on experience or deep architectural understanding of Run:ai, Kubernetes, and containerized GPU scheduling paradigms.
Infrastructure Literacy: Solid understanding of high-speed interconnects (InfiniBand, RoCE) and multi-node communication architectures (MPI, NCCL) as they relate to job placement.
Automation: Proficiency in automating scheduler configurations and telemetry gathering, or in using infrastructure automation tooling.

Leadership & Mindset

Lean & Agile Mindset: Highly focused on driving efficiency, reducing idle compute time, and creating frictionless pathways for user workload submissions.
Collaboration & Advocacy: Outstanding capability to translate scientific and AI model workflow challenges into scalable scheduler configurations.
Intellectual Curiosity: A strong passion for remaining ahead of industry trends regarding GPU slicing, fractionalization, and the convergence of AI workloads with traditional HPC schedulers.

Who we are

A healthier future drives us to innovate. Together, more than 100,000 employees across the globe are dedicated to advancing science, ensuring everyone has access to healthcare today and for generations to come. Our efforts result in more than 26 million people treated with our medicines and over 30 billion tests conducted using our Diagnostics products. We empower each other to explore new possibilities, foster creativity, and keep our ambitions high, so we can deliver life-changing healthcare solutions that make a global impact.

Let’s build a healthier future, together.

Roche is an Equal Opportunity Employer.

Skills Required

Bachelor's or advanced degree in Computer Science, Applied Mathematics, Computational Engineering, or similar
5+ years systems engineering experience focused on workload scheduling, resource management, and cluster optimization
Expert-level proficiency administering SLURM including partition design, accounting, and plug-in management
Highly proficient with Singularity (Apptainer) for container runtime execution
Hands-on experience or deep architectural understanding of Run:ai, Kubernetes, and containerized GPU scheduling
Deep technical familiarity with Enterprise Linux and distributed systems architecture
Solid understanding of high-speed interconnects (InfiniBand, RoCE) and multi-node communication architectures (MPI, NCCL)
Proficiency in automating scheduler configurations, telemetry gathering, or using infrastructure automation tooling
Ability to profile and tune scheduling queues, QoS, fair-share policies, and troubleshoot distributed workload failures
Collaboration skills to partner with observability teams and translate scientific/AI workflow needs into scheduler configurations

Roche Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Roche and has not been reviewed or approved by Roche.

Retirement Support — U.S. materials describe a 401(k) with both matching and an additional company contribution, supported by formal plan documents and true‑up features. This structure is positioned as a standout element of the total package, particularly at Genentech.
Leave & Time Off Breadth — Time‑off provisions include substantial vacation, a year‑end shutdown, and a paid six‑week sabbatical after six years. These elements indicate a recharge‑oriented approach within the U.S. offering.
Healthcare Strength — Company materials emphasize comprehensive medical, dental, vision, and mental‑health resources alongside well‑being programs. Benefits pages consistently highlight breadth across core health coverage elements.

The Company

Provincia de Buenos Aires

93,797 Employees

Year Founded: 1896

What We Do

Roche is a global pioneer in pharmaceuticals and diagnostics focused on advancing science to improve people’s lives. The combined strengths of pharmaceuticals and diagnostics under one roof have made Roche the leader in personalised healthcare – a strategy that aims to fit the right treatment to each patient in the best way possible. Roche is the world’s largest biotech company, with truly differentiated medicines in oncology, immunology, infectious diseases, ophthalmology and diseases of the central nervous system. Roche is also the world leader in in vitro diagnostics and tissue-based cancer diagnostics, and a frontrunner in diabetes management. Founded in 1896, Roche continues to search for better ways to prevent, diagnose and treat diseases and make a sustainable contribution to society. The company also aims to improve patient access to medical innovations by working with all relevant stakeholders. Thirty medicines developed by Roche are included in the World Health Organization Model Lists of Essential Medicines, among them life-saving antibiotics, antimalarials and cancer medicines. Roche has been recognised as the Group Leader in sustainability within the Pharmaceuticals, Biotechnology & Life Sciences Industry ten years in a row by the Dow Jones Sustainability Indices (DJSI).