Principal Site Reliability Engineer (Intelligent Automation)

South San Francisco, CA, USA
In-Office
163K-302K Annually
Senior level
Healthtech • Biotech
The Role
The role involves architecting and implementing Infrastructure as Code (IaC) solutions for ML and HPC workloads, ensuring global availability, automating processes, leading technical teams, and optimizing costs while maintaining compliance.
Summary Generated by Built In

A healthier future. It’s what drives us to innovate. To continuously advance science and ensure everyone has access to the healthcare they need today and for generations to come. Creating a world where we all have more time with the people we love. That’s what makes us Roche.

Advances in AI, data and computational sciences are transforming drug discovery and development. Roche’s Research and Early Development organizations at Genentech (gRED) and Pharma (pRED) have demonstrated how these technologies accelerate R&D, leveraging data and novel computational models to drive impact. Seamless data sharing and access to models across gRED and pRED are essential to maximising these opportunities. The Computational Sciences Center of Excellence (CS CoE) is a strategic, unified group whose goal is to harness the transformative power of data and Artificial Intelligence (AI) to assist our scientists in both pRED and gRED to deliver more innovative and life-changing medicines for patients worldwide.

Within the CS CoE, the Data and Digital Catalyst (DDC) organization leads the modernization of our computational and data ecosystems by integrating digital technologies across Research and Early Development to empower stakeholders, advance data-driven science and accelerate decision-making.

The Solutions team within the DDC organization develops modernized and interconnected computational and data ecosystems. As a Site Reliability Engineer in the Solutions Engineering capability, you will work closely with our engineering colleagues and play a pivotal role in designing, implementing, and maintaining scalable, resilient, and supportable cloud-based platform solutions.

The focus will be on enabling research applications, Machine Learning (ML) workloads, and HPC environments through automation, efficient resource management, and Infrastructure as Code (IaC) tooling. As a member of the DDC team, you will help mature the scalable platforms that unlock the potential of our diverse scientific data, accelerating the discovery and development of life-changing treatments for patients.

The Opportunity: 

Infrastructure as Code (IaC) Design and Implementation

  • Architect and implement IaC solutions using tools like Terraform, Spacelift, or CloudFormation to provision and manage cloud infrastructure for ML and HPC workloads. Automate the deployment of scalable ML pipelines, HPC clusters, and supporting services across global regions.

Global Availability and Resiliency

  • Architect resilient and highly available solutions for ML and HPC workloads using cloud-native practices such as auto-scaling, load balancing, and failover mechanisms. Implement disaster recovery (DR) and business continuity plans for critical systems to ensure global operational integrity. Conduct chaos engineering experiments to validate system reliability and identify potential weaknesses.

Automation and Observability

  • Develop automation scripts and workflows to streamline infrastructure management, deployment, and scaling for ML and HPC use cases. Implement robust monitoring, logging, and alerting frameworks using tools like Prometheus, Grafana, Datadog, or ELK Stack to provide deep insights into system health and performance. Apply knowledge of AIOps incident management processes and tooling.

Collaboration and Leadership

  • Provide technical leadership to a team of engineers, fostering a culture of collaboration, innovation, and continuous improvement. Partner with cross-functional teams to align infrastructure solutions with business objectives and ML/HPC workload requirements. Mentor and train junior engineers in IaC practices, ML, and HPC infrastructure design.

Cost Optimization and Governance

  • Monitor and optimize cloud infrastructure usage and costs for ML and HPC workloads. Ensure compliance with organizational security, governance, and regulatory policies in all IaC and cloud implementations.

Who You Are: 

  • Bachelor’s or Master’s degree in Computer Science or a similar technical field, or equivalent experience, and 7+ years of experience in software engineering or Site Reliability Engineering (SRE).

  • Proven expertise in supporting and deploying IaC solutions in cloud environments (AWS, Azure, or GCP) for ML and HPC workloads.

  • Background in MLOps pipelines, including model versioning, CI/CD for ML, and feature store integration, as well as experience with managed ML services (e.g., AWS SageMaker, Google AI Platform, or Azure ML).

  • Deep understanding of cloud-native architectures, including autoscaling, serverless, and multi-region deployments.

  • Technical Skills:

    • Advanced proficiency with IaC tools: Terraform, Pulumi, or CloudFormation.

    • Expert in scripting and automation: Python, Bash, or Go.

    • Strong understanding of GPU-accelerated computing (e.g., NVIDIA CUDA, TensorFlow) and HPC workload scaling.

    • Knowledge of distributed systems, storage solutions, and data pipelines.

    • Familiar with monitoring and observability tools: Prometheus, Grafana, Datadog, or similar.

  • Soft Skills:

    • Strong problem-solving skills, with a methodical approach to troubleshooting.

    • Excellent communication, leadership, and mentoring abilities.

    • Ability to work collaboratively across teams in a fast-paced, dynamic environment.

Preferred Qualifications

  • Certifications in cloud platforms (e.g., AWS Certified Solutions Architect, GCP Professional Cloud Architect, or Azure Solutions Architect).

  • Experience with distributed ML frameworks and data engineering pipelines (e.g., Horovod, TensorFlow Distributed, Apache Airflow, Apache Spark).

  • Experience with compliance frameworks (e.g., GDPR, SOC 2, ISO 27001).

Onsite presence, on our South San Francisco campus, is expected for at least 3 days a week.

Relocation benefits are not available for this job posting.

The expected salary range for this position based on the primary location of California is $162,600 - $302,000. Actual pay will be determined based on experience, qualifications, geographic location, and other job-related factors permitted by law. A discretionary annual bonus may be available based on individual and Company performance. This position also qualifies for the benefits detailed at the link provided below.

Benefits

Genentech is an equal opportunity employer. It is our policy and practice to employ, promote, and otherwise treat any and all employees and applicants on the basis of merit, qualifications, and competence. The company's policy prohibits unlawful discrimination, including but not limited to, discrimination on the basis of Protected Veteran status, individuals with disabilities status, and consistent with all federal, state, or local laws.

If you have a disability and need an accommodation in relation to the online application process, please contact us by completing the Accommodations for Applicants form.

Genentech Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Genentech and has not been reviewed or approved by Genentech.

  • Healthcare Strength: Health coverage is described as comprehensive across medical, dental, vision, mental health, and prescriptions, supported by HSAs/FSAs and broad wellness resources. On‑site fitness and health centers, mental‑health clinicians, and specialized programs like fully covered preventive cancer screenings and menopause support deepen the offering.
  • Retirement Support: Retirement benefits feature a 401(k) with up to a 4% company match plus an additional annual 6% company contribution to eligible pay. Additional financial protections such as life and accident insurance complement salary, bonuses, and stock options.
  • Leave & Time Off Breadth: Time away includes about 20 paid vacation days, paid holidays, personal days, and a year‑end shutdown. A paid six‑week sabbatical every six years notably expands long‑term time‑off flexibility.

The Company
HQ: South San Francisco, CA
20,069 Employees
Year Founded: 1976

What We Do

Considered the founder of the industry, Genentech, now a member of the Roche Group, has been delivering on the promise of biotechnology for more than 40 years. Genentech is a biotechnology company dedicated to pursuing groundbreaking science to discover and develop medicines for people with serious and life-threatening diseases. Our transformational discoveries include the first targeted antibody for cancer and the first medicine for primary progressive multiple sclerosis. We're passionate about finding solutions for people facing the world's most difficult-to-treat conditions. That is why we use cutting-edge science to create and deliver innovative medicines around the globe. To us, science is personal. Making a difference in the lives of millions starts when you make a change in yours.
