MLOps Site Reliability Engineer

Posted 16 Days Ago
Chennai, Tamil Nadu
In-Office
2-2 Annually
Junior
Hardware • Semiconductor
The Role
The MLOps Site Reliability Engineer ensures the reliability and scalability of ML infrastructure, collaborating with teams to optimize workflows and manage CI/CD pipelines while ensuring security and compliance.
Summary Generated by Built In

Company Overview

KLA is a global leader in diversified electronics for the semiconductor manufacturing ecosystem. Virtually every electronic device in the world is produced using our technologies. No laptop, smartphone, wearable device, voice-controlled gadget, flexible screen, VR device or smart car would have made it into your hands without us. KLA invents systems and solutions for the manufacturing of wafers and reticles, integrated circuits, packaging, printed circuit boards and flat panel displays. The innovative ideas and devices that are advancing humanity all begin with inspiration, research and development. KLA puts an above-average focus on innovation, and we invest 15% of sales back into R&D. Our expert teams of physicists, engineers, data scientists and problem-solvers work together with the world’s leading technology providers to accelerate the delivery of tomorrow’s electronic devices. Life here is exciting and our teams thrive on tackling really hard problems. There is never a dull moment with us.

Group/Division

With over 40 years of semiconductor process control experience, chipmakers around the globe rely on KLA to ensure that their fabs ramp next-generation devices to volume production quickly and cost-effectively. Enabling the movement towards advanced chip design, KLA's Global Products Group (GPG), which is responsible for creating all of KLA’s metrology and inspection products, is looking for the best and the brightest research scientists, software engineers, application development engineers, and senior product technology process engineers.

Central Engineering is KLA's largest engineering organization, comprising 9 Centers-of-Excellence (CoEs) in various disciplines applied across all product groups in the company. These CoEs include Handling & Automation, Precision Motion Control, Sensors & Image Acquisition, Platform Design, and Packaging Engineering, among others. Talent includes over 500 engineers across global centers in Israel, China, India, and the US. Each CoE contributes not just talent and deliverables per discipline toward product programs, but also subject matter expertise, best practices, roadmaps, specialized facilities, apparatus, models, and analytics. These differentiate KLA not only in WHAT we do, but also in HOW we do it.

Job Description/Preferred Qualifications

We are seeking a highly skilled and motivated MLOps Site Reliability Engineer (SRE) to join our team. In this role, you will be responsible for ensuring the reliability, scalability, and performance of our machine learning infrastructure. You will work closely with data scientists, machine learning engineers, and software developers to build and maintain robust and efficient systems that support our machine learning workflows. This position offers an exciting opportunity to work on cutting-edge technologies and make a significant impact on our organization's success.

Responsibilities:
  • Design, implement, and maintain scalable and reliable machine learning infrastructure.
  • Collaborate with data scientists and machine learning engineers to deploy and manage machine learning models in production.
  • Develop and maintain CI/CD pipelines for machine learning workflows.
  • Monitor and optimize the performance of machine learning systems and infrastructure.
  • Implement and manage automated testing and validation processes for machine learning models.
  • Ensure the security and compliance of machine learning systems and data.
  • Troubleshoot and resolve issues related to machine learning infrastructure and workflows.
  • Document processes, procedures, and best practices for machine learning operations.
  • Stay up to date with the latest developments in MLOps and related technologies.
Required Qualifications:
  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • Proven experience as a Site Reliability Engineer (SRE) or in a similar role.
  • Strong knowledge of machine learning concepts and workflows.
  • Proficiency in programming languages such as Python, Java, or Go.
  • Experience with cloud platforms such as AWS, Azure, or Google Cloud.
  • Familiarity with containerization technologies like Docker and Kubernetes.
  • Experience with CI/CD tools such as Jenkins, GitLab CI, or CircleCI.
  • Strong problem-solving skills and the ability to troubleshoot complex issues.
  • Excellent communication and collaboration skills.
Preferred Qualifications:
  • Master's degree in Computer Science, Engineering, or a related field.
  • Experience with machine learning frameworks such as TensorFlow, PyTorch, or Scikit-learn.
  • Knowledge of data engineering and data pipeline tools such as Apache Spark, Apache Kafka, or Airflow.
  • Experience with monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack.
  • Familiarity with infrastructure as code (IaC) tools like Terraform or Ansible.
  • Experience with automated testing frameworks for machine learning models.
  • Knowledge of security best practices for machine learning systems and data.

Minimum Qualifications

Master's or Bachelor's degree and 2 years of related work experience

We offer a competitive, family-friendly total rewards package. We design our programs to reflect our commitment to an inclusive environment, while ensuring we provide benefits that meet the diverse needs of our employees.

KLA is proud to be an equal opportunity employer.

Be aware of potentially fraudulent job postings or suspicious recruiting activity by persons posing as KLA employees. KLA never asks for any financial compensation to be considered for an interview, to become an employee, or for equipment. Further, KLA does not work with any recruiters or third parties who charge such fees either directly or on behalf of KLA. Please ensure that you have searched KLA’s Careers website for legitimate job postings. KLA follows a recruiting process that involves multiple interviews, in person or by video conference, with our hiring managers. If you are concerned that a communication, an interview, an offer of employment, or an employee is not legitimate, please send an email to [email protected] to confirm that the person you are communicating with is an employee. We take your privacy very seriously and handle your information confidentially.

Top Skills

Airflow
Ansible
Apache Kafka
Spark
AWS
Azure
CircleCI
Docker
ELK Stack
GitLab CI
Go
GCP
Grafana
Java
Jenkins
Kubernetes
Prometheus
Python
PyTorch
Scikit-Learn
TensorFlow
Terraform

The Company
HQ: Milpitas, CA
10,001 Employees

What We Do

KLA develops industry-leading equipment and services that enable innovation throughout the electronics industry. We provide advanced process control and process-enabling solutions for manufacturing wafers and reticles. In close collaboration with leading customers across the globe, our expert teams of physicists, engineers, data scientists and problem-solvers design solutions that move the world forward.
