ML Performance Engineer

Reposted 6 Days Ago
Cupertino, CA
Hybrid
174K-231K Annually
Mid level
Artificial Intelligence • Energy
The Role
The ML Infrastructure Engineer will build and optimize scalable ML platforms, improve performance and efficiency of models, mentor junior engineers, and contribute to a mission-driven team focused on renewable energy.
Summary Generated by Built In
The Company
Gridmatic Inc. is a high-growth startup with offices in the Bay Area and Houston that is accelerating the clean energy transition by applying our expertise in data, machine learning, and energy to power markets. We are the rare startup that has multiple years of profitability without raising venture capital. At Gridmatic, we foster a collaborative and inclusive culture where learning and growth are constant. We move quickly, solve problems with integrity, and balance environmental responsibility with data-driven excellence.

We are looking for a Machine Learning Infrastructure Engineer to accelerate the decarbonization of the electricity system by building and optimizing the backbone of our ML platform. The ideal candidate will have solid expertise in machine learning, distributed systems, and GPU-based training. They will design scalable, high-performance infrastructure for training, inference, and evaluation. They will push the boundaries of throughput and efficiency on large-scale time-series and weather datasets, while shaping the long-term vision of our ML platform. A successful candidate will thrive on continuous learning across engineering, ML systems, and energy markets, while contributing to a collaborative, mission-driven team. The ideal candidate must have strong deep learning fundamentals in addition to strong software engineering skills.

You will:

  • Own a significant piece of our ML platform while rapidly building and iterating scalable, robust distributed infrastructure for ML training, inference, and evaluation on large-scale time-series and weather datasets.
  • Optimize throughput and cost by supporting model training and deployment across multiple clusters and clouds.
  • Improve the efficiency of machine learning models and other workloads by optimizing latency, throughput, and memory consumption. This involves pushing the boundaries of current hardware capabilities through techniques like GPU performance engineering.
  • Help define the long-term vision for Gridmatic’s ML platform.
  • Play a key role in mentoring junior engineers and interns, contributing to a collaborative, innovative, and growth-oriented team culture.

You must be:

  • A strong engineer with 3+ years of full-time industry experience working on ML systems. You possess a deep understanding of the codebases you work in and write readable, scalable code.
  • Experienced in optimizing GPU throughput in deep learning models.
  • Experienced in distributed training and inference of large models on GPU clusters, utilizing core libraries and frameworks such as PyTorch, PyTorch Lightning, and Ray.
  • A self-starter with a strong sense of independence and ownership, and the capability to engineer large, robust systems from the initial design and conceptualization to productionization.
  • A holder of a Master's or Doctorate degree in engineering or a related technical field.
  • A mission-driven individual who is enthusiastic about working toward a renewable grid and diving into the intersection of ML and energy. No prior energy experience required, but curiosity and a willingness to learn are must-haves!

Nice to haves:

  • End-to-end proficiency in building, maintaining, and debugging cluster infrastructure, utilizing Kubernetes and Terraform.
  • Expertise in identifying performance bottlenecks and designing and writing high-performance code for large-scale ML workloads.
  • Experience with at least one of: torch.profiler, TorchDynamo, TorchInductor, Triton, or other deep learning compiler stacks.
  • Understanding of GPU architectures or experience with GPU kernel programming.
  • Knowledge of cluster communication protocols such as NCCL or Gloo.
  • Experience working with any of the following: weather data, energy systems, time-series forecasting, electricity markets, or financial trading.

Join our team and make a difference! Click below or email us at [email protected].

Top Skills

Distributed Systems
GPU
Kubernetes
Machine Learning
PyTorch
PyTorch Lightning
Ray
SQL
Terraform
Zarr

The Company
HQ: Cupertino, CA
33 Employees

What We Do

Gridmatic is an AI-powered energy company focused on decarbonizing the grid by helping with the transition to clean energy. We utilize advanced AI to model and optimize energy supply, demand, and transactions. By leveraging these insights, we aim to increase the adoption of clean energy sources while enhancing the stability and efficiency of the US energy market.
