AI Kernel Writer

Posted 22 Days Ago
Los Altos, CA, USA
In-Office
Mid level
Software
The Role
Design and optimize high-performance compute kernels for AI primitives (GEMM, attention, convolution), profile and tune across heterogeneous hardware, collaborate with compiler/runtime teams (Triton, PyTorch, SYCL), prototype precision formats, and contribute micro-architecture feedback while writing reusable C++/CUDA/Triton/MLIR code.
Summary Generated by Built In
Description
  • Design and implement high-performance compute kernels for AI primitives such as GEMM, attention, normalization, and convolution.
  • Optimize for throughput, latency, and memory hierarchy across heterogeneous compute units (SIMD, matrix engines, DMA).
  • Collaborate with compiler and runtime teams to integrate kernels into Triton, PyTorch, or SYCL pipelines.
  • Profile and tune kernels using tools like Perfetto, VTune, Tracy, or custom simulators.
  • Prototype and evaluate precision formats (FP16/BF16/FP8/e5m2, etc.) and stochastic rounding.
  • Contribute to micro-architecture feedback loops, helping co-design ISA and memory features with the hardware team.
  • Write clear, well-structured, and reusable code (C++/CUDA/Triton/LLVM MLIR).


Requirements
  • Bachelor's or Master's in Computer Science, Computer Engineering, or a related field from a recognized university.
  • Strong background in parallel programming (CUDA, Triton, SYCL, OpenCL, Metal, POSIX Threads, or OpenMP).
  • Experience with optimization of irregular algorithms, such as graph computations or sparse numerical linear algebra, combining high-level data structure design with low-level SIMD and synchronization optimizations.
  • Deep understanding of memory layout, vectorization, thread/block scheduling, and cache behavior.
  • Proficiency in C++11 or higher, with strong knowledge of standard algorithms, data structures, and generic programming paradigms.
  • Experience with code generation for high-performance computations and knowledge of frameworks like BLAS/BLIS/Torch.
  • Skilled in performance analysis and parallel debugging using tools such as Valgrind, GNU Debugger, or CI testing frameworks.
  • Hands-on experience profiling and optimizing compute or AI workloads (e.g., GEMM, softmax, attention).
  • Solid grasp of numerical stability, precision formats, and mixed precision arithmetic.
  • Collaborative work style with the ability to operate effectively in multicultural, cross-disciplinary environments.

Skills Required

  • Bachelor's or Master's in Computer Science, Computer Engineering, or related field
  • Strong background in parallel programming (CUDA, Triton, SYCL, OpenCL, Metal, POSIX Threads, OpenMP)
  • Experience implementing high-performance kernels in C++, CUDA, Triton, or LLVM MLIR
  • Experience optimizing irregular algorithms (graph computations, sparse numerical linear algebra)
  • Deep understanding of memory layout, vectorization, thread/block scheduling, and cache behavior
  • Proficiency in C++11 or higher and generic programming paradigms
  • Experience with code generation for high-performance computations and knowledge of BLAS/BLIS/Torch
  • Skilled in performance analysis and parallel debugging using Valgrind, GDB, CI testing frameworks
  • Hands-on experience profiling and optimizing compute or AI workloads (GEMM, softmax, attention)
  • Solid grasp of numerical stability, precision formats, and mixed precision arithmetic (FP16/BF16/FP8/e5m2)
  • Collaborative work style; ability to work effectively in multicultural, cross-disciplinary teams

The Company
55 Employees
Year Founded: 2023

What We Do

Majestic Labs is reimagining AI infrastructure for the world’s most demanding workloads. Today, organizations are forced to overprovision expensive compute just to access the memory their models need. We took a fundamentally different approach by pairing a massive amount of compute with 1000x the memory to create game-changing improvements in performance, power, and deployment efficiency. Our customers can literally replace racks of traditional AI infrastructure with a single Majestic server.
