Majestic Labs AI

AI Kernel Writer

Reposted 16 Days Ago

Tel Aviv, ISR

In-Office

Senior level

Software

The Role

Design and implement high-performance kernels for AI primitives (GEMM, attention, normalization, convolution). Optimize throughput, latency, and memory hierarchy across heterogeneous compute units. Integrate kernels with Triton/PyTorch/SYCL, profile and tune with Perfetto/VTune/Tracy, prototype precision formats and stochastic rounding, and contribute micro-architecture feedback. Write reusable C++/CUDA/Triton/LLVM MLIR code.

Summary Generated by Built In

Description

Design and implement high-performance compute kernels for AI primitives such as GEMM, attention, normalization, and convolution.
Optimize for throughput, latency, and memory hierarchy across heterogeneous compute units (SIMD, matrix engines, DMA).
Collaborate with compiler and runtime teams to integrate kernels into Triton, PyTorch, or SYCL pipelines.
Profile and tune kernels using tools like Perfetto, VTune, Tracy, or custom simulators.
Prototype and evaluate precision formats (FP16, BF16, FP8 variants such as E5M2, etc.) and stochastic rounding.
Contribute to micro-architecture feedback loops, helping co-design ISA and memory features with the hardware team.
Write clear, well-structured, and reusable code (C++/CUDA/Triton/LLVM MLIR).

Requirements

Bachelor's or Master's in Computer Science, Computer Engineering, or a related field from a recognized university.
Strong background in parallel programming (CUDA, Triton, SYCL, OpenCL, Metal, POSIX Threads, or OpenMP).
Experience with optimization of irregular algorithms, such as graph computations or sparse numerical linear algebra, combining high-level data structure design with low-level SIMD and synchronization optimizations.
Deep understanding of memory layout, vectorization, thread/block scheduling, and cache behavior.
Proficiency in C++11 or higher, with strong knowledge of standard algorithms, data structures, and generic programming paradigms.
Experience with code generation for high-performance computations and knowledge of libraries and frameworks such as BLAS, BLIS, and Torch.
Skilled in performance analysis and parallel debugging using tools such as Valgrind, GNU Debugger, or CI testing frameworks.
Hands-on experience profiling and optimizing compute or AI workloads (e.g., GEMM, softmax, attention).
Solid grasp of numerical stability, precision formats, and mixed precision arithmetic.
Collaborative work style with the ability to operate effectively in multicultural, cross-disciplinary environments.

Skills Required

Bachelor's or Master's in Computer Science, Computer Engineering, or related field
Strong background in parallel programming (CUDA, Triton, SYCL, OpenCL, Metal, POSIX Threads, or OpenMP)
Experience optimizing irregular algorithms (graph computations, sparse numerical linear algebra) combining high-level data structures with low-level SIMD and synchronization optimizations
Deep understanding of memory layout, vectorization, thread/block scheduling, and cache behavior
Proficiency in C++11 or higher with strong knowledge of algorithms, data structures, and generic programming
Experience with code generation for high-performance computations and familiarity with frameworks like BLAS/BLIS/Torch
Skilled in performance analysis and parallel debugging using tools such as Valgrind, GNU Debugger, or CI testing frameworks
Hands-on experience profiling and optimizing compute or AI workloads (e.g., GEMM, softmax, attention)
Solid grasp of numerical stability, precision formats, and mixed precision arithmetic (FP16, BF16, FP8 variants such as E5M2)
Ability to write clear, well-structured, reusable code (C++/CUDA/Triton/LLVM MLIR) and collaborate across compiler, runtime, and hardware teams

The Company

55 Employees

Year Founded: 2023

What We Do

Majestic Labs is reimagining AI infrastructure for the world’s most demanding workloads. Today, organizations are forced to overprovision expensive compute just to access the memory their models need. We took a fundamentally different approach, pairing a massive amount of compute with 1000x the memory to deliver game-changing improvements in performance, power, and deployment efficiency. Our customers can replace racks of traditional AI infrastructure with a single Majestic server.