Designs and implements the low-level runtime stack that drives FuriosaAI's NPU hardware to its theoretical limits — from device driver interfaces and DMA-based I/O to kernel execution scheduling, multi-node inference, and embedded firmware.
Responsibilities
Develops the low-level runtime responsible for DMA-based I/O operations and kernel execution scheduling, maximizing inference throughput while minimizing end-to-end latency.
Builds and optimizes asynchronous execution pipelines that orchestrate data movement and compute across the NPU hardware.
Enables multi-node inference by implementing foundational communication primitives, including RDMA-based data transfer for low-latency, high-bandwidth inter-node operations.
Develops embedded firmware (PERT) that runs on the NPU's integrated ARM core, managing on-device scheduling, synchronization, and hardware resource control.
Profiles and tunes system-level performance across the full runtime stack — from firmware to user-space — to eliminate bottlenecks in real-world inference workloads.
Bachelor's degree in Computer Science, Electrical Engineering, or equivalent work experience.
Strong communication skills for cross-team requirement gathering and technical alignment.
3+ years of systems programming experience in Rust, C, or C++.
Solid understanding of computer architecture fundamentals: memory hierarchy, cache coherency, OS, DMA, interrupts, and MMIO.
Deep expertise in low-latency runtime systems, embedded firmware development, or high-performance I/O — especially in the context of accelerator hardware.
Experience designing and implementing low-latency asynchronous execution models and scheduling systems.
Experience with DMA engines, scatter-gather I/O, or other zero-copy data transfer mechanisms.
Experience developing embedded firmware for ARM-based processors (bare-metal or lightweight RTOS environments).
Familiarity with RDMA technologies and high-performance networking for distributed or multi-node systems.
Experience with CUDA low-level runtime internals such as CUDA Graphs, stream-based execution, and asynchronous kernel launch optimization.
Experience with kernel-level performance optimizations (e.g., Linux kernel modules, eBPF, perf, ftrace).
Understanding of deep learning inference workloads and their hardware execution characteristics.
Experience with profiling and performance tuning of system software on accelerator or SoC platforms.
Skills Required
- Bachelor's degree in Computer Science or equivalent work experience
- 3+ years of systems programming experience in Rust, C, or C++
- Solid understanding of computer architecture fundamentals
What We Do
FuriosaAI designs and develops data center accelerators for the most advanced AI models and applications. Our mission is to make AI computing sustainable so everyone on Earth has access to powerful AI.
Our Background
Three misfit engineers, one each from the hardware, software, and algorithm fields, who had previously worked at AMD, Qualcomm, and Samsung, founded FuriosaAI in 2017 to build the world’s best AI chips. The company has raised more than $100 million, with investments from DSC Investment, Korea Development Bank, and Naver, the largest internet provider in Korea. We have partnered on our first two products with a wide range of industry leaders, including TSMC, ASUS, SK Hynix, GUC, and Samsung. FuriosaAI now has over 140 employees across Seoul, Silicon Valley, and Europe.
Our Approach
We are building full-stack solutions that offer the optimal combination of programmability, efficiency, and ease of use. We achieve this through a “first principles” approach to engineering: We start with the core problem, which is how to accelerate.