- Develop and refine a comprehensive 3-year roadmap for a software stack compatible with CUDA, encompassing Runtime, Driver, Compiler, Profiler, Debugger, and AI acceleration libraries
- Define binding specifications that link our upcoming GPU ISA to CUDA APIs, ensuring forward compatibility with CUDA 12.x features
- Evaluate and integrate the latest technological advancements: CUDA Graphs, Transformer Engine, virtual memory management, CUDA dynamic parallelism, CUTLASS 3.x, TMA, Blackwell FP4, among others
- Create a modular, layered Runtime architecture (CUDA → HAL → Kernel → Hardware) applicable across emulators and actual silicon
- Define the task launch protocol, including Queue, Stream, Event, and Graph, as well as the memory model
- Design a dual-mode (JIT & offline) compiler supporting LTO, PGO, Auto-Tuning, and efficient PTX→ISA microcode caching
- Develop GPU virtualization schemes (e.g., MIG) that work across processes and containers
- Implement an end-to-end performance model: Python API → CUDA Runtime → Driver → ISA → Micro-architecture → Board-level interconnect
- Build an observability platform: Nsys-compatible traces, real-time Metric-QPS dashboards, and an AI Advisor for identifying bottlenecks automatically
- Manage internal AI benchmarks as the single source of truth; benchmarks include MLPerf Inference, Stable Diffusion XL, and 70B-parameter LLM workloads
- Co-design, with our hardware architecture team, an ISA compatible with CUDA Compute Capability 12.x
- Collaborate with AI framework teams (PyTorch, TensorFlow, JAX, ONNX Runtime) to build fully reusable kernel libraries
- Partner with Cloud and K8s teams to co-develop Device Plugins, GPU Operators, and RDMA Network Policies
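
For context, the task launch protocol responsibilities above map onto familiar CUDA constructs. The sketch below uses only the public CUDA Runtime API; the kernel name `scale` and the problem size are placeholders, and error checking is omitted for brevity:

```cuda
// Minimal sketch of the Stream / Event / Graph launch flow:
// work is captured into a CUDA Graph, replayed with one launch call,
// and completion is tracked with an Event.
#include <cuda_runtime.h>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture stream work into a graph, then instantiate it once.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, 2.0f, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Replay the captured work; the event marks its completion.
    cudaEvent_t done;
    cudaEventCreate(&done);
    cudaGraphLaunch(exec, stream);
    cudaEventRecord(done, stream);
    cudaEventSynchronize(done);

    cudaEventDestroy(done);
    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

A compatible runtime would need to reproduce exactly this kind of capture-instantiate-replay semantics on top of its own HAL and kernel layers.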
Minimum Requirements:
- 10+ years in systems software, with at least 5 years designing CUDA compute stacks
- Led end-to-end development of a full generation of a GPU Runtime or AI acceleration library
- Comprehensive mastery of PTX/SASS, CUDA Driver API, and cuBLAS/cuDNN internals; experience with LLVM NVPTX backend
- Profound understanding of GPU micro-architecture, including SM architecture, Warp Scheduler, Shared-Memory conflicts, and Tensor Core pipelines
- Proficiency with PCIe/CXL/RDMA topologies, NUMA settings, and GPU Direct RDMA/Storage
What We Do
Xpeng Motors is a leading Chinese electric vehicle and technology company that designs and manufactures intelligent automobiles that are seamlessly integrated with the Internet and utilize the latest advances in artificial intelligence. Focusing on China’s young and tech-savvy consumer base, XPENG Motors strives to offer smart mobility solutions with technology innovation and cutting-edge R&D. The company’s initial backers include its CEO & Chairman He Xiaopeng, the founder of UCWeb Inc. and a former Alibaba executive. It was co-founded in 2014 by Henry Xia and He Tao, former senior executives at Guangzhou Auto with expertise in innovative automotive technology and R&D. It has received funding from prominent Chinese and international investors including Alibaba Group, Foxconn Group and IDG Capital. Currently with 3,000 employees, the company is headquartered in Guangzhou and has design, R&D, manufacturing and sales & marketing divisions in Silicon Valley, San Diego, Beijing, Shanghai, Zhaoqing (Guangdong Province) and Zhengzhou (Henan Province).