Senior Networking Solution Test Engineer, AI Cluster Debugging

Yokneam
In-Office
Senior level
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
The Role
The Senior Networking Solution Test Engineer will design tests, troubleshoot AI clusters, and collaborate with teams to improve networking and system performance.

We are looking for a Senior Networking Test Engineer with strong system‑level debugging skills to join our End‑to‑End Verification team. You will work on cutting‑edge Ethernet‑based AI clusters, owning complex issues across hardware, system software and AI workloads. NVIDIA is widely considered to be one of the technology world’s most desirable employers, and some of the most forward-thinking and hardworking people in the world work here. If you're creative and autonomous, we want to hear from you!

What you’ll be doing:

  • Design and review test and product requirements across the Ethernet / NIC / DPU / Switch portfolio, focusing on large‑scale AI cluster behavior

  • Build and maintain realistic customer‑like testbeds, including heterogeneous hardware, OS / driver combinations and complex network fabrics

  • Own end‑to‑end cluster troubleshooting: reproduce customer scenarios, triage across the stack and drive issues to root cause and fix

  • Read and understand relevant source code to identify defects, validate fixes and improve logging and instrumentation

  • Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments

  • Define tests and guide the automation team to implement robust suites that produce actionable logs, metrics and traces

  • Run regression, performance, functional and scale testing, analyze results and provide clear, data‑driven reports to stakeholders

  • Profile and benchmark deep learning training and inference workloads, correlating model‑level metrics with system and network telemetry to uncover bottlenecks

What we need to see:

  • B.A./B.Sc. in Computer Science, Electrical Engineering, or equivalent IT/Network/Systems experience

  • 5+ years of hands‑on networking or system‑level testing and debugging on Linux

  • Strong Linux networking and debugging skills (for example perf, tcpdump, ethtool, iproute2)

  • Proven production‑grade debugging experience: forming hypotheses, running experiments, and driving issues to root cause under pressure

  • Expertise in host‑side NIC validation and tuning (offloads, queues, interrupts, firmware/driver interactions)

  • Strong knowledge of AI networking libraries (such as NCCL) and protocols (such as RoCE and RDMA), including performance and correctness debugging

  • Ability to read and reason about source code (C/C++/Python or similar) and collaborate closely with developers on fixes

  • Solid scripting and automation skills with Bash / Python / Ansible for setup, log collection, and experiment orchestration

  • Fast learner, familiar with modern AI tools and workflows, able to adapt quickly

  • Excellent analytical, problem‑solving and communication skills, with strong ownership and a collaborative mindset

Ways to stand out from the crowd:

  • Hands‑on debugging of collective communication libraries (for example NCCL) or large‑scale LLM training / inference clusters

  • Experience with large cluster environments (tens to thousands of GPUs or nodes), including incident response and post‑mortem analysis

  • Deep expertise in tuning and debugging congestion control and lossless Ethernet for AI workloads (for example DCQCN, ECN, PFC)

  • Familiarity with NVIDIA networking technologies (for example BlueField / BF3, ConnectX NICs) and their software stack and diagnostics

  • Experience debugging issues that span multiple layers (L2/L3, transport, AI frameworks) or contributing to open‑source networking / AI systems

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


Top Skills

Ansible
Bash
C
C++
DPU
Ethernet
Linux
NCCL
NIC
Python
RDMA
RoCE
Switch

The Company
HQ: Santa Clara, CA
21,960 Employees
Year Founded: 1993

What We Do

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, NVIDIA is increasingly known as “the AI computing company.”
