HPC Engineer - Network

Hiring Remotely in IND
Remote
Mid level
Big Data • Cloud • Hardware • Software • App development
WWT makes a new world happen.
The Role
Implement, configure, and tune high-performance fabrics for AI clusters (InfiniBand and RoCEv2). Run Ansible automation, validate links and performance (ib_*, NCCL-tests), manage UFM, support host networking and DPUs, perform firmware lifecycle work, and provide L2 network support during shift-aligned delivery windows.
Job Summary & Responsibilities

Technical Competencies

Essential Skills

High-Performance Networking:

  • InfiniBand Mastery: Deep operational knowledge of NVIDIA Quantum InfiniBand switches, cable types (NDR/HDR), and troubleshooting commands.
  • AI Ethernet (RoCEv2): Solid understanding of RDMA over Converged Ethernet (RoCEv2), including the configuration of PFC and ECN on switches (Spectrum/Arista/Cisco/Juniper).
  • Fabric Management: Experience with NVIDIA UFM (Unified Fabric Manager) for managing large-scale fabrics.

Automation & Tools:

  • NetDevOps: Proficiency in Ansible for network automation (e.g., ansible-networking collections).
  • Linux Networking: Comfortable navigating Linux CLI to troubleshoot host-side networking (ip link, tcpdump, sysctl tuning).
  • Protocol Knowledge: Practical implementation skills in BGP, EVPN, and VXLAN for multi-tenant AI clouds.

Desirable Experience

  • Cisco AI Integration: Experience with Cisco Nexus Dashboard or Cisco 8000 series in an AI context.
  • DPU Configuration: Experience configuring NVIDIA BlueField DPUs and working with the DOCA framework.
  • Optical Networking: Ability to interpret transceiver signal levels (DOM/DDM) to diagnose Layer 1 optical faults.
  • AI Fabric Orchestration: Experience deploying and managing AI clusters using Netris or Cisco Nexus Hyperfabric AI to automate fabric provisioning and operations.

Certifications

Highly Desirable:

  • NVIDIA-Certified Professional: AI Networking (NCP-AIN)
  • NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO)
  • Cisco Certified Network Professional (CCNP) Data Center

Success Metrics (KPIs)

  • Fabric Efficiency: Achieving >95% effective bandwidth efficiency on NCCL-test benchmarks across the provisioned cluster.
  • Layer 1 Accuracy: Zero "ghost links" or unstable connections handed over to the Compute team (all bad cables identified during the burn-in phase).
  • Ticket Velocity: Consistently meeting SLAs for network-related support tickets.
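
The Fabric Efficiency KPI above can be checked with simple arithmetic. The sketch below is illustrative only: the posting does not define "effective bandwidth efficiency", so it is assumed here to mean the all-reduce bus bandwidth reported by NCCL-tests divided by the per-GPU network line rate. The function names, the 16-rank job size, and the sample figures are hypothetical; the bus-bandwidth formula for all-reduce, busbw = algbw * 2(n-1)/n, follows the NCCL-tests performance notes.

```python
def allreduce_bus_bandwidth(alg_bw_gbytes: float, n_ranks: int) -> float:
    """Convert all-reduce algorithm bandwidth (GB/s) to bus bandwidth (GB/s).

    NCCL-tests defines busbw = algbw * 2 * (n - 1) / n for all-reduce.
    """
    if n_ranks < 2:
        raise ValueError("all-reduce needs at least two ranks")
    return alg_bw_gbytes * 2 * (n_ranks - 1) / n_ranks


def fabric_efficiency(bus_bw_gbytes: float, line_rate_gbits: float) -> float:
    """Ratio of measured bus bandwidth to NIC line rate (assumed KPI definition)."""
    line_rate_gbytes = line_rate_gbits / 8  # Gb/s -> GB/s
    return bus_bw_gbytes / line_rate_gbytes


def meets_kpi(efficiency: float, target: float = 0.95) -> bool:
    """True when efficiency is strictly above the target (>95% in the posting)."""
    return efficiency > target


if __name__ == "__main__":
    # Hypothetical run: 16 ranks, 25.5 GB/s algorithm bandwidth, 400 Gb/s (NDR) links.
    bus_bw = allreduce_bus_bandwidth(25.5, 16)
    eff = fabric_efficiency(bus_bw, 400)
    print(f"busbw={bus_bw:.2f} GB/s efficiency={eff:.1%} pass={meets_kpi(eff)}")
```

With these sample numbers the bus bandwidth is 47.8125 GB/s against a 50 GB/s line rate, so the run sits just above the 95% threshold. A real acceptance test would take the values from the NCCL-tests output for the message sizes agreed with the client.
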
Preferred Qualifications

Role Title: HPC Engineer – Network

Location: India (Must align with Client Time Zone)

Employment Type: Full-Time


About the Role

The HPC Engineer - Network acts as the primary execution engine for the connectivity that binds the AI Factory together. While the Domain Architect designs the topology and the Senior Engineer directs the squad, you are the "Builder" responsible for configuring, stabilising, and tuning the high-speed interconnects. You are a "doer" who is as comfortable debugging a flapping InfiniBand link via CLI as you are pushing a configuration update to fifty switches using Ansible.

As a Systems Integrator, we design and deliver bespoke, high-scale AI factories. In this role, you will move beyond standard enterprise networking (campus/Wi-Fi) to deploy lossless, high-bandwidth fabrics for NVIDIA SuperPOD, NVIDIA BasePOD, and Cisco AI Factory environments.

In this role, you will operate with a 100% focus on Delivery, executing the Low-Level Designs (LLD) assigned by your Squad Lead. You will own the "Network" in the critical "Compute-Network-Storage" triad, ensuring that the GPU compute nodes can communicate at line rate without congestion.

CRITICAL REQUIREMENT: This role typically operates on Shift Hours to align with the onshore client's time zone (e.g., early shifts for Australian clients, or split shifts for European clients).

Key Responsibilities

  1. Fabric Configuration & Provisioning
  • Switch Configuration: Execute the configuration of high-performance switches (NVIDIA Quantum InfiniBand, NVIDIA Spectrum-X Ethernet, Cisco Nexus) using defined templates and automation.
  • NetDevOps Execution: Run and maintain Ansible playbooks to push configurations, update switch software and firmware (Cumulus Linux/NX-OS), and enforce compliance across the fabric.
  • Subnet Management: Configure and tune NVIDIA Unified Fabric Manager (UFM) to ensure optimal routing and fault tolerance.
  • Host Networking: Assist the Compute team in configuring host-side adapters (ConnectX SuperNICs, BlueField DPUs) to ensure correct IP addressing, MTU, and driver parameters.
  2. Validation & Performance Tuning
  • Link Validation: Verify physical connectivity and link health using tools like ibstat, ibdiagnet, and ethtool to identify faulty cables or transceivers immediately after installation by Field Engineers.
  • Performance Testing: Execute network-specific benchmarks (e.g., ib_write_bw, ib_send_bw, NCCL-tests) to validate that the fabric delivers full bisection bandwidth and low latency.
  • Congestion Control: Implement and tune Quality of Service (QoS) settings, including Priority Flow Control (PFC), Explicit Congestion Notification (ECN), and Data Center Quantized Congestion Notification (DCQCN) to prevent packet loss and optimize throughput in RoCEv2 environments.
  3. Operations & Support
  • Fabric Telemetry: Configure monitoring agents (e.g., Prometheus node exporters, UFM Telemetry) to visualize traffic flows and detect "tail latency" issues.
  • Ticket Resolution: Handle L2 support tickets for network issues, such as "Node isolation," "Slow All-Reduce operations," or "Fabric flapping."
  • Lifecycle Management: Execute firmware upgrades on switches and DPUs during maintenance windows, ensuring strict compatibility with the NVIDIA OFED stack.

Top Skills

Ansible
Arista
BGP
CCNP Data Center
Cisco
Cisco 8000
Cisco Nexus Dashboard
Cisco Nexus Hyperfabric AI
ConnectX SuperNICs
Cumulus
DOCA
ethtool
EVPN
Explicit Congestion Notification (ECN)
HDR
ib_send_bw
ib_write_bw
ibdiagnet
ibstat
ip link
Juniper
Linux CLI
NCCL
NDR
Netris
NVIDIA BlueField DPU
NVIDIA OFED
NVIDIA Quantum InfiniBand
NVIDIA Spectrum-X
NVIDIA UFM
NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO)
NVIDIA-Certified Professional: AI Networking (NCP-AIN)
NX-OS
Priority Flow Control (PFC)
Prometheus
RoCEv2
sysctl
tcpdump
Transceiver DOM/DDM
UFM Telemetry
VXLAN

The Company
HQ: Maryland Heights, MO
9,000 Employees
Year Founded: 1990

What We Do

World Wide Technology is a systems integrator that provides information technology and supply chain solutions. Fueled by creativity and ideation, World Wide Technology strives to accelerate its growth and nurture future innovation. From our world-class culture, to our generous benefits, to developing cutting-edge technology solutions, WWT constantly works towards its mission of creating a profitable growth company that is a great place to work. We encourage our employees to embrace collaboration, get creative, and think outside the box when delivering some of the most advanced technology solutions for our customers. At a glance, WWT was founded in 1990 in St. Louis, Missouri. We employ over 9,000 individuals and closed nearly $17 billion in revenue. We have an inclusive culture and believe our core values are the key to company and employee success. WWT is proud to announce that it has been named on the FORTUNE "100 Best Companies to Work For®" list for the 12th consecutive year!

Why Work With Us

Our extensive partnerships with best-in-class technology companies, coupled with our strong culture, allow for world-class delivery of transformative business solutions driven by IT.
