Network Engineer, AI/ML Infrastructure

Posted Yesterday
Santa Clara, CA
In-Office
150K-250K Annually
Mid level
Artificial Intelligence • Machine Learning
The Role
This role involves designing, building, and optimizing high-performance networking infrastructure for AI/ML operations. Responsibilities include managing InfiniBand and Ethernet fabrics, troubleshooting issues, and collaborating on network upgrades and security implementations.
About The Role

We're seeking an experienced Network Engineer to design, build, and optimize the high-performance networking infrastructure powering our AI/ML operations in Toronto. You'll work at the cutting edge of network technology—managing InfiniBand and ultra-high-speed Ethernet fabrics that connect NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, and hundreds of servers.

You'll be hands-on with the full lifecycle of our network infrastructure: planning, building, testing, deploying, and keeping everything running at peak performance. That means troubleshooting issues as they arise, monitoring network performance and throughput, developing automation to streamline operations, and working closely with HPC and ML teams to ensure they have the bandwidth they need. You'll also help us plan for future capacity and evaluate emerging network technologies as we scale to meet increasingly demanding workloads.

Responsibilities

  • Configure and maintain InfiniBand and high-speed Ethernet fabrics
  • Optimize network performance for RDMA and GPU-to-GPU communication
  • Manage network switches (Mellanox, NVIDIA, Micas Networks)
  • Troubleshoot network bottlenecks and latency issues
  • Plan and execute network upgrades and expansions
  • Implement network security (firewalls, VLANs, ACLs)
  • Collaborate on storage network optimization
  • Monitor infrastructure

Minimum Qualifications

  • 4+ years of network engineering experience in production environments
  • Strong understanding of L2/L3 networking protocols (TCP/IP, BGP, OSPF, VLANs)
  • Hands-on experience with high-speed networking (100Gb+ Ethernet and InfiniBand)
  • Hands-on experience with network security (firewalls, ACLs, network segmentation)
  • Knowledge of HPC network topologies
  • Experience with InfiniBand fabrics and related technologies, including RDMA, RoCE, and IPoIB
  • Strong troubleshooting and problem-solving skills

Preferred Qualifications

  • Experience in data center environments or AI/ML infrastructure
  • Hands-on experience with high-performance Ethernet switches (e.g., Broadcom Tomahawk) and the latest InfiniBand switches (e.g., NVIDIA/Mellanox)
  • Experience optimizing networks for GPU-to-GPU communication
  • Experience with open-source firewall solutions (OPNsense, pfSense, or similar)
  • Experience with network automation tools
  • Understanding of distributed storage networking (Ceph cluster networks)
  • Familiarity with network monitoring and observability tools (Prometheus, Grafana)
  • Knowledge of multi-site network connectivity and WAN optimization
  • Familiarity with cloud networking in at least one platform (AWS, GCP, or Azure) including VPC design, site-to-site VPN configuration, Direct Connect/ExpressRoute/Cloud Interconnect, hybrid cloud connectivity, and cloud-to-datacenter network integration

If you're a natural problem-solver with a passion for continuous learning, we'd love to hear from you.

Top Skills

ACLs
AWS
Azure
BGP
Ceph
Ethernet
Firewalls
GCP
GPU-to-GPU Communication
Grafana
InfiniBand
IPoIB
OSPF
Prometheus
RDMA
RoCE
TCP/IP
VLANs

The Company
Santa Clara, CA
21 Employees
Year Founded: 2023

What We Do

We are transforming how stories are told, knowledge is learned, and insights are gathered.
