Distinguished Engineer - Inference Serving Network and Storage

Posted Yesterday
Austin, TX, USA
Hybrid
Expert/Leader
Artificial Intelligence • Semiconductor
Joining Graphcore gives you a seat at the top table, shaping the future of Artificial Intelligence.
The Role
Lead the networking and storage architecture for inference serving, defining strategy and technical direction for large-scale AI services.
About us

Graphcore is a globally recognized leader in Artificial Intelligence computing systems. The company designs advanced semiconductors and data center hardware that provide the specialized processing power needed to drive AI innovation, while delivering the efficiency required to support its broader adoption.

As part of the SoftBank Group, Graphcore is a member of an elite family of companies responsible for some of the world’s most transformative technologies.

Job Summary

We are seeking a Distinguished Engineer to lead the networking and storage architecture for a new inference serving initiative. This is a chief technologist role for the serving fabric and data path, responsible for defining and driving the end-to-end strategy for networking, storage, observability, provisioning, and automation in support of large-scale AI inference services.

You will shape core technical decisions that directly influence product capability, service differentiation, and competitive advantage. On the networking side, you will lead the design of the serving fabric, inter-partition latency path, management network, QoS and transport tuning, segmentation, observability, and automation. On the storage side, you will define the architecture for model artifact storage, checkpoint distribution, KV and session tiering and restore, telemetry and log storage, and backup and disaster recovery.

Storage is expected to be a critical component of inference serving at scale, particularly for KV cache management, state movement, and service resiliency. You will therefore set technical direction across both networking and storage domains as first-class pillars of the platform.

This is a Grade 7 role for a recognized expert and thought leader who can convert strategic thinking into tangible group-level impact, lead a small team, and influence other functions and external partners.

The Team

You will be in the System Engineering group and work across organizational boundaries with ML software, applied AI, hardware and systems, inference service teams, and other platform and infrastructure groups. You will also engage closely with external partners responsible for key elements of the inference service stack, as well as business counterparts who depend on differentiated service capabilities, reliability, and scale.

This role requires strong technical leadership without relying solely on formal authority. You will be expected to align stakeholders, make architectural trade-offs clear, and drive execution across multiple teams while raising the technical bar for the broader organization.

Responsibilities and Duties
  • Define and coordinate the networking architecture for inference serving, including serving fabric build, inter-partition latency path optimization, and management network architecture.  
  • Lead the strategy for QoS, transport tuning, traffic isolation, segmentation, and service differentiation to support multiple inference SLAs and workload classes.
  • Drive the build of monitoring, resource prioritization, and automated management frameworks for network and storage systems at production scale.  
  • Define the storage architecture for model artifact repositories, checkpoint distribution, session state, telemetry and log storage, backup, and disaster recovery.
  • Lead the design of KV cache storage, tiering, restore, and movement mechanisms as a core platform capability for large-scale inference serving.
  • Optimize network and storage subsystems for demanding AI and HPC workloads, balancing throughput, latency, resiliency, cost, and operational simplicity.
  • Work with ML software and inference service teams to develop infrastructure that supports current methods for deploying large language models, including disaggregated prefill/decode paths, continuous batching, and large-model scaling techniques.
  • Guide architecture for scaling models that use tensor, pipeline, expert, and other parallelism strategies, ensuring the serving infrastructure supports efficient execution and state movement.
  • Establish performance models, benchmarks, and tuning methodologies for end-to-end serving behavior, including tail latency, throughput stability, and recovery characteristics.
  • Lead a small multi-functional team while providing technical direction and architectural oversight across a wider matrixed organization.
  • Influence roadmap, standards, and implementation choices across internal teams and external partners.
  • Act as the senior technical authority for this domain, identifying risks early, resolving complex trade-offs, and ensuring the platform evolves in line with business and product needs.
Candidate Profile

Essentials


  • MS or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent practical experience.
  • Significant industry experience, typically 15+ years, in large-scale systems, distributed infrastructure, or platform architecture.
  • Deep expertise in networking and storage software at scale, including architecture, implementation, configuration, and performance optimization.
  • Proven experience designing and operating networking and storage systems for demanding applications in AI, HPC, or large-scale cloud environments.
  • Strong understanding of high-performance transport, congestion and flow control, QoS, segmentation, telemetry, and production observability.
  • Strong understanding of distributed storage architectures, artifact distribution, checkpointing, caching, replication, backup, disaster recovery, and operational resilience.
  • Demonstrated ability to architect low-latency, high-throughput systems where network and storage behavior materially affect application performance.
  • Experience leading highly ambiguous, cross-functional technical initiatives with impact across multiple teams or product areas.
  • Strong communication and influencing skills, with the ability to align senior technical and business stakeholders.
  • Track record as a recognized expert who drives strategy, shapes technical direction, and delivers solutions beyond existing precedents.

Desirable

  • Familiarity with innovative LLM serving techniques and infrastructure requirements.
  • Experience with prefill/decode disaggregated inference, continuous batching, and differentiated inference services with multiple SLA and QoS tiers.
  • Understanding of model scaling and serving approaches involving tensor, pipeline, expert, and related parallelism techniques.
  • Experience with KV cache management, tiering, restore, and memory/storage trade-offs in inference systems.
  • Knowledge of modern inference serving algorithms, schedulers, and system-level optimization techniques.
  • Experience working with external technology partners, suppliers, or ecosystem collaborators in the delivery of complex infrastructure platforms.
  • Background in production-grade automation and provisioning systems for large infrastructure estates.


Benefits
  
In addition to a competitive salary, Graphcore offers a competitive benefits package. We welcome people of different backgrounds and experiences; we’re committed to building an inclusive work environment that makes Graphcore a great home for everyone. We offer an equal opportunity process and understand that there are visible and invisible differences in all of us. We can provide a flexible approach to interviews and encourage you to talk to us if you require any reasonable adjustments.

Top Skills

AI
Automation Systems
Distributed Infrastructure
HPC
LLM Serving Techniques
Monitoring Frameworks
Networking
Storage Systems


The Company
HQ: Bristol
488 Employees
Year Founded: 2016

What We Do

At Graphcore, we’re building the future of AI compute. We’re a team of semiconductor, software and AI experts, with deep experience in creating the complete AI compute stack - from silicon and software to infrastructure at datacenter scale. As part of the SoftBank Group, backed by significant long-term investment, we are delivering key technology into the fast-growing SoftBank AI ecosystem. To meet the vast and exciting AI opportunity, Graphcore is expanding its teams around the world. We are bringing together the brightest minds to solve the toughest problems, in a place where everyone has the opportunity to make an impact on the company, our products and the future of artificial intelligence.

Why Work With Us

Our team is at the forefront of the machine intelligence revolution, enabling innovators from all industries to build AI-native products to expand human potential. What we do at Graphcore really makes a difference.

Graphcore Offices

Hybrid Workspace

Employees engage in a combination of remote and on-site work.

At Graphcore, we value wellbeing and flexibility to support a healthy work/life balance. Our hybrid approach encourages office-based colleagues to work onsite three days a week, with flexibility built on trust and transparency for everyone.

Typical time on-site: 3 days a week
Headquarters
Austin Office
Bengaluru Office
Cambridge Office
Gdańsk Office
Hsinchu Office
London Office