Graphcore is a globally recognized leader in Artificial Intelligence computing systems. The company designs advanced semiconductors and data center hardware that provide the specialized processing power needed to drive AI innovation, while delivering the efficiency required to support its broader adoption.
As part of the SoftBank Group, Graphcore is a member of an elite family of companies responsible for some of the world’s most transformative technologies.
Job Summary
We are seeking a Distinguished Engineer to lead the networking and storage architecture for a new inference serving initiative. This is a chief technologist role for the serving fabric and data path, responsible for defining and driving the end-to-end strategy for networking, storage, observability, provisioning, and automation in support of large-scale AI inference services.
You will shape core technical decisions that directly influence product capability, service differentiation, and competitive advantage. On the networking side, you will lead the design of the serving fabric, inter-partition latency path, management network, QoS and transport tuning, segmentation, observability, and automation. In terms of storage, you will define the architecture for model artifact storage, checkpoint distribution, KV and session tiering and restore, telemetry and log storage, and backup and disaster recovery.
Storage is expected to be a critical component of inference serving at scale, particularly for KV cache management, state movement, and service resiliency. You will therefore set technical direction across both networking and storage domains as first-class pillars of the platform.
This is a Grade 7 role for a recognized expert and thought leader who can convert strategic thinking into tangible group-level impact, lead a small team, and have influence across functions and external partners.
The Team
You will be in the System Engineering group and work across organizational boundaries with ML software, applied AI, hardware and systems, inference service teams, and other platform and infrastructure groups. You will also engage closely with external partners responsible for key elements of the inference service stack, as well as business counterparts who depend on differentiated service capabilities, reliability, and scale.
This role requires strong technical leadership without relying solely on formal authority. You will be expected to align stakeholders, make architectural trade-offs clear, and drive execution across multiple teams while raising the technical bar for the broader organization.
Responsibilities and Duties
- Define and coordinate the networking architecture for inference serving, including serving fabric build, inter-partition latency path optimization, and management network architecture.
- Lead the strategy for QoS, transport tuning, traffic isolation, segmentation, and service differentiation to support multiple inference SLAs and workload classes.
- Drive the build of monitoring, resource prioritization, and automated management frameworks for network and storage systems at production scale.
- Define the storage architecture for model artifact repositories, checkpoint distribution, session state, telemetry and log storage, backup, and disaster recovery.
- Lead the design of KV cache storage, tiering, restore, and movement mechanisms as a core platform capability for large-scale inference serving.
- Optimize network and storage subsystems for demanding AI and HPC workloads, balancing throughput, latency, resiliency, cost, and operational simplicity.
- Work with ML software and inference service teams to develop infrastructure that supports current methods for deploying large language models, including disaggregated prefill/decode paths, continuous batching, and large-model scaling techniques.
- Guide architecture for scaling models that use tensor, pipeline, expert, and other parallelism strategies, ensuring the serving infrastructure supports efficient execution and state movement.
- Establish performance models, benchmarks, and tuning methodologies for end-to-end serving behavior, including tail latency, throughput stability, and recovery characteristics.
- Lead a small multi-functional team while providing technical direction and architectural oversight across a wider matrixed organization.
- Influence roadmap, standards, and implementation choices across internal teams and external partners.
- Act as the senior technical authority for this domain, identifying risks early, resolving complex trade-offs, and ensuring the platform evolves in line with business and product needs.
Essentials
- MS or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent practical experience.
- Significant industry experience, typically 15+ years, in large-scale systems, distributed infrastructure, or platform architecture.
- Deep expertise in networking and storage software at scale, including architecture, implementation, configuration, and performance optimization.
- Proven experience designing and operating networking and storage systems for demanding applications in AI, HPC, or large-scale cloud environments.
- Strong understanding of high-performance transport, congestion and flow control, QoS, segmentation, telemetry, and production observability.
- Strong understanding of distributed storage architectures, artifact distribution, checkpointing, caching, replication, backup, disaster recovery, and operational resilience.
- Demonstrated ability to architect low-latency, high-throughput systems where network and storage behavior materially affect application performance.
- Experience leading highly ambiguous, cross-functional technical initiatives with impact across multiple teams or product areas.
- Strong communication and influencing skills, with the ability to align senior technical and business stakeholders.
- Track record as a recognized expert who drives strategy, shapes technical direction, and delivers solutions beyond existing precedents.
Desirable
- Familiarity with innovative LLM serving techniques and infrastructure requirements.
- Experience with prefill/decode disaggregated inference, continuous batching, and differentiated inference services with multiple SLA and QoS tiers.
- Understanding of model scaling and serving approaches involving tensor, pipeline, expert, and related parallelism techniques.
- Experience with KV cache management, tiering, restore, and memory/storage trade-offs in inference systems.
- Knowledge of modern inference serving algorithms, schedulers, and system-level optimization techniques.
- Experience working with external technology partners, suppliers, or ecosystem collaborators in the delivery of complex infrastructure platforms.
- Background in production-grade automation and provisioning systems for large infrastructure estates.
Benefits
In addition to a competitive salary, Graphcore offers a competitive benefits package. We welcome people of different backgrounds and experiences; we’re committed to building an inclusive work environment that makes Graphcore a great home for everyone. We offer an equal opportunity process and understand that there are visible and invisible differences in all of us. We can provide a flexible approach to interview and encourage you to chat to us if you require any reasonable adjustments.
What We Do
At Graphcore, we’re building the future of AI compute. We’re a team of semiconductor, software and AI experts, with deep experience in creating the complete AI compute stack - from silicon and software to infrastructure at datacenter scale. As part of the SoftBank Group, backed by significant long-term investment, we are delivering key technology into the fast-growing SoftBank AI ecosystem. To meet the vast and exciting AI opportunity, Graphcore is expanding its teams around the world. We are bringing together the brightest minds to solve the toughest problems, in a place where everyone has the opportunity to make an impact on the company, our products and the future of artificial intelligence.
Why Work With Us
Our team is at the forefront of the machine intelligence revolution, enabling innovators from all industries to build AI-native products to expand human potential. What we do at Graphcore really makes a difference.
Hybrid Workspace
Employees engage in a combination of remote and on-site work.
At Graphcore, we value wellbeing and flexibility to support a healthy work/life balance. Our hybrid approach encourages office-based colleagues to work onsite three days a week, with flexibility built on trust and transparency for everyone.





