The next generation of applications will be long-running, integration-heavy, and increasingly autonomous.
As systems become more agentic and failure-prone, durable execution transforms reliability from an application concern into a platform capability.
Restate sits at the center of that shift.
Restate (restate.dev) turns AI agents, workflows, and backend services into durable processes, allowing developers to focus on business logic rather than retries, state recovery, and failure handling.
We're looking for a Senior to Staff-level Cloud Infrastructure Engineer to own the infrastructure platform that powers Restate across open source, self-hosted deployments, multi-tenant SaaS, and Bring Your Own Cloud environments.
This is a high-ownership role at the intersection of cloud infrastructure, distributed systems, and platform engineering.
What you'll own
The infrastructure and control plane powering Restate Cloud.
The systems enabling BYOC and self-hosted deployments.
Reliability, observability, and operational excellence across the fleet.
Automation and tooling that allow the platform to scale efficiently.
Production operations and participation in the cloud on-call rotation.
You will operate across networking, storage, Kubernetes, cloud APIs, observability, and infrastructure automation while owning systems end-to-end from design through production.
Why this role is interesting
Durable execution is becoming foundational infrastructure
As AI systems become increasingly long-running and stateful, durable runtimes are emerging as a core infrastructure primitive.
You'll help build the platform enabling that transition.
Build infrastructure from first principles
Restate reimagines durable execution as a lightweight, self-contained runtime:
Single Rust binary deployment
Custom storage layer
Low-latency orchestration engine
Integrated observability
No external database dependency
Restate powers workloads running inside Fortune 500 enterprises, financial institutions, and AI-native startups building production-grade agents and workflows.
The systems you build operate in environments where correctness and operational simplicity are mission-critical.
Work with exceptional engineers
You'll work directly with engineers who built foundational distributed systems at global scale, including creators of Apache Flink and leaders from Meta's messaging infrastructure teams.
What we're looking for
Must have
Experience operating large-scale SaaS or platform infrastructure in production.
Deep understanding of cloud infrastructure and Kubernetes-based systems.
Experience with infrastructure-as-code and cloud automation.
Strong software engineering skills in Rust, Go, or C++.
Comfort owning systems from design through operations.
Ability to thrive in ambiguity and operate with significant autonomy.
Nice to have
Kubernetes operator development.
Cluster API, Crossplane, or Terraform experience.
Experience with multi-tenant control planes.
Experience operating infrastructure in enterprise or compliance-sensitive environments.
Familiarity with durable execution systems or workflow runtimes.
This role may not be a fit if
You prefer working on runtime internals rather than cloud infrastructure.
You prefer architecture and review work over hands-on ownership.
You are uncomfortable operating production infrastructure or participating in on-call rotations.
Fully remote within the United States.
East Coast candidates are preferred to improve on-call coverage.
Minimal travel requirements.
Skills Required
- Strong cloud infrastructure background with deep understanding of major cloud provider architectures
- Experience with infrastructure-as-code and cloud orchestration, particularly Kubernetes-based stateful workloads
- Software engineering skills in a systems language (Rust, Go, C++)
- Willingness and ability to learn Rust on the job
- Prior experience operating production SaaS or platform infrastructure
- Comfortable taking ownership end-to-end from design through production operations; hands-on mentality
- Willingness to participate in the cloud on-call rotation
- US-based (fully remote)
- Prior experience with Restate or durable execution specifically
- Deep enterprise procurement/compliance navigation
- Kubernetes operator development and IaC systems like Cluster API, Crossplane, or Terraform
What We Do
Restate provides a lightweight runtime that enables developers to build innately resilient distributed applications. By turning AI agents, workflows, and backend services into durable processes, Restate removes the complexity of managing failure mechanics, allowing developers to focus on their business logic rather than the underlying infrastructure's resilience.