Senior Infrastructure Engineer

Posted 8 Days Ago
San Francisco, CA
In-Office
200K-350K Annually
Senior level
Artificial Intelligence • Software • Database
The Role
Design, build, and operate large-scale Kubernetes clusters and custom operators (CRDs) to orchestrate thousands of persistent AI agent workloads. Drive cluster scaling, autoscaling, scheduling, storage, networking, observability, and IaC. Troubleshoot distributed infrastructure, set best practices, and partner with backend, ML, and research teams to deliver a production-grade platform for long-running scientific workloads.
About

Edison Scientific builds and commercializes AI agents for science. Scientific discovery moves too slowly, and autonomous AI agents are how we intend to fix that. We're assembling a team of top researchers and engineers across AI and biology to build an AI scientist.

Role

As a Senior Infrastructure Engineer, you'll play a key role in designing, scaling, and operating the core platform infrastructure that powers autonomous scientific discovery. Your primary focus will be agent orchestration at scale: building and managing clusters that orchestrate thousands of persistent, stateful workloads, developing custom resource definitions (CRDs) and operators, and ensuring the reliability and efficiency of our compute layer.

Our mission is to build an AI scientist, and you'll own the infrastructure foundation it runs on. AI agents performing long-running scientific research demand resilient scheduling, lifecycle management, and resource orchestration far beyond typical cloud-native workloads. This role will influence platform architecture, establish infrastructure best practices, and partner closely with backend engineers, ML engineers, and researchers to deliver a production-grade environment that lets science move faster.

At Edison Scientific, engineering at the senior level is about technical ownership and leverage: understanding how complex systems interact, making sound architectural tradeoffs, and building foundations that allow teams and science to move faster.

This role is on-site at our San Francisco office in the Dogpatch neighborhood. Our office is a converted warehouse with high ceilings, open space, and a team that genuinely believes in what they're building.

Responsibilities
  • Architect, implement, and operate Kubernetes clusters that support thousands of concurrent, persistent resources (agents, jobs, services) with high availability and efficient resource utilization.
  • Design and develop custom resource definitions (CRDs) and Kubernetes operators to model and manage domain-specific workloads such as AI agent lifecycles, research pipelines, and long-running compute tasks.
  • Drive the strategy for cluster scaling, node pool management, autoscaling policies, and resource quota frameworks to handle rapid workload growth.
  • Build and maintain infrastructure-as-code (Terraform, Pulumi, or similar) for reproducible, version-controlled environment management.
  • Design and implement robust scheduling, placement, and affinity strategies to optimize cost, performance, and fault tolerance for heterogeneous workloads (CPU, GPU, memory-intensive).
  • Establish and uphold best practices around observability, monitoring, alerting, and incident response for infrastructure systems (Prometheus, Grafana, Datadog, or similar).
  • Own storage and networking strategy within Kubernetes — including persistent volume management, CSI drivers, service mesh, network policies, and ingress architecture.
  • Troubleshoot complex, cross-system infrastructure issues and guide others through effective debugging and remediation in distributed environments.
  • Collaborate closely with backend, ML, and research teams to understand workload requirements and translate them into reliable infrastructure patterns.
Qualifications
  • 5+ years of professional infrastructure or platform engineering experience, with deep hands-on Kubernetes expertise in production environments.
  • Experience designing and implementing custom resource definitions (CRDs) and Kubernetes operators (using frameworks such as Kubebuilder, Operator SDK, or controller-runtime).
  • Track record of operating and scaling Kubernetes clusters supporting thousands of persistent or long-lived resources (stateful workloads, persistent pods, long-running jobs).
  • Deep understanding of Kubernetes internals — API server, etcd, scheduler, controller manager, kubelet — and how they behave at scale.
  • Expertise with cloud infrastructure (AWS EKS, GCP GKE, or Azure AKS) and associated networking, storage, and IAM primitives.
  • Proficiency in at least one systems or backend language for operator development and infrastructure tooling.
  • Hands-on experience with infrastructure-as-code tools (Terraform, Pulumi, or Crossplane) and GitOps workflows.
  • Strong working knowledge of container networking (CNI plugins, service mesh, network policies), storage (CSI, persistent volumes, StatefulSets), and security (RBAC, Pod Security Standards, secrets management).
  • Ability to operate autonomously, make sound technical judgments, and drive projects from concept through production.
Bonus points for:
  • Experience with data-intensive platforms, scientific computing, or ML/AI infrastructure.
  • Prior experience in startups or small teams with significant architectural ownership and ambiguity.
  • Experience scaling systems, teams, or platforms through periods of rapid growth.
Salary

$200,000 - $350,000  •  Offers Equity

Why join us?
  • Competitive salary and equity
  • Full healthcare coverage — we pay 100% of premiums for you and your dependents
  • Support for growing families, including a yearly new parent stipend and fertility coverage through Carrot
  • 401(k) company matching
  • $300 health and wellness benefit
  • Lunch is on us every day you're in the office, and dinner is on us when you're working late
  • Regular team offsites and company events
  • A fast-moving, mission-driven culture where smart people do their best work and actually enjoy doing it

Top Skills

API Server
AWS EKS
Azure AKS
CNI Plugins
Controller Manager
controller-runtime
CRDs
Crossplane
CSI Drivers
Datadog
etcd
GCP GKE
GitOps
Grafana
Kubebuilder
kubelet
Kubernetes
Operator SDK
Operators
PersistentVolumes
Pod Security Standards
Prometheus
Pulumi
RBAC
Scheduler
Secrets Management
Service Mesh
StatefulSets
Terraform

The Company
HQ: San Francisco, California
47 Employees
Year Founded: 2025

What We Do

Spun out from FutureHouse in 2025, Edison Scientific accelerates discovery and innovation across the sciences. Our platform empowers researchers to move from question to breakthrough faster than ever, automating literature synthesis, data analysis, and molecular design. At its core is Kosmos, our AI scientist, capable of running hundreds of research tasks in parallel. It transforms raw datasets into comprehensive, validated reports—compressing months of work into a single run. With Edison Scientific, scientists remain in control, using AI to amplify their expertise and accelerate discovery at unprecedented speed.
