Site Reliability Engineer

Emeryville, CA, USA
Hybrid
Entry level
Artificial Intelligence • Machine Learning • Biotech • Generative AI
The Role
The Site Reliability Engineer will manage digital infrastructure, ensuring access to compute resources, automating processes, and maintaining resource visibility for researchers.
Summary Generated by Built In
About Astera

Astera is a private foundation with a $2.5B endowment on a mission to steer science and technology toward an abundant future for all. Unlike traditional foundations, we operate like a high-velocity startup with unprecedented access to computational resources and complete freedom from funding pressures or profit motives. This allows us to focus on ambitious goals and attract incredibly creative scientists and engineers from leading academic institutions and from frontier AI labs.

Neuro-AI is our large-scale AI research program, pursuing a neuroscience-informed approach to engineering AGI. This is not yet another lab scaling LLMs in the hope of achieving general intelligence. We are integrating neuroscience, AI, and bioengineering to understand and digitally model the architecture of the human brain.

Position Summary

We are looking for a Site Reliability Engineer to own the digital infrastructure that powers our research.

This includes compute resources that we rent from third parties, container registries, and dashboards. The main objective is to make sharing these resources easy and efficient, ensuring the infrastructure is reliable and accessible to the right people.

This role spans a broad spectrum of activities:

  • Compute Access: Ensure easy and efficient access to compute resources for our researchers.

  • Resource Visibility: Provide clear visibility into resource utilization and cluster health.

  • Auto-Scaling: Enable automatic scaling of compute resources based on demand.

  • Access Management: Ensure the right people have access to the right resources.

  • Reproducibility: Drive towards deterministic deployments and reproducible research environments.

  • Process Automation: Automate operational processes where doing so increases efficiency.

  • Current stack: Ansible, Kubernetes, Docker, Tailscale, Python, Grafana, Prometheus, and Talos Linux. We're not religious about any of it.

Qualifications
  • Ownership: You are comfortable being the person accountable when the cluster is unhealthy or capacity is tight.

  • Systems Intuition: You understand how schedulers, containers, networking, storage, and hardware interact. You can reason about failure modes and design systems that degrade predictably.

  • Operational Rigor: You value observability, reproducibility, and clear operational boundaries. You leave systems in a state that other engineers can understand, operate, and debug without you.

  • Pragmatism: You can support experimental research workloads without forcing everything into a rigid "production" mold. You know when to stabilize and when to allow controlled chaos to speed up discovery.

Location & Visa
  • This role is in-person in Emeryville, CA.

  • Visa sponsorship may be available for qualified candidates.

Skills Required

  • Experience with Docker and Kubernetes
  • Familiarity with Ansible and Python

The Company
57 Employees
Year Founded: 2020

What We Do

Astera is a private foundation with a $2.5B endowment focused on steering science and technology toward an abundant future for all. They operate like a high-velocity startup, integrating neuroscience, AI, and bioengineering for AGI research.
