Senior Network & Site Reliability Engineer

Posted 4 Days Ago
San Francisco, CA, USA
In-Office
210K-240K Annually
Senior level
Artificial Intelligence • Marketing Tech • Software • Big Data Analytics
The Role
Design, operate, and automate global network and reliability infrastructure for large-scale ML workloads and a private supercomputer. Own device configuration management, protocols (BGP, VPNs, WAN), datacenter fabrics, monitoring/SLOs, incident response, security/compliance, and cross-team reliability improvements.
Summary Generated by Built In

About Us

Alembic is the pioneering Causal AI platform. We help the world's largest enterprises move past correlation to prove what actually drives business outcomes, the question marketing and growth teams have never been able to answer with confidence. Fortune 100 companies including NVIDIA, Delta Air Lines, and Mars use Alembic to make multimillion-dollar decisions on trusted, causal evidence.

We're backed by a $145M Series B from WndrCo (founded by Jeffrey Katzenberg), Jensen Huang, Joe Montana, Prysm Capital, and Accenture. Our models run on our own NVIDIA DGX SuperPOD built on Grace Blackwell infrastructure — one of the fastest private supercomputers in the world. (We've melted GPUs getting here.)

About the Role

We're building infrastructure that has to perform under real-world scale, reliability, and security demands — and we're looking for an engineer who wants to own the foundation it runs on. This isn't a traditional "keep the lights on" role.

You'll design and operate the global network and reliability layer behind one of the world's fastest private supercomputers — the fabric powering distributed compute, ML workloads, real-time analytics, and mission-critical enterprise systems. You'll work across networking, systems, automation, observability, and reliability engineering to scale a platform where performance genuinely matters, with real influence over architecture decisions.

It's a strong fit if you like solving deep infrastructure problems, building resilient systems, automating everything repetitive, and owning architecture rather than just maintaining it.

What You'll Do
  • Architect and operate a scalable, secure network that meets high-security requirements and supports large-scale machine learning workloads.

  • Own network device configuration management end to end, ensuring consistency and reliability across the fleet.

  • Improve system and network reliability and performance through automation, observability, and proactive capacity planning.

  • Implement and manage complex network protocols and connectivity, including BGP, VPNs, WAN circuits, and external peering.

  • Build and maintain comprehensive monitoring, alerting, and incident response — SLOs, runbooks, and on-call rotations — and drive post-incident analysis and continuous improvement.

  • Ensure security, compliance, and operational readiness across our network and cloud infrastructure.

  • Partner across engineering and data science to drive a culture of performance and reliability.

What Will Help You Succeed
  • 8+ years in network or infrastructure engineering, including 5+ years in datacenter operations and/or systems and network administration.

  • A strong background in network security, architecture, design, and operations.

  • Extensive hands-on experience with network devices (firewalls, switches, load balancers) and large-scale architectures and protocols — BGP, QoS, MPLS, and IPsec VPNs.

  • Experience designing and operating modern datacenter network fabrics (spine-leaf, EVPN/VXLAN, ECMP).

  • Network automation and IaC tooling (Ansible, Terraform, Nornir, or similar), plus IPAM/DCIM platforms (NetBox, Infoblox, or similar).

  • WAN engineering — carrier circuit provisioning and external network peering.

  • Familiarity with Kubernetes networking (CNI plugins, ingress, service networking, network policy) and strong operational experience with Linux-based production infrastructure.

  • Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry).

  • Solid scripting (Python, Bash) to debug complex network and system issues and automate solutions, plus excellent cross-functional communication.

Also Helpful
  • NVIDIA networking technologies — Cumulus Linux, InfiniBand, Spectrum-X, and BlueField DPUs (this is the fabric behind our SuperPOD).

  • Familiarity with data-intensive platforms (Spark, Airflow, Kafka) and storage network protocols (NFS, LustreFS, iSCSI).

  • Security practices for applications and infrastructure, and experience in high-compliance or SOC 2 environments.

The Role Is Right for You If
  • You want to own mission-critical network and infrastructure end to end — from architecture to incident management — not just keep it running.

  • You'd rather build and automate than direct from a distance, and you want meaningful influence over how a high-performance platform scales.

Why You Might Be Excited About Alembic
  • Hard problems with real impact: You'll own the network and reliability layer behind systems that influence multimillion-dollar decisions at Fortune 100 companies.

  • Cutting-edge technology: Operate our own NVIDIA DGX SuperPOD on Grace Blackwell — one of the fastest private supercomputers in the world — and run a fabric (InfiniBand, Spectrum-X, BlueField) almost no company has in-house.

  • Technical autonomy: Ownership over architecture decisions and the freedom to solve hard infrastructure problems your way.

  • Elite team: Join top engineers who thrive on hard problems and high-impact work.

  • Series B momentum, real ownership: Meaningful equity at a Series B company that's raised $145M, with proven product-market fit and Fortune 100 traction.

Why You Might Not Be Excited
  • You only want to tell people what to build instead of building and automating alongside them.

  • You prefer companies with a fully built-out process for every detail.

  • You prefer static over dynamic — projects and priorities adapt as we grow. We have real paying customers and a playbook, and we still move at startup speed at Series B scale.

Skills Required

  • 8+ years in network or infrastructure engineering, including 5+ years in datacenter operations or systems and network administration.
  • Strong background in network security, architecture, design, and operations.
  • Extensive hands-on experience with network devices: firewalls, switches, load balancers.
  • Experience with large-scale network protocols and architectures: BGP, QoS, MPLS, IPsec VPNs.
  • Experience designing and operating datacenter network fabrics: spine-leaf, EVPN/VXLAN, ECMP.
  • Network automation and IaC tooling (Ansible, Terraform, Nornir, or similar).
  • Experience with IPAM/DCIM platforms (NetBox, Infoblox, or similar).
  • WAN engineering experience including carrier circuit provisioning and external network peering.
  • Familiarity with Kubernetes networking (CNI plugins, ingress, service networking, network policy).
  • Strong operational experience with Linux-based production infrastructure.
  • Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
  • Solid scripting skills (Python, Bash) for debugging and automation.
  • Excellent cross-functional communication and collaboration skills.
  • NVIDIA networking technologies (Cumulus Linux, InfiniBand, Spectrum-X, BlueField DPUs).
  • Familiarity with data-intensive platforms (Spark, Airflow, Kafka).
  • Experience with storage network protocols (NFS, LustreFS, iSCSI).
  • Experience with security practices and high-compliance (SOC 2) environments.

The Company
San Francisco, California
43 Employees
Year Founded: 2018

What We Do

Alembic provides AI-powered marketing analytics for C-suite executives. Our near-real-time platform leverages cutting-edge mathematics, proprietary algorithms, and composite AI, forming an innovative solution unmatched in the marketplace. Organizations use it to model revenue outcomes, addressing the long-standing challenge of quantifying the impact of marketing on sales. Alembic is proud to work with enterprise companies, including NVIDIA, Texas A&M, and North Sails.
