Guild.ai

Engineer, Production Engineering

San Francisco, CA, USA

In-Office

140K-320K Annually

Senior level

Artificial Intelligence • Software • Automation

The Role

Own production infrastructure, security, and compliance for an AI-agent platform: manage Kubernetes on GCP, customer VPC deployments across clouds, observability, SOC2 readiness, pentest/bug-bounty coordination, IT identity, and automated CI/CD and progressive delivery to ensure secure, reliable production at scale.

Summary Generated by Built In

Engineer — Production Engineering

Location: San Francisco Bay Area (Hybrid/Onsite)
Type: Full-time
Stage: Early-stage startup

About the Role

We are building the control plane for AI agents in teams and companies.

As a Production Engineer, you will own the infrastructure, security, and compliance systems that allow our platform to ship fast and run reliably at scale. This is not a traditional ops role — you will write real code, contribute directly to the product, and own the full security and compliance surface of an early-stage company.

You'll work across Kubernetes infrastructure, cloud delivery, agent sandboxing, SOC2 compliance, IT systems, and production observability — and you'll contribute to the product itself, building security-sensitive features and auditing application code for vulnerabilities.

If you want to own the production backbone for the agent-native era — from a Terraform module to a pentest to an API key implementation — we want to talk.

What You'll Own

1. Cloud & Kubernetes Infrastructure

Our Stack: Manage and evolve our production and staging infrastructure on GCP (GKE) using Terraform. Own DNS, networking, and environment configuration end-to-end.
Customer Environments: Deploy and operate within customer VPCs across AWS, Azure, and GCP — adapting to varied infrastructure constraints, security requirements, and enterprise networking configurations.
Agent Sandboxing: Build and maintain Kubernetes-based sandboxing for agent execution — ensuring agents operate within strict network boundaries and must route through our API gateway rather than having unfettered internet access.
Observability: Own our observability stack, including OpenTelemetry instrumentation and integrations with New Relic and Splunk, to give the team deep visibility into system performance and agent runtime behavior.

2. Security, Compliance & IT

SOC2 & Audits: Lead infrastructure and operational work to support SOC2 compliance, including audit preparation, evidence collection, and control implementation.
Penetration Testing & Bug Bounty: Manage our HackerOne engagement — coordinating pentests, triaging incoming bug bounty reports, and driving remediation.
Product Security: Audit application code for security vulnerabilities, contribute security-sensitive product features (e.g., API key management), and ensure product and infrastructure security are coherent end-to-end.
IT & Identity: Own our IT stack — Okta, device management, and access controls — keeping the company secure as we scale.

3. CI/CD & Progressive Delivery

Deployment Pipelines: Design and maintain safe, automated CI/CD workflows supporting rollout strategies like canary and blue-green deployments.
Release Velocity: Make shipping to production a routine, boring, highly automated non-event.

What We're Looking For

Strong Fit

Experience: 5+ years in Production Engineering, Platform Engineering, or a security-focused infrastructure role, ideally at a fast-growing startup or SaaS company.
Our Stack: Strong hands-on experience with Kubernetes and GCP in production; comfortable with Terraform for managing real infrastructure.
Code over Click: Strong programming skills (Python, Go, TypeScript, etc.) with a passion for automating away toil.
Security Depth: Hands-on experience with compliance frameworks (SOC2), vulnerability management, and secure system design.

Bonus Points

Background with multi-tenant SaaS or enterprise security and procurement requirements.
Exposure to AI/ML infrastructure, particularly agent runtimes.
Experience building security-sensitive product features alongside infrastructure work.
Experience supporting pentests and bug bounties.
Experience deploying and operating in customer VPCs or other external cloud environments across AWS, Azure, and/or GCP — navigating enterprise networking, security, and access constraints.

Why This Role is Unique

Broad Ownership: You'll own the full security and compliance surface of an early-stage company — from SOC2 to sandboxed agent execution to IT — while also contributing directly to the product.
Agent Infrastructure: You'll design infrastructure for autonomous AI agents, not just traditional web services — introducing unique sandboxing, observability, and security challenges.
Our Infra and Theirs: You'll operate across both our own production environment and customer cloud environments, requiring you to be fluent across AWS, Azure, and GCP.
High Autonomy: As an early hire, you'll have a seat at the table to choose the tools and define the architecture that carries us to scale.

Who Thrives Here

Engineers who are as comfortable reading application code for vulnerabilities as they are writing a Terraform module.
People who enjoy owning the full security and compliance surface, not just one layer of it.
Builders who can navigate the constraints of customer enterprise environments without losing velocity.
Those who are energized — not overwhelmed — by the breadth of an early-stage technical operations role.

Skills Required

5+ years in Production Engineering, Platform Engineering, or a security-focused infrastructure role
Kubernetes (GKE) production experience
GCP production experience
Terraform infrastructure as code experience
Strong programming skills (Python, Go, TypeScript or similar)
Hands-on SOC2/compliance experience and audit support
Vulnerability management and secure system design experience
Observability experience (OpenTelemetry, New Relic, Splunk)
CI/CD and progressive delivery experience (canary, blue-green)
Experience deploying/operating in customer VPCs across AWS, Azure, or GCP
Penetration testing / bug bounty program experience (HackerOne)
Experience with IT identity and device management (Okta)

The Company

HQ: Denver, CO

25 Employees

Year Founded: 2025

What We Do

Guild turns agents into shared production infrastructure, with a managed software center for trusted agent capabilities, and an agent hub for discovering and sharing agents.

For Enterprises. AI, Trusted in Production

Autonomous software requires the same guardrails as any production system. Guild enforces centralized identity, least-privilege access, and immutable audit logging so enterprise governance extends to AI agents. Agents can act on code, tickets, and operational workflows without bypassing identity controls or becoming a black box.

For Developers. AI, Built Like Real Software

Guild gives developers the primitives they expect: typed interfaces, versioned releases, safe execution boundaries, and full execution traces, so agents behave like systems, not scripts. The Agent Hub is a public GitHub-like platform for broad discovery and reuse of agents, allowing developers to build agents like real software and ship them as products.

One Platform. Any Model

Universal by design. Guild is neutral toward models, vendors, and frameworks, doesn’t lock governance into a single stack, and works with Anthropic, OpenAI, Google, and open-source models. Companies can run agents via chat, APIs, webhooks, and schedules, as well as publish trusted capabilities to version, reuse, and improve, so teams don't start from zero. Access can be controlled centrally, and usage tracked by workspace, user, agent, and trigger.