AI Infrastructure Engineering (Cloud, DevOps)

Posted 2 Days Ago
San Francisco, CA, USA
In-Office
150K-300K Annually
Senior level
Artificial Intelligence • Software • Cybersecurity • Generative AI
The Role
Own and operate Virtue AI's production AI infrastructure: design deployment workflows and IaC, package containerized services, build CI/CD (GitHub Actions), serve and optimize LLM inference pipelines, implement observability and secure networking, and collaborate with product teams to ensure reliable, scalable production systems.
Summary Generated by Built In

Location: San Francisco, CA (Onsite | Remote)

About Virtue AI

Virtue AI sets the standard for advanced AI security platforms. Built on decades of foundational and award-winning research in AI security, its AI-native architecture unifies automated red-teaming, real-time multimodal guardrails, and systematic governance for enterprise apps and agents. Deploy in minutes—across any environment—to keep your AI protected and compliant. We are a well-funded, early-stage startup founded by industry veterans, and we're looking for passionate builders to join our core team.

What You’ll Do

As an AI Infrastructure Engineer, you will own the reliability, scaling, automation, and operational discipline of Virtue AI’s production AI systems, with a focus on deployment and model-serving performance.

You will:

  • Design and maintain deployment workflows for Virtue AI on major cloud providers (e.g., AWS and GCP).

  • Own IaC (Terraform / Pulumi) for repeatable, auditable customer deployments.

  • Package our services into secure, customer-ready deployment units (Docker, Helm, Marketplace images).

  • Design, build, and maintain product CI/CD pipelines using GitHub Actions.

  • Serve and optimize the LLM inference pipeline, including the necessary inference APIs, routers, and auto-scaling.

  • Design production-grade system observability (metrics, logs, alerts, dashboards) using tools like Datadog, Grafana, and Prometheus.

  • Implement secure networking (VPCs, IAM, service accounts, private endpoints, firewalling).

  • Collaborate with product developers to align infrastructure and inference behavior with product requirements.

Required Qualifications

  • Bachelor’s degree or higher in CS, CE, EE, or related field.

  • Strong experience deploying production systems on major cloud platforms, e.g., AWS and/or GCP.

  • Deep hands-on experience with Docker, containerized workloads, and Kubernetes (EKS, GKE, or equivalent).

  • Strong experience serving LLMs and embedding models in production.

  • Strong hands-on experience with CI/CD (GitHub Actions required) and repository management (monorepos, release branches, tagging, rollbacks).

Preferred Qualifications

  • Experience with SGLang, vLLM, or similar inference frameworks.

  • Strong understanding of GPU behavior (memory limits, batching, fragmentation, utilization) and experience with GPU-level optimization.

  • Experience with model-level inference optimization (quantization, KV-cache optimization, speculative decoding, or batching strategies) and inference kernels.

  • Startup experience: you move fast, take ownership, and fix things properly.

Why Join Virtue AI

  • Competitive salary + equity

  • High ownership – You define how production runs

  • Real impact – Your work directly affects customers and revenue

  • Hard problems – Distributed systems, GPUs, scale, security

  • Strong technical peers – Engineers who ship and debug, not just design

Location: San Francisco, CA (Onsite | Remote)

What You’ll Do

As a DevOps Engineer, you will own the reliability, automation, and operational discipline of Virtue AI’s production systems. When something breaks, you fix it. When it doesn’t scale, you redesign it.

You will:

  • Design, build, and maintain CI/CD pipelines using GitHub Actions

  • Own repo structure, branching strategy, release workflows, and versioning

  • Build and operate Kubernetes infrastructure on GKE

  • Package, deploy, and optimize services using Docker

  • Design production-grade system observability

    • Metrics, logs, alerts, dashboards

    • Datadog, Grafana, Prometheus

  • Monitor and improve service reliability, latency, and uptime

  • Debug real production issues across infra, networking, containers, and code

  • Partner with backend, ML, and platform engineers to remove operational bottlenecks

What Makes You a Great Fit

You don’t just “set up pipelines.” You understand why systems fail, and you design so they don’t fail the same way twice.

Required Qualifications

  • Bachelor’s degree or equivalent practical experience

  • Strong hands-on experience with:

    • CI/CD (GitHub Actions required)

    • Repository management (monorepos, release branches, tagging, rollbacks)

  • Deep experience with:

    • Kubernetes

    • Docker

  • Experience designing and operating observability systems

    • Datadog and/or Grafana in production

  • Strong understanding of system design

    • Availability, scalability, fault isolation

  • Proven ability to solve real production problems, not just configure tools

  • Comfortable working directly on production systems

Preferred Qualifications

  • Experience operating ML / LLM inference systems

  • Experience with GPU workloads and resource scheduling

  • Experience supporting enterprise customers with SLAs

  • Familiarity with infrastructure-as-code (Terraform / Pulumi)

  • Startup experience: you move fast, take ownership, and clean up after yourself

Skills Required

  • Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or related field
  • Production deployments on major cloud platforms (AWS and/or GCP)
  • Docker and containerized workloads
  • Kubernetes (EKS, GKE, or equivalent)
  • Serving LLMs and embedding models in production
  • CI/CD pipelines using GitHub Actions
  • Repository management (monorepos, release branches, tagging, rollbacks)
  • Designing and operating observability (metrics, logs, alerts, dashboards) with Datadog, Grafana, Prometheus
  • Comfortable debugging and working directly on production systems; strong system design for availability and scalability
  • Experience with IaC (Terraform / Pulumi)
  • Experience with SGLang, vLLM, or similar inference frameworks
  • GPU behavior and GPU-level optimization experience (memory, batching, utilization)
  • Model-level inference optimization (quantization, KV-cache optimization, speculative decoding, batching)
  • Startup experience and ownership mentality
  • Experience supporting enterprise customers and SLAs

The Company
44 Employees

What We Do

Virtue AI is an AI-native security, compliance, and governance platform dedicated to building scalable, trustworthy, and responsible AI systems. Its architecture unifies automated red-teaming, real-time multimodal guardrails, and policy-driven governance to safeguard enterprise agents, models, and apps across text, code, image, video, and audio in over 100 languages, detecting risks in under 10 ms.
