AI Infrastructure Engineer

Hiring Remotely in Vietnam
Remote or Hybrid
Mid level
Security • Software • Generative AI
Trust, safety, and security for the GenAI era 🛡️
The Role
The AI Infrastructure Engineer will design, build, and maintain the internal AI infrastructure, ensuring production-grade reliability, observability, and seamless integration of AI agents into workflows.
Description

Alice is building its internal AI infrastructure layer from the ground up. We have real agents running in production, a growing base of employees using AI in their daily work, and a clear architectural direction. What we don't have yet is a dedicated engineer to own it.

You'll be the first. Your job is to close the gap between "working prototype" and "production platform" - owning the foundation that hosts our agents, the pipelines that ship them, and the reliability layer (observability, cost controls, audit trails, evals) that makes it safe to run AI at scale in a trust & safety company.

This is an infrastructure-first role with deep AI fluency - not a prompt engineer, not a wrapper-framework operator, not a no-code builder. You should be equally comfortable writing a Terraform module, debugging a Kubernetes pod, and tracing an agent's tool-call chain.

We don’t operate with a predefined backlog here; you will be responsible for identifying high-impact needs and bringing them to life. The perfect fit for this role has a track record of deploying agentic systems that have held up under real-world usage, balances a focus on infrastructure with a deep concern for user experience, and recognizes that the primary hurdle in AI integration is rarely the model itself.

Responsibilities:

Platform & Infrastructure

  • Architect, build, and run the AWS/Kubernetes platform that hosts Alice's internal AI agents and tools; apply the AWS Well-Architected pillars (operational excellence, security, reliability, performance efficiency, cost optimization, sustainability).
  • Own Infrastructure-as-Code: Terraform modules, standards, and reviews for Bedrock, agent runtimes, vector DBs, and supporting services.

AI Systems

  • Design and ship production-grade agents and multi-agent pipelines using the Anthropic Agent SDK, Claude Code, AWS Bedrock, and MCP — not wrapper frameworks.
  • Own the full agent lifecycle: scoping → prototyping → eval → deploy → monitor → iterate.
  • Integrate agentic workflows into internal and product systems via APIs, databases, webhooks, Slack, and email.

Reliability, Observability, Cost

  • Build first-class observability across apps and infra: OpenTelemetry, Prometheus, plus LLM-specific tracing (Langfuse or equivalent), token/cost metrics, and eval pipelines.
  • Define SLOs/SLIs and error budgets for AI services - latency, model fallback chains, eval regression gates, agent success rates. Lead incident readiness, response, and post-mortems.
  • Drive FinOps: model routing by cost, cache hit rates, batch vs. realtime tradeoffs, budget alarms, per-team chargeback visibility.
  • Implement guardrails: prompt-injection defenses, PII redaction, model allowlists, human-in-the-loop checkpoints, audit trails.
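
To make the error-budget item above concrete, here is a minimal Python sketch of how an error budget for an agent success-rate SLO could be computed. The class name, field names, and the 99% target are hypothetical, chosen only for illustration; they are not taken from Alice's actual platform.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error-budget calculator for an agent success-rate SLO."""

    slo_target: float  # e.g. 0.99 means 99% of agent runs should succeed
    total_runs: int    # agent runs observed in the SLO window
    failed_runs: int   # runs counted as failures by the SLI

    @property
    def sli(self) -> float:
        """Observed success rate (the SLI) over the window."""
        if self.total_runs == 0:
            return 1.0  # no traffic, so nothing has failed
        return 1.0 - self.failed_runs / self.total_runs

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates over the window."""
        return (1.0 - self.slo_target) * self.total_runs

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative means overspent."""
        allowed = self.allowed_failures
        if allowed == 0:
            # A 100% target (or zero traffic) leaves no room for failures.
            return 1.0 if self.failed_runs == 0 else 0.0
        return 1.0 - self.failed_runs / allowed


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.99, total_runs=10_000, failed_runs=25)
    print(f"SLI: {budget.sli:.4f}")
    print(f"Budget remaining: {budget.remaining_fraction:.0%}")
```

With a 99% target over 10,000 runs, about 100 failures are tolerated, so 25 failures leave roughly 75% of the budget. A release gate could, for example, block deploys once the remaining fraction drops below zero.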

Org Impact

  • Identify high-leverage workflows across the organization and translate them into scalable agentic automations.
  • Partner with R&D, Delivery, Security, and external vendors to deliver platform capabilities.

Requirements (must-have)

  • 3-5 years in software engineering, shipping and operating production-grade systems.
  • 2+ years hands-on AWS, Kubernetes, and Terraform in production: ownership, not just familiarity.
  • 1-2 years hands-on building and deploying LLM-powered or agentic systems in production.
  • Proficiency in Python: async patterns, REST APIs, cloud-native architecture.
  • Production experience with native agentic SDKs (Anthropic Agent SDK, Claude Code) and MCP - tool-calling patterns, server configuration, memory systems, vector DBs.
  • Hands-on AWS Bedrock for model access, IAM-based auth, and enterprise deployment patterns.
  • Production CI/CD ownership (GitHub Actions, Argo CD, or equivalent) and observability stack experience (OpenTelemetry + Prometheus, plus LLM tracing).
  • Proven ownership: design → implement → release → operate → improve, independently and within a team.
  • Strong debugging instincts across multi-step agent chains and distributed infrastructure.
  • Clear written and verbal communication in English; comfortable with internal and external stakeholders.
  • Startup mindset: move fast, own decisions end-to-end, comfortable with ambiguity.

Nice to Have

  • Background in trust & safety, content moderation, or compliance-sensitive environments.
  • FinOps experience at scale (cost attribution, budget enforcement, optimization playbooks).
  • Experience building lightweight internal dashboards or UI layers for agentic workflows.
  • LLM evaluation framework experience (Braintrust, Langfuse evals, custom harnesses).

About Alice

Alice is a trust, safety, and security company built for the AI era. We safeguard the communicative technologies people use to create, collaborate, and interact—whether with each other or with machines.

In a world where AI has fundamentally changed the nature of risk, Alice provides end-to-end coverage across the entire AI lifecycle. We support frontier model labs, enterprises, and UGC platforms with a comprehensive suite of solutions: from model hardening evaluations and pre-deployment red-teaming to runtime guardrails and ongoing drift detection.

The Company
New York, New York
413 Employees

What We Do

Alice represents the next chapter of our growth and the natural evolution of ActiveFence, our industry-leading solution for UGC safety, as we expand our mission to secure the future of AI. Advance unafraid: alice.io
