Site Reliability Engineer (SRE)

Boulder, CO
In-Office
135K-180K Annually
Mid level
Artificial Intelligence • Software
The Role
The Site Reliability Engineer will manage infrastructure, drive enterprise deployments, and ensure the reliability of the Freeplay platform by working closely with customers and optimizing cloud architectures.

The Opportunity

We're hiring an experienced Site Reliability Engineer to own the reliability of the Freeplay platform and drive success for our most advanced enterprise customers. In this role, you will bridge the gap between core infrastructure engineering and high-stakes customer deployments. You won’t just be maintaining our internal SaaS environment; you will be the technical expert guiding Fortune 100 engineering teams as they deploy Freeplay into their own private clouds.

This is an exciting chance to join a fast-growing startup with a front-row seat to how AI products are being built at some of the largest and most innovative companies in the world. You’ll be hands-on with customers, learning about cutting-edge AI architectures while ensuring our platform runs flawlessly in their diverse and complex environments.

What's Freeplay?

Freeplay is the end-to-end platform for software teams to ship great AI products. We give product development teams the power to test, evaluate, monitor & optimize AI in production. Our customers use Freeplay to build better LLM features, chatbots, and agents. Today we serve leading software companies from growing startups to Fortune 100 companies.

Your Mission

Build the infrastructure that powers Freeplay and ensure successful deployments for our enterprise customers.

  • Partner with Enterprise Customers: Act as a key technical contact for our "Bring Your Own Cloud" (BYOC) deployments. You will jump on calls with customer engineering teams to guide them through installation, debug configuration issues in their VPCs, and ensure they run Freeplay successfully.

  • Own the Multi-Cloud Architecture: Help manage and improve our internal production infrastructure across AWS, GCP, and Azure, ensuring high availability and seamless networking.

  • Solve the "Shipped Software" Challenge: Drive the engineering efforts to package and distribute Freeplay using tools like Helm, Replicated, and KOTS. You will help ensure our software is portable, installing as reliably in a customer's cloud environment as it does in our SaaS.

  • Master Infrastructure as Code: Drive our Terraform strategy, building modular, reusable, and secure infrastructure definitions that treat operations with the same rigor as application code.

  • Champion Observability: Implement and tune our monitoring stack (Datadog) to provide deep visibility into system health, and help customers implement similar observability for their private instances.

  • Scale Data & Messaging: Manage the stateful components of our stack, including PostgreSQL, Elasticsearch, and NATS JetStream, ensuring data integrity and performance under load.

About You

  • Experience: We are open to candidates ranging from Mid-Level (3+ years) to Senior/Staff (7+ years). We will tailor the scope and responsibilities to your expertise.

  • Customer-facing confidence. You are comfortable interacting directly with external engineering teams. You can troubleshoot a failed deployment while on a Zoom call with a client and explain complex architectural requirements clearly.

  • Production Kubernetes fluency. You are confident managing EKS/GKE/AKS clusters, debugging complex pod failures, managing ingress controllers, and handling autoscaling in production.

  • Deep Terraform expertise. You have experience structuring IaC for scale and have managed multi-environment setups.

  • Database operational experience. You aren't just an infrastructure plumber; you understand how to manage and tune databases (Postgres) and search indices (Elasticsearch) at scale.

  • Security-first thinking. You are familiar with cloud security best practices, including VPC networking, IAM/Workload Identity, and secrets management, and you can explain these concepts to security-conscious enterprise clients.

Bonus Points

  • Experience in a Solutions Engineering or Field Engineering capacity.

  • Experience with Replicated / KOTS or similar tools for packaging enterprise software for on-premise/VPC deployments.

  • Experience operating message queues like NATS JetStream or Kafka.

  • Background in AI/ML infrastructure or high-throughput data systems.

Compensation & Benefits

  • Competitive salary commensurate with experience, plus equity package.

  • Medical, dental, and vision insurance.

  • Premium hardware setup (MacBook, monitor, peripherals).

  • Four weeks of Paid Time Off per year (and we encourage you to take it!).

Location

We prefer candidates able to work full-time on-site in Boulder, CO, but we're open to exceptional remote candidates who can visit Boulder every 6 weeks for team collaboration.

Top Skills

AWS
Azure
Datadog
Elasticsearch
GCP
Helm
KOTS
NATS JetStream
Postgres
Replicated
Terraform

The Company
HQ: Boulder, Colorado
14 Employees
Year Founded: 2022

What We Do

A better way to build with LLMs. Bridge the gap between domain experts & developers. Prompt engineering, testing & evaluation tools for your whole team.

Now in private beta.
