Senior AI DevOps / LLMOps

Posted 6 Hours Ago
6 Locations
Remote
Senior level
Agency
The Role
Design and implement CI/CD and IaC for LLMs, automate model and dataset versioning, provision GPU/TPU infrastructure, enable safe experiment gates and progressive delivery, and build monitoring/observability and feedback loops for production LLM endpoints.
Summary Generated by Built In

At TechBiz Global, we provide recruitment services to the top clients in our portfolio. We are currently seeking a Senior AI DevOps / LLMOps specialist to join one of our clients' teams. If you're looking for an exciting opportunity to grow in an innovative environment, this could be the perfect fit for you.
 

Key Responsibilities

1. Automation of Build-to-Production

- Design and implement robust CI/CD pipelines tailored for AI, covering model weights, dataset versioning, and application code.
- Develop specialized workflows for PromptOps, ensuring that system prompts are version-controlled, tested for regressions, and deployed with the same rigor as traditional code.
- Automate the deployment of Agentic workflows, managing the complexities of stateful AI interactions and multi-agent handoffs.

2. AI Infrastructure as Code (IaC)

- Provision and manage high-performance compute environments (GPU clusters, TPU pods) using Terraform, Pulumi, or Ansible.
- Define and enforce Policy-as-Code for AI endpoints to ensure compliance with security, cost-usage limits, and data residency requirements.
- Maintain a consistent environment across Hybrid Infrastructure, ensuring seamless parity between On-Premises development and Cloud production.

3. Safe Experimentation & Controlled Releases

- Architect Progressive Delivery strategies for AI, including Canary releases, Blue-Green deployments, and Shadowing (where new models run in parallel with production to compare outputs).
- Build “Evaluation-in-the-Loop” gates within the pipeline to automatically test for bias, hallucination, and performance degradation before a release.
- Implement A/B testing frameworks specifically designed for LLM outputs and agentic behavior.

4. Monitoring & Observability

- Establish deep observability into Inference Endpoints, tracking metrics like tokens-per-second, latency, and drift in model accuracy.
- Integrate feedback loops that capture production “edge cases” to feed back into the training and fine-tuning pipelines.

Job Requirements

Must-Have Technical Skills:

- Orchestration: Advanced Kubernetes (K8s) skills, specifically with Kubeflow, Ray, or NVIDIA Triton.
- CI/CD & IaC: Expertise in GitHub Actions/GitLab CI, and Terraform or Pulumi.
- AI Tooling: Experience with Weights & Biases, MLflow, LangSmith, or Arize Phoenix.
- Hardware: Understanding of GPU virtualization, CUDA drivers, and on-premises hardware management.
- Security: Familiarity with Open Policy Agent (OPA) and secret management (Vault).
 

Experience:

- 10+ years in DevOps, SRE, or Cloud Engineering.
- 2+ years of hands-on experience in MLOps or LLMOps, specifically moving LLMs from notebook to production.
- Proven experience managing Hybrid Cloud environments (e.g., AWS/Azure + Private Data Center).

Skills Required

  • Advanced Kubernetes (K8s) skills, specifically with Kubeflow, Ray, or NVIDIA Triton
  • Expertise in CI/CD (GitHub Actions or GitLab CI) and IaC (Terraform or Pulumi)
  • Experience with AI tooling such as Weights & Biases, MLflow, LangSmith, or Arize Phoenix
  • Understanding of GPU virtualization, CUDA drivers, and on-premises hardware management
  • Familiarity with Open Policy Agent (OPA) and secret management (Vault)
  • 10+ years in DevOps, SRE, or Cloud Engineering
  • 2+ years hands-on MLOps or LLMOps experience moving models from notebook to production
  • Proven experience managing Hybrid Cloud environments (AWS/Azure + Private Data Center)
The Company
20 Employees
