Senior MLOps Engineer

Hiring Remotely in Brazil
Remote
Senior level
Fintech • Financial Services
The Role
Lead production ML operations: build repeatable deployment pipelines, harden and own low-latency ML serving APIs, implement monitoring/alerting/drift detection, ensure reliability and security, instrument models and agents, and mentor engineers while shaping the ML infrastructure roadmap.
About the Role

We're hiring a Senior MLOps Engineer to be the data team's owner of production ML operations. You'll build the pipelines that take models from prototype to production, own the low-latency serving API behind our Next Best Action (NBA) engine, and stand up the monitoring, alerting, and reliability layer that keeps NBA models — and the LLM agents that consume them — healthy in production. This is a builder's role at a builder's moment: NBA is going live, the production ML platform is being shaped now, and you'll define how Clutch ships and operates AI for years to come. When there isn't active MLOps work, you'll also contribute to data engineering and machine learning work across the team.

About the Team

The data team today is five people: one data scientist, two data engineers, one data analyst, and one product manager. We're small, ambitious, and shipping fast: ML models are heading to production, a serving API is being built, and AI agents are in active development. You'll be the senior MLOps voice inside the team and the operational bridge to HAL, the platform team that runs Clutch's agent runtime. Expect tight feedback loops, real autonomy, and a team that values pragmatism over purity.

What You'll Do

Within 3 months, you will:

  • Take ownership of the ML serving API that serves NBA recommendations, partnering with the data engineer who's been building it, and harden it for low-latency production traffic

  • Build the first repeatable deployment pipeline: model artifact → versioned, deployable, rollback-able production service, with infrastructure defined as code

  • Stand up the monitoring foundation: latency/error/drift dashboards, alerting, and audit/trace visibility across models and agents

  • Build a working relationship with HAL and become the data team's go-to on ML serving and reliability decisions

Within 6 months, you will:

  • Be the primary owner (with data engineer support) of the ML serving platform and deployment pipelines for NBA and our ML models

  • Have at least one production model and one production agent fully instrumented — versioning, monitoring, alerting, and multi-tenant gating in place

  • Define the data team's playbook for shipping a new ML model to production, end-to-end

  • Drive architectural decisions across APIs, processing pipelines, distributed compute, storage, search, observability, cloud infrastructure, and model-serving workflows

  • Mentor the data engineers on MLOps patterns so they can confidently support and extend the systems you own

Within 9 months, you will:

  • Operate as the technical lead within the data team for NBA production ML operations — the person other teams come to when they want to understand how Clutch ships and runs ML reliably

  • Have measurably reduced serving cost and latency

  • Be shaping the data team's roadmap for the next generation of ML infrastructure, in partnership with the PM and data scientist

  • Help us decide which roles to hire next as the team scales

What You'll Bring

Required

  • 8+ years of experience in software, data, or ML engineering, with 4–5+ years running ML systems in production — you've taken models from prototype to production and own what happens after deploy

  • Strong Python. Most of the work (serving API, pipelines, tooling, data pipelines) is in Python, and you're comfortable in production codebases, not just notebooks. Some TypeScript is involved for integration with our agent runtime; you don't need to be an expert, and comfort with a second language is enough

  • CI/CD & deployment discipline. You build training and deploy pipelines that take a model artifact to a versioned, deployable, rollback-able production service, with automated testing and reproducible builds. You've implemented CI/CD for ML and maintained those pipelines yourself (GitHub Actions, Bamboo, GitLab CI, or similar)

  • Infrastructure as code. You manage cloud infrastructure (AWS Lambda, ECS) with Terraform or equivalent — no click-ops, everything reviewable and reproducible

  • Monitoring & observability discipline. You instrument serving systems for latency, error rates, drift, and cost; you read audit rows and distributed traces; you set up alerting so regressions are caught before users feel them. You treat monitoring as a first-class deliverable, not an afterthought

  • Reliability rigor. You design for failure: structured error handling, graceful degradation, rollback paths, and runbooks. You have a story about a production incident you handled and how you hardened the system afterward

  • Experience building and operating low-latency production APIs (FastAPI, BentoML, or equivalent), with opinions on serving, batching, and caching

  • Comfortable in AWS (Lambda especially), containers (Docker), and GitHub-based workflows

  • Security & governance. You ensure security and governance across systems: IAM, KMS, access policies, and Secrets Manager/SSM

  • DevOps / infrastructure knowledge, plus data manipulation and feature engineering

  • Solid understanding of ML concepts: models, pipelines, metrics, and supervised/unsupervised learning

  • Experience integrating AI/ML services with the company's other systems and optimizing them

  • You use AI tooling actively in your engineering workflow — not as a novelty, but as a default. You'll be expected to demonstrate this during the technical evaluation

  • Experience with Databricks and PySpark

Desired

  • Production agent observability: reading audit rows, distributed traces, per-tool latency and error metrics

  • Cost and latency tradeoff intuition in production ML/agent systems: you've measurably reduced per-inference or per-conversation cost, or P95 latency, on a live system

  • Familiarity with an agent runtime framework (Vercel AI SDK, LangChain, LlamaIndex, or equivalent) from a serving/operations angle

  • Multi-tenant agent gating experience

  • Agentic AI operations experience: Agent Ops, LLM Ops

  • Prior SaaS and/or FinTech experience

What’s In It For You?

  • Remote Flexibility: Enjoy the freedom of remote work from anywhere, balancing life and career seamlessly.

  • Unforgettable Off-Sites: Twice a year, bond with colleagues in exciting destinations, fostering teamwork and fresh ideas.

  • Paid Time Off and National Holidays: Enjoy 20 PTO days per year plus national holidays for relaxation and rejuvenation.

  • Stock Options: Joining us means having a stake in our success, so you'll receive stock options as part of your compensation package.

  • Home Office Setup: Create your ideal workspace with a dedicated budget for home office essentials.

  • Work Trip Budget: Grow personally and professionally with a budget for work-related trips and co-working.

About Us

Clutch is a revolutionary vertical SaaS company, proudly backed by Andreessen Horowitz (a16z), aimed at transforming the way credit unions engage with their members and change their lives. As a champion of financial well-being, we address the urgent need for affordable lending solutions in an era where the average American grapples with over $155,000 in household debt. Unlike traditional financial institutions, Clutch develops software that turns credit unions into FinTech lenders and lets them leverage their balance sheets to responsibly lend to over 130M Americans. Our mission extends beyond mere financial transactions; we strive to fundamentally enhance the way credit unions interact with their members. By integrating cutting-edge technologies and user-centric designs, we help credit unions provide seamless digital experiences that are on par with leading tech companies. This approach not only preserves but revitalizes the longstanding tradition of community and member-focused service inherent to credit unions.

Please note: This position is offered on a contractor basis. Applicants must have the necessary documentation and authorization to work in the country where the job is located. Clutch cannot provide sponsorship or assist with obtaining work permits for this role.


The Company
HQ: San Francisco, California
92 Employees
Year Founded: 2020

What We Do

We are the credit union's one-stop shop for digital deposit account opening, frictionless loan applications (POS), perpetual pre-approval, and recapture efforts. The platform successfully drives direct member, loan, and deposit growth; boosts profitability; increases share of wallet; and reduces the cost to produce loans through efficiency gains. Our founders enjoy nothing more than speaking to and learning from credit union leaders. Schedule a time to meet with Chris and Nicky and share your pains, goals, and ambitions.
