What You'll Accomplish
- Expand, mature, and optimize our ML platform, built around cutting-edge tooling like Ray, MLflow, Argo, and Kubernetes, to support both traditional and deep learning models
- Build and mature capabilities to support CPU/GPU clusters, model performance monitoring, drift detection, automated rollouts, and an improved developer experience
- Build, operate, and maintain a low-latency, high-volume ML serving layer covering both online and batch inference use cases
- Orchestrate Kubernetes and ML training / inference infrastructure exposed as an ML platform
- Expose and manage environments, interfaces, and workflows to enable ML engineers to develop, build, and test ML models and services
- Close the latency gap on model inference to enable online, real-time model serving
- Develop automation workflows to improve team efficiency and ML stability
- Analyze and improve efficiency, scalability, and stability of various system resources
- Partner with other teams and business stakeholders to deliver business initiatives
- Help onboard new team members, provide mentorship and enable successful ramp up on your team's code bases
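One of the responsibilities above is model performance monitoring and drift detection. As a hedged illustration only (this is not code from Attentive's platform; the `psi` helper and thresholds are assumptions), a minimal drift check can be sketched with the Population Stability Index, which compares a reference feature distribution against live traffic:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index: a common drift score comparing a
    reference (training-time) distribution against live values.
    Hypothetical helper for illustration, not a platform API."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        # Smooth empty buckets so the log term stays finite.
        return [(c + 1e-6) / (len(xs) + bins * 1e-6) for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [0.1 * i for i in range(100)]      # training-time feature values
shifted = [0.1 * i + 5.0 for i in range(100)]  # live values with a mean shift

assert psi(reference, reference) < 0.1   # identical data: negligible drift
assert psi(reference, shifted) > 0.25    # shifted data: strong drift signal
```

A rule of thumb in practice (PSI below 0.1 is stable, above 0.25 warrants investigation) would feed an alerting pipeline rather than hard assertions.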
About You
- You have been working in the areas of MLOps / Platform Engineering / DevOps / Infrastructure for 3+ years, and have an understanding of gold-standard practices and best-in-class tooling for ML
- Your passion is exposing platform capabilities through interfaces that enable high-performance ML practices, rather than designing ML experiments (this team does not directly develop ML models)
- You understand the key differences between online and offline ML inference and can articulate the critical elements needed to succeed with each to meet business needs
- You have experience building infrastructure for an ML platform and managing CPU and GPU compute
- You have a background in software development and are passionate about bringing that experience to bear on the world of ML infrastructure
- You have experience with Infrastructure as Code using Terraform and can’t imagine a world without it
- You understand the importance of CI/CD in building high-performing teams and have worked with tools like Jenkins, CircleCI, Argo Workflows, and ArgoCD
- You are passionate about observability and have worked with tools such as Splunk, Nagios, Sensu, Datadog, and New Relic
- You are very familiar with containers and container orchestration and have direct experience with vanilla Docker as well as Kubernetes as both a user and as an administrator
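The requirements above distinguish online from offline inference. As a toy sketch of that distinction (the linear model, weights, and function names are illustrative assumptions, not part of the role or the posting): the online path scores one request synchronously, where per-call latency dominates, while the batch path scores many rows in one pass, where throughput dominates and setup cost is amortized across the batch.

```python
# Toy stand-in for a trained model; weights are arbitrary.
WEIGHTS = [0.25, 0.75]

def predict_one(features):
    """Online path: score a single request synchronously.
    Per-call latency matters, so the work per request is minimal."""
    return sum(w * x for w, x in zip(WEIGHTS, features))

def predict_batch(rows):
    """Offline/batch path: score many rows in one pass.
    Throughput matters more than per-row latency, so a real runtime
    would vectorize this and amortize model-loading cost."""
    return [predict_one(r) for r in rows]

# Online: one request, one answer, right now.
assert predict_one([1.0, 1.0]) == 1.0

# Batch: a whole table of rows scored in one pass.
assert predict_batch([[1.0, 1.0], [2.0, 0.0]]) == [1.0, 0.5]
```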
What We Use
- Our infrastructure runs primarily in Kubernetes hosted in AWS’s EKS
- Infrastructure tooling includes Istio, Datadog, Terraform, CloudFlare, and Helm
- Our backend is Java / Spring Boot microservices, built with Gradle, coupled with things like DynamoDB, Kinesis, Airflow, Postgres, PlanetScale, and Redis, hosted via AWS
- Our frontend is built with React and TypeScript, and uses best practices like GraphQL, Storybook, Radix UI, Vite, esbuild, and Playwright
- Our automation is driven by custom and open source machine learning models, lots of data and built with Python, Metaflow, HuggingFace 🤗, PyTorch, TensorFlow, and Pandas
What We Do
Attentive® is the AI marketing platform for leading brands, designed to optimize message performance through 1:1 SMS and email interactions. Infusing intelligence at every stage of the consumer’s purchasing journey, Attentive empowers businesses to achieve hyper-personalized communication with their customers on a large scale. Leveraging AI-powered tools, a mobile-first approach, two-way conversations, and enterprise-grade technology, Attentive drives billions in online revenue for brands around the globe. Trusted by over 8,000 leading brands such as CB2, Urban Outfitters, GUESS, Dickey’s Barbeque Pit, and Wyndham Resort, Attentive is the go-to solution for delivering powerful commerce experiences for consumers with the brands they love.
To learn more about Attentive or to request a demo, visit www.attentive.com or follow us on LinkedIn, X (formerly Twitter), or Instagram.
Why Work With Us
At Attentive, you'll connect with inspiring, high-caliber people, and be encouraged to take risks, get creative, and think bigger. We're solving big problems for our customers, through our innovative AI solutions, giving employees the opportunity to thrive along the journey. The sky's the limit when it comes to what's possible.