Senior Machine Learning Engineer - I (MLOps/LLMOps)

Reposted 9 Days Ago
Headquarters, AZ
In-Office
158K-185K Annually
Senior level
Software
The Role
Design and implement MLOps systems for ML and LLM services, ensure reliability, and collaborate across teams to optimize AI solutions.

As a Senior Machine Learning Engineer - MLOps/LLMOps, you will design, build, and scale production-grade infrastructure and platforms that enable the full lifecycle of ML and LLM systems. You'll architect robust pipelines for model training, evaluation, deployment, and monitoring while ensuring reliability, observability, and efficiency at scale. This role collaborates closely with ML Engineers, Data Scientists, and Product teams to operationalize AI/ML solutions from prototype to production. Remote candidates will be considered; the ability to join fellow ML staff in the office at the company HQ in Redwood City, CA, when needed is preferred.

Responsibilities
Platform & Infrastructure
  • Design and implement scalable MLOps/LLMOps platforms supporting the full ML lifecycle: data versioning, model training, evaluation, deployment, and monitoring
  • Build and maintain CI/CD pipelines for ML models and LLM applications with automated testing, validation, and rollback capabilities
  • Develop infrastructure-as-code (IaC) for reproducible, version-controlled ML environments
  • Architect model serving infrastructure with auto-scaling, A/B testing, and canary deployment capabilities
LLM Operations
  • Build platforms for LLM fine-tuning, prompt management, and experimentation at scale
  • Implement evaluation frameworks for LLM performance, quality, safety, and cost optimization
  • Design and deploy enterprise-grade AI agents and copilots with robust monitoring and guardrails
  • Establish LLM observability: token usage tracking, latency monitoring, prompt/response logging, and cost attribution
Operational Excellence
  • Own uptime, reliability, and performance of ML/LLM services (SLIs/SLOs)
  • Implement comprehensive monitoring, alerting, and incident response for ML systems
  • Participate in on-call rotations and drive post-incident reviews to improve system resilience
  • Build automation and tooling to reduce toil and accelerate ML development velocity
Collaboration & Leadership
  • Partner with ML Engineers and Data Scientists to translate research into production-ready systems
  • Collaborate with platform and infrastructure teams on cloud architecture and resource optimization
  • Mentor team members on MLOps best practices, production ML patterns, and operational excellence
  • Drive technical decisions on tooling, frameworks, and architectural patterns
Required Qualifications and Skills
  • Education: B.S./M.S./Ph.D. in Computer Science, Engineering, or related technical field
  • Experience: 4+ years of software engineering experience with 2+ years focused on MLOps/LLMOps
  • MLOps Expertise:
    • Production experience with ML model serving frameworks (e.g., TensorFlow Serving, TorchServe, Triton)
    • Hands-on experience with ML experiment tracking and model registry tools (MLflow, Weights & Biases, Kubeflow)
    • Proficiency in workflow orchestration (Airflow, Prefect, Kubeflow Pipelines, Metaflow)
  • LLMOps Expertise:
    • Experience with LLM deployment, fine-tuning, and evaluation frameworks (e.g., vLLM, LangChain, LlamaIndex)
    • Knowledge of prompt engineering, RAG architectures, and LLM application patterns
    • Familiarity with LLM observability tools (e.g., LangSmith, Arize, WhyLabs)
  • Cloud & Infrastructure:
    • Strong experience with major cloud providers (AWS, GCP, or Azure) and ML-specific services (SageMaker, Vertex AI, Azure ML, Bedrock)
    • Proficiency in containerization (Docker, Kubernetes) and infrastructure-as-code (Terraform, CloudFormation, Pulumi)
    • Experience with microservices architecture and API development (REST, gRPC)
  • Software Engineering:
    • Strong programming skills in Python, Terraform, and Helm; familiarity with Go, Java, or Rust is a plus
    • Deep understanding of CI/CD practices and tools (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
    • Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK)
  • Operational Excellence:
    • Track record of managing production systems with defined SLIs/SLOs
    • Experience with on-call rotations, incident management, and reliability engineering practices
Desired Qualifications and Skills
  • Experience building internal ML platforms or developer tooling used by multiple teams
  • Hands-on experience with distributed training frameworks (Ray, Horovod, DeepSpeed)
  • Knowledge of model optimization techniques (quantization, distillation, pruning)
  • Familiarity with feature stores (Feast, Tecton) and data versioning tools (DVC, lakeFS)
  • Understanding of ML security best practices, model governance, and compliance requirements
  • Experience with cost optimization and resource management for large-scale ML workloads
  • Contributions to open-source MLOps/LLMOps projects
  • Background in applied ML or data science with practical model development experience
About Us

Sumo Logic, Inc. helps make the digital world secure, fast, and reliable by unifying critical security and operational data through its Intelligent Operations Platform. Built to address the increasing complexity of modern cybersecurity and cloud operations challenges, we empower digital teams to move from reaction to readiness—combining agentic AI-powered SIEM and log analytics into a single platform to detect, investigate, and resolve modern challenges. Customers around the world rely on Sumo Logic for trusted insights to protect against security threats, ensure reliability, and gain powerful insights into their digital environments. For more information, visit www.sumologic.com.

Sumo Logic Privacy Policy. Employees will be responsible for complying with applicable federal privacy laws and regulations, as well as organizational policies related to data protection.

Compensation varies based on a variety of factors which include (but aren’t limited to) role level, skills and competencies, qualifications, knowledge, location, and experience. In addition to base pay, certain roles are eligible to participate in our bonus or commission plans, as well as our benefits offerings, and equity awards. 

Must be authorized to work in the United States at time of hire and for duration of employment. At this time, we are not able to offer nonimmigrant visa sponsorship for this position.

Top Skills

Airflow
AWS
Azure
CloudFormation
Datadog
Docker
ELK
GCP
Grafana
gRPC
Helm
Kubeflow
Kubernetes
LangChain
LlamaIndex
MLflow
Prefect
Prometheus
Pulumi
Python
REST
TensorFlow Serving
Terraform
TorchServe
Triton
vLLM
Weights & Biases

The Company
HQ: Redwood City, CA
913 Employees
Year Founded: 2010

What We Do

Sumo Logic is the pioneer in continuous intelligence, a new category of software, which enables organizations of all sizes to address the data challenges and opportunities presented by digital transformation, modern applications, and cloud computing. The Sumo Logic Continuous Intelligence Platform™ automates the collection, ingestion, and analysis of application, infrastructure, security, and IoT data to derive actionable insights within seconds. More than 2,100 customers around the world rely on Sumo Logic to build, run, and secure their modern applications and cloud infrastructures. Sumo Logic delivers its platform as a true, multi-tenant SaaS architecture, across multiple use-cases, enabling businesses to thrive in the Intelligence Economy.
