The Role
The Staff Engineer, Machine Learning Operations will lead technical efforts in AI platform architecture and CI/CD strategies, ensuring model reliability and governance.
Summary Generated by Built In
Job Description:
The Staff Engineer, Machine Learning Operations will provide technical leadership for our AI platform and define architecture and standards for training, evaluation, and high-scale, low-latency inference of models and AI agents. This role is responsible for developing and implementing strategy for CI/CD, governance, and reliability across multiple AI models, partnering with security, compliance, and leadership to deliver resilient, cost-effective AI. Aside from the core responsibilities, Machine Learning Operations Engineers will also have responsibilities shared with other engineering functions.
- Establish the technical vision for end-to-end ML-AIOps (from data to model/agent to product integration).
- Design and evolve multi-region, multi-tenant inference/training platforms with strong isolation.
- Design and implement CI/CD strategy for models/agents/data pipelines (policy gates, canary/rollbacks, approvals).
- Institutionalize model/agent monitoring (quality, safety, drift) and business KPIs; sponsor continuous evaluations.
- Lead major reliability programs (capacity planning, disaster recovery, chaos testing, incident management).
- Establish and implement governance methodologies for datasets, prompts, models, and agents (lineage, approvals, etc.).
- Collaborate on security architecture with security teams (zero-trust, key management, vaults, secrets rotation, audit).
- Evaluate and integrate platforms/vendors; influence build-vs-buy; manage technical debt and roadmap.
- Mentor other engineers and help prioritize their work; build a culture of documentation, runbooks, and post-incident learning.
- Perform other duties that support the overall objective of the position.
- Bachelor’s degree in Computer Science, Information Technology, Electronics/Electrical Engineering, or a related field.
- Or, any combination of education and experience which would provide the required qualifications for the position.
- 5-8 years of hands-on experience in MLOps, DevOps, or related roles involving operation of an AI/ML platform at scale, with 10-12+ years of overall IT experience.
- Experience with IaC using Terraform at organizational scale, and strong experience in Unix-based environments.
- Expertise in containerization and orchestration (Docker/Kubernetes) and cloud, including networking, security, and autoscaling.
- Strong AWS experience is expected.
- Experience in building CI/CD pipelines using tools like Bitbucket Pipelines, AWS CodePipeline, or similar.
- Experience with mature observability stacks (e.g., Datadog/Dynatrace). Experience with LLM observability frameworks is a plus.
- Deep experience with operationalizing ML/AI models. Experience with LLMs or AI agents is a plus.
- Knowledge of: Database technologies and data pipelines (data lakes, lakehouses, warehouses, NoSQL, ETL/ELT processes). Solid understanding of model monitoring, logging, and debugging tools. Strong command of platform SRE practices and cost governance. Familiarity with feature stores, lakehouse patterns, distributed computing systems (Spark), and model versioning systems (MLflow).
- Skill in: Problem solving with a detail-oriented mindset. Excellent communication.
- Ability to: Collaborate effectively. Maintain a clear view of complete systems, and understand and work on different components as required.
NextGen Healthcare is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
Top Skills
AWS
AWS CodePipeline
Bitbucket Pipelines
Datadog
DevOps
Docker
Dynatrace
Kubernetes
MLflow
MLOps
Spark
Terraform
Unix
The Company
What We Do
NextGen Healthcare is on a relentless quest to improve the lives of those who practice medicine and those they care for. We provide tailored solutions to fit the precise needs of ambulatory practices, as they strive to reach the quadruple aim while navigating the journey of value-based care. The result? Healthier patients and happier providers.








