Senior AI Infrastructure & Platform Operations Engineer

Posted Yesterday
Hiring Remotely in Poznań, Województwo wielkopolskie, POL
In-Office or Remote
Senior level
Software
The Role
Lead operations for large-scale AI GPU infrastructure and Kubernetes platforms. Own incident resolution, root cause analysis, observability, capacity and performance analysis, automation, and technical leadership for platform reliability.
Company Description

Mirantis helps organizations ship code faster on public and private clouds. The company provides a public cloud experience on any infrastructure, from the data center to the edge. With Lens and the Mirantis Cloud Native Platform, Mirantis empowers a new breed of Kubernetes developers by removing infrastructure and operations complexity. It provides one cohesive cloud experience with complete app and DevOps portability, a single pane of glass, and automated full-stack lifecycle management with continuous updates.

Mirantis serves many of the world’s leading enterprises, including Adobe, DocuSign, Liberty Mutual, PayPal, Reliance Jio, Societe Generale, Splunk, and Volkswagen. Learn more at www.mirantis.com.

Job Description

About the Role

We are building a European AI Infrastructure & Platform Operations team responsible for operating large-scale AI infrastructure environments powered by NVIDIA GPUs, high-performance networking, Kubernetes, and next-generation platform technologies.

As a Senior AI Infrastructure & Platform Operations Engineer, you will serve as a technical leader within the operations organization, providing deep expertise across infrastructure, networking, platform operations, and service reliability. You will be responsible for driving operational excellence across complex production environments while acting as a key escalation point for critical incidents and challenging technical issues.

This role combines hands-on technical operations with technical leadership, helping shape operational standards, reliability practices, automation initiatives, and the future evolution of AI-powered operational services through platforms such as k0rdent AI.

Responsibilities:

Technical Operations & Service Reliability

  • Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
  • Act as a senior escalation point for operational teams during critical service-impacting events.
  • Support large-scale NVIDIA GPU infrastructure and high-performance networking environments.
  • Troubleshoot complex Linux, Kubernetes, networking, storage, and hardware-related issues.
  • Analyze platform performance, capacity, stability, and reliability trends to proactively identify risks.
  • Lead root cause analysis activities and drive long-term corrective actions.
  • Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve complex technical challenges.
  • Participate in major incident management and service restoration activities.

Platform Operations & Engineering

  • Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
  • Drive improvements in platform reliability, observability, monitoring, and operational processes.
  • Identify opportunities to automate repetitive operational activities and improve operational efficiency.
  • Contribute to operational readiness reviews, infrastructure changes, upgrades, and service introductions.
  • Support the adoption and operation of AI-powered infrastructure services and operational capabilities through k0rdent AI.
  • Evaluate emerging technologies and operational practices to improve service delivery and platform resilience.

Technical Leadership

  • Mentor and support AI Infrastructure & Platform Operations Engineers.
  • Share technical knowledge through documentation, training sessions, and operational reviews.
  • Develop and maintain operational standards, runbooks, troubleshooting guides, and best practices.
  • Help define operational processes, escalation paths, and service reliability standards.
  • Act as a trusted technical advisor during operational planning and service improvement initiatives.

Qualifications

Required Skills & Experience:

  • 7+ years of experience in infrastructure operations, platform operations, site reliability engineering, network operations, cloud operations, datacenter operations, or related technical roles.
  • Expert-level Linux administration and troubleshooting skills.
  • Strong networking expertise, including experience diagnosing complex performance, connectivity, and reliability issues.
  • Strong experience operating Kubernetes in production environments.
  • Experience supporting large-scale production infrastructure and distributed systems.
  • Proven experience leading technical investigations and managing complex incidents.
  • Experience performing root cause analysis and driving long-term operational improvements.
  • Strong understanding of observability, monitoring, and service reliability practices.
  • Excellent troubleshooting and analytical skills across multiple infrastructure domains.
  • Strong communication, collaboration, and stakeholder management skills.

Preferred Qualifications:

Experience in one or more of the following areas is highly desirable:

  • NVIDIA GPU infrastructure and accelerated computing platforms.
  • InfiniBand networking and NVIDIA UFM.
  • AI infrastructure environments.
  • HPC environments.
  • Platform Engineering or Site Reliability Engineering (SRE).
  • Large-scale Kubernetes operations.
  • Infrastructure automation technologies and Infrastructure-as-Code practices.
  • Observability platforms such as Grafana, Prometheus, ELK, or OpenTelemetry.
  • Performance analysis and optimisation of distributed infrastructure platforms.
  • Technical leadership, mentoring, or team lead responsibilities.

Additional Information

We offer the opportunity to:

  • Operate some of the most advanced AI infrastructure environments in production today.
  • Work with the latest NVIDIA GPU technologies, Kubernetes platforms, and high-performance networking environments.
  • Help define operational standards and reliability practices for next-generation AI infrastructure services.
  • Influence the adoption of AI-powered operational capabilities through k0rdent AI.
  • Work alongside highly skilled engineers solving complex infrastructure and platform challenges at scale.
  • Join a growing organisation investing heavily in AI infrastructure, platform services, and operational innovation.


We are a Leader in Container Management on G2 (#2 after AWS)!


The Company
HQ: Campbell, CA
729 Employees
Year Founded: 1999

What We Do

We are dedicated to helping organizations increase developer productivity and ship code faster on public and private clouds. We provide a ZeroOps experience that removes the stress of managing cloud native infrastructure by combining software and automation tools with our cloud native expertise to deliver the industry's leading secure cloud platforms. Our secure and reliable cloud native platform includes validated FIPS 140-2 encryption and DISA STIG-ready capabilities.

Who do we serve? We serve a wide range of industries, building on our extensive customer experience to provide distinct value in specific verticals including Financial Services, Government & Education, Healthcare, Manufacturing, and Telecommunications. Mirantis serves many of the world’s leading enterprises, including Adobe, DocuSign, Inmarsat, PayPal, Reliance Jio, Societe Generale, Splunk, and S&P Global. Learn more at www.mirantis.com.
