About Us
SUSE is a global leader in enterprise open source software. By transforming community innovations into secure, sovereign and AI-ready solutions, SUSE empowers customers to escape vendor lock-in and regain control of their IT destiny. Through industry-leading Linux, Kubernetes, Edge and AI infrastructure solutions, SUSE delivers the flexibility to innovate everywhere, from the data center to multi-cloud and out to the edge. Only SUSE also manages many Linux and Kubernetes distributions. At SUSE, Choice Happens because we prioritize community, interoperability and relentless innovation. Discover how we power mission-critical resilience at www.suse.com.
Agentic AI Solutions Engineer
Job Description
About the Role
This is a hands-on engineering role. You will be equally responsible for building new platform capabilities, keeping the platform operationally healthy, and maintaining the infrastructure-as-code and documentation that underpin it. You will work with one other engineer as a pair: sharing ownership of the full platform, peer-reviewing each other's work, and developing complementary depth across the stack over time.
The platform is in active delivery. You will join at a point where the core infrastructure is running and the next phase of security hardening, automation, and observability is under way. There is meaningful work to deliver from day one.
Key Responsibilities
Build
- Implement new platform capabilities from architectural designs, translating security, governance, and infrastructure requirements into production-grade infrastructure-as-code
- Design and build the platform security and secrets management layer, ensuring all workloads operate with least-privilege credentials and certificates issued through a governed PKI hierarchy
- Implement and enforce security policy across the cluster using admission control, covering workload configuration, image standards, network traffic, and resource constraints
- Build and establish the platform observability stack, providing consistent log aggregation, metrics, distributed tracing, and alerting across all platform components
- Design and implement GitOps delivery automation, ensuring all platform changes flow through version-controlled, auditable pipelines with drift reconciliation
- Build and configure workload autoscaling, ensuring AI workflow workers scale efficiently and cost-effectively in response to demand
- Implement the AI model routing and gateway layer, enabling governed, auditable routing of model traffic with per-consumer rate limiting
Operate
- Own the day-to-day operational health of the platform: monitor for issues, respond to incidents, conduct root-cause analysis, and implement lasting remediation
- Maintain the health of platform data services — database cluster, job queue, and object storage — including backup schedules, failover testing, and capacity management
- Monitor and tune autoscaling and resource configuration as workload patterns evolve, ensuring the platform scales responsively without over-provisioning
- Manage secrets rotation, certificate lifecycle, policy drift detection, and identity configuration as ongoing operational responsibilities
- Participate in planned high-stakes operational procedures — such as secrets infrastructure initialisation and rotation events — applying disciplined, documented execution
Maintain
- Own and evolve the infrastructure-as-code for your areas of the platform; keep all configurations versioned, peer-reviewed, and aligned with the architectural design
- Proactively identify and resolve technical debt — manual processes, undocumented configurations, legacy credential management, and gaps in observability coverage
- Produce and maintain operational runbooks for all platform procedures, ensuring any team member can execute them safely and independently
- Peer-review all platform infrastructure changes produced by your engineering counterpart, providing challenge and quality assurance across the full stack
- Contribute to platform documentation and knowledge-sharing, supporting the wider team's understanding of the platform as it matures
Candidates will need to demonstrate hands-on production delivery experience, not just conceptual familiarity. We expect evidence of real delivery against each of these areas at interview.
- Kubernetes — production cluster operation (RKE2, EKS, GKE, or equivalent); Helm, RBAC design, multi-namespace workload management
- Secrets management — production deployment of a secrets management platform (HashiCorp Vault or equivalent), covering PKI, dynamic credentials, and workload secrets injection
- Policy-as-code — admission control policy authoring and enforcement in production Kubernetes environments (OPA/Rego, Kyverno, or equivalent)
- GitOps — Fleet, ArgoCD, Flux, or equivalent at production scale; declarative drift reconciliation, rollback strategy, multi-environment targeting
- Observability stack — log aggregation, log pipeline design, distributed tracing (OpenTelemetry or equivalent), and metrics dashboards (Prometheus/Grafana or equivalent)
- API gateway engineering — production deployment and operation of an API or AI gateway (Kong, Envoy, or equivalent); rate limiting, plugin/policy authoring, route management
- Linux platform engineering — networking fundamentals, TLS and PKI, CSI storage operations, container runtime
Job
Information Technology
What We Offer
We empower you to be bold, driving your career to create the future you want. We celebrate and reward your achievements.
SUSE is a dynamic environment that is evolving rapidly, thus requiring agility, strong entrepreneurship and an open mind.
This is a compelling opportunity for the right person to join us as we continue to scale and prosper.
If you’re a big thinker who is obsessed with execution and thrives in a dynamic environment where you can tangibly create a lasting legacy, then please apply now!
We give you the freedom to be yourself. You will work in a global community of unique individuals – like you – with different backgrounds, talents, skills and perspectives. A truly open community where everyone is welcome, has a voice and is encouraged to reach their full potential regardless of age, gender, race, nationality, disability, sexual orientation, religion, or any other characteristics.
Sounds like the right fit for you? Click Apply to submit your resume. A recruiter will contact you if your skills match our current or any future positions. In the meantime, stay updated on the latest SUSE news and job vacancies by joining our Talent Community.
SUSE Values
SUSE's culture is centered on four key values - Choice, Community, Trust, and Innovation - which are deeply integrated with our open source ethos. SUSE fosters a diverse and inclusive environment where our people are encouraged to be themselves.
Choice:
We are continuously making choice happen
We are accountable for our choices
We never get complacent
Community:
Nobody is smarter than everybody
We embrace diversity of thought
We are “open source first, upstream first” where collaboration benefits all
Trust:
We are trusted to deliver with integrity
We offer trust by default, and do not wait for it to be earned
We foster an environment where everyone trusts each other
Innovation:
We foster a culture of experimentation, and embrace change by challenging the norm
We are committed to continuous improvement, creativity and adaptability
Ideas are great, but without execution they are just ideas
Skills Required
- Production Kubernetes cluster operation (RKE2, EKS, GKE or equivalent)
- Helm, RBAC design, and multi-namespace workload management
- Production deployment and operation of a secrets management platform (HashiCorp Vault or equivalent) including PKI and dynamic credentials
- Policy-as-code authoring and enforcement in Kubernetes (OPA/Rego, Kyverno or equivalent)
- GitOps at production scale (Fleet, ArgoCD, Flux or equivalent) with drift reconciliation and rollback strategies
- Design and operation of an observability stack: log aggregation, log pipeline design, distributed tracing (OpenTelemetry), Prometheus/Grafana metrics and dashboards
- API/AI gateway engineering and operation (Kong, Envoy or equivalent) including route management and rate limiting
- Linux platform engineering: networking fundamentals, TLS and PKI, CSI storage operations, and container runtime experience
- Infrastructure-as-code development and maintenance with version-controlled configurations and peer review
- Operational experience: incident response, root-cause analysis, backups, failover testing, capacity management and runbook creation
- Secrets rotation, certificate lifecycle management, policy drift detection, and identity configuration operations
- Design and implement workload autoscaling for AI workflow workers to balance cost and performance
- Experience implementing governed AI model routing/gateway layers with auditable routing and per-consumer rate limiting
- Demonstrable hands-on production delivery experience across the listed areas (not just conceptual familiarity)
What We Do
SUSE is a global leader in innovative, reliable and secure enterprise open source solutions, including SUSE Linux Enterprise (SLE), Rancher and NeuVector. More than 60% of the Fortune 500 rely on SUSE to power their mission-critical workloads, enabling them to innovate everywhere – from the data center to the cloud, to the edge and beyond. SUSE puts the “open” back in open source, collaborating with partners and communities to give customers the agility to tackle innovation challenges today and the freedom to evolve their strategy and solutions tomorrow. For more information, visit www.suse.com