Job Summary
We are seeking a skilled Site Reliability Engineer (SRE) with 3–4 years of Azure Kubernetes experience and 7+ years in event-driven microservices to design, deploy, and operate scalable, secure AKS clusters. You’ll use GitHub Copilot to automate AKS provisioning, generate deployment templates, and optimize CI/CD pipelines, ensuring our healthcare platform (HQY) delivers high performance and reliability. This role is ideal for an SRE passionate about Kubernetes, Azure services, and security standards (e.g., ISO 27000, NIST 800), who thrives in fast-paced, remote environments and is excited to leverage GenAI for operational efficiency.
Responsibilities
As a Senior Site Reliability Engineer for the New API pods project, you will:
- Lead the design, deployment, and ongoing operations of Azure Kubernetes Service (AKS) clusters that support event-driven microservices critical to our healthcare platform.
- Utilize GitHub Copilot to automate AKS provisioning, generate deployment templates, and optimize CI/CD pipelines, enhancing operational efficiency and reducing manual overhead.
- Develop and maintain infrastructure as code using Terraform and PowerShell scripts to automate deployments and manage cloud resources consistently and securely.
- Drive continuous integration and continuous delivery (CI/CD) pipelines using Azure DevOps, ensuring rapid, reliable, and repeatable software releases.
- Implement and enforce security and compliance standards aligned with ISO 27000 and NIST 800 frameworks, safeguarding sensitive healthcare data and infrastructure.
- Enhance system observability and monitoring by integrating Azure Event Hubs, Azure Application Insights, and Prometheus, enabling proactive incident detection and resolution.
- Manage authentication and authorization mechanisms using OAuth2, Pod Security Policies, TLS, Managed Identities, and Service Principals to secure microservices and APIs.
- Optimize cluster performance and resource utilization by configuring autoscalers and tuning Kubernetes components.
- Support microservices deployment and traffic management using service mesh technologies such as Istio and Envoy, ensuring secure and reliable inter-service communication.
- Collaborate closely with software engineering, security, and operations teams to align infrastructure capabilities with application requirements and business goals.
- Leverage GenAI tools and automation to continuously improve operational workflows and reduce toil.
- Participate in agile ceremonies and contribute to a culture of continuous improvement, knowledge sharing, and innovation.
Requirements
- Proven AKS Expertise: 3–4 years of hands-on experience designing, deploying, and operating Azure Kubernetes Service (AKS), with 7+ years in scalable, secure, event-driven microservices.
- Strong hands-on knowledge of Istio, Kusto, Helm, and Envoy for service mesh and observability.
- Proficiency in Azure services (e.g., App Service, Service Bus, Event Hubs, ACR) and serverless computing.
- Experience with database technologies like Azure SQL Server, MongoDB, and PostgreSQL.
- Expertise in REST APIs and Swagger/OpenAPI for API specification and integration.
- Strong understanding of Agile, Scrum, or Kanban software development life cycles.
- Automation & Tooling:
  - 2+ years building CI/CD pipelines with Jenkins, Terraform, and Ansible Playbooks, with expertise in YAML scripting for AKS manifests and automation.
  - Experience with Git or SVN for code versioning and deployment automation.
  - Familiarity with ARM templates, PowerShell, or alternatives (CloudFormation, Ansible, Chef, Puppet) for infrastructure automation.
- GenAI Proficiency: Hands-on experience with GitHub Copilot or similar GenAI tools to accelerate scripting (e.g., YAML, Terraform, PowerShell), debug AKS configurations, and generate observability queries or test data.
- Security & Compliance: Deep knowledge of NIST, FedRAMP, CSA, and ISO 27000 standards, with experience implementing Pod Security Policies (PSP), node-to-node encryption, and HTTPS Ingress with TLS certificate management.
- Observability: Expertise in App Insights, Prometheus, and Kusto for real-time monitoring and diagnostics.
- Authentication/Authorization: Strong fundamentals in managing managed identities, service principals, and certificates for secure AKS access.
Nice-to-Have Skills
- Experience with Azure Active Directory or other IAM platforms for Kubernetes authentication.
- Familiarity with healthcare compliance (e.g., HIPAA) for secure data handling.
- Knowledge of Microsoft AutoGen for agentic automation workflows.
- Prior work with Azure Container Service (ACS) or Linux/Windows Kubernetes clusters.
What We Do
Experts in crafting digital products ⚡️
At Thaloz, the mission is to support clients at every stage of the digital product journey. With a team of over 100 experts and a global presence in 30 countries, we leverage top-tier Latin American talent to deliver exceptional software development solutions that drive success.
Our Services:
→ Product Lab: Comprehensive product development services to build and scale software solutions. From strategy and design to development, testing, and launch, every aspect is handled with expertise.
→ Talent Hub: Accelerate the team-building process by 50% with carefully vetted LATAM talent. Select the team members, and they will be seamlessly integrated into projects under the client's leadership.
→ Enterprise Pod: Optimize operations with streamlined complex integrations and flawless implementations of digital products for B2B companies, ensuring rapid and smooth deployments.
Ready to turn ideas into reality? Get in touch through www.thaloz.com/contact-us
Join our community! 👨‍💻
Instagram: @thalozteam
YouTube: @thalozteam
Clutch: @thaloz








