Cloud Reliability Engineer

Infios

Hiring Remotely in Brazil

Remote

Senior level

Information Technology • Software

The Role

The Cloud Reliability Engineer will manage cloud infrastructure operations, ensure reliability through monitoring and automation, and collaborate on incident responses while implementing SRE principles.

If you are looking for a meaningful career where people work with passion, rethink the status quo, and always strive to find the best solution, you have come to the right place. We develop future technologies to relentlessly make supply chains better.

We are a leader in supply chain software solutions, helping organizations streamline operations, reduce costs, and improve efficiency.

Key Responsibilities

▶ Cloud Infrastructure Operations

o Operate, maintain, and improve cloud infrastructure in AWS, Azure, or GCP environments.

o Manage and optimize Kubernetes clusters — deployment, scaling, patching, and upgrades.

o Ensure system availability, scalability, and performance through proactive monitoring and optimization.

o Maintain infrastructure-as-code (IaC) for consistent and repeatable deployments.

▶ Automation & Continuous Improvement

o Identify opportunities for operational automation to eliminate manual processes (“reduce toil”).

o Build and maintain automated pipelines for deployments, configuration, and remediation.

o Develop self-healing mechanisms to automatically detect and resolve common service issues.

o Participate in continuous improvement initiatives around reliability, performance, and efficiency.

▶ Reliability Engineering

o Implement SRE principles: define and track SLIs, SLOs, and error budgets.

o Perform incident analysis and postmortems to identify root causes and prevent recurrence.

o Design proactive monitoring, alerting, and observability dashboards (Dynatrace, Datadog).

o Collaborate with DevOps and development teams to build reliable, observable, and resilient systems.

▶ CI/CD and Release Operations

o Manage and optimize CI/CD pipelines to ensure reliable and consistent delivery.

o Support deployment strategies (blue/green, canary, rolling) to reduce downtime risk.

o Collaborate with Product and DevOps teams on release readiness and rollback automation.

▶ Incident Response & Troubleshooting

o Monitor, troubleshoot, and resolve infrastructure and application issues.

o Respond to production incidents and ensure rapid mitigation and resolution.

o Troubleshoot complex cloud, container, and networking issues across distributed systems.

o Drive a culture of proactive monitoring, data-driven analysis, and preventive action.

Required Qualifications

▶ Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).

▶ 5+ years of experience in Cloud Engineering, DevOps, or Site Reliability roles.

▶ Hands-on experience with cloud platforms (OCI, AWS, Azure, or GCP).

▶ Strong knowledge of Kubernetes deployment, management, and troubleshooting.

▶ Solid understanding of observability and monitoring tools (e.g., Dynatrace, Datadog) and incident management platforms.

▶ Proficiency in scripting and automation (e.g., Python, Bash, Terraform, Ansible).

▶ Strong troubleshooting and analytical skills across infrastructure and applications.

▶ Experience with incident response, RCA, and postmortem processes.

▶ A mindset of continuous improvement, reliability, and self-healing automation.

▶ Understanding of SRE principles, SLAs/SLOs/SLIs, and chaos engineering practices.

Preferred Skills

▶ Experience in conducting resilience assessments and recovery drills.

▶ Familiarity with ServiceNow and Dynatrace or other observability and ITSM tools.

▶ Experience with chaos engineering or resiliency testing frameworks.

▶ Background in networking, load balancing, and performance tuning.

▶ Strong communication and stakeholder management skills.

Soft Skills & Mindset

▶ Strong collaboration skills — comfortable working with developers, ops, and management.

▶ Clear communicator; able to translate technical issues into business impact.

▶ Self-starter with a problem-solving and automation-first mentality.

▶ Resilient under pressure — thrives in a dynamic, fast-paced environment.

▶ Passionate about operational excellence and continuous learning.

Key Success Metrics

▶ SLA/SLO compliance for critical services

▶ Reduction in MTTR (Mean Time to Recover)

▶ Increase in automated incident resolution rates

▶ Reduction in customer-impacting incidents

▶ Frequency and outcomes of resilience testing exercises

▶ Service uptime / availability

Why join us?

At Infios, we're not just looking for employees; we're looking for partners in innovation, growth, and purpose. Meeting you where you are to create the future you need is at the core of who we are and what we do. Whether you're at the beginning of your career or a seasoned expert, we meet you on your journey, equipping you with the tools and opportunities to build the future you envision. Together, we will relentlessly work toward one common goal - making supply chains better.

We believe the future is better when supply chains work better.

We are an equal-opportunity employer and committed to inclusion in the workplace.

At Infios, we believe that inclusion is a fundamental cornerstone of our success. We are committed to creating a safe and welcoming environment where every individual’s unique experiences and perspectives are valued—whether they look, think, move, believe, or love differently.

All qualified applicants will receive consideration for employment without regard to race, color, ethnicity, national origin, sex, sexual orientation, gender identity, marital status, pregnancy, religion, age, disability, veteran status, genetic information, or any other characteristic protected by law.

Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions of this role. If you require assistance or accommodation due to a disability during the recruiting process, please let us know at [email protected]

Disclaimer: This job advertisement is not designed to cover a comprehensive listing of all duties or responsibilities that are required for this job. Please note that any salary information is a general guideline only. Individual compensation will be determined by various factors such as the scope and responsibilities of the position, experience, education, skills, location, and market and business considerations. Applications must be submitted via our career site.

The Company

1,488 Employees

What We Do

We relentlessly make supply chains better. For everyone. No matter your business size. Whatever your goal. No matter the challenge. No matter your starting point. We will meet you where you are to create the future you need. With a portfolio of adaptable solutions, we empower businesses of all sizes to simplify operations, optimize efficiency and drive measurable impact. Infios serves more than 5,000 customers across 70 countries, delivering adaptable and innovative technologies that evolve with changing business needs.