Cloud Reliability Engineer

Hiring Remotely in Brazil
Remote
Senior level
Information Technology • Software
The Role
The Cloud Reliability Engineer will manage cloud infrastructure operations, ensure reliability through monitoring and automation, and collaborate on incident responses while implementing SRE principles.

If you are looking for a meaningful career where people work with passion, rethink the status quo, and always strive to find the best solution, you have come to the right place. We develop future technologies to relentlessly make supply chains better.

We are a leader in supply chain software solutions, helping organizations streamline operations, reduce costs, and improve efficiency.

Key Responsibilities

        Cloud Infrastructure Operations

o    Operate, maintain, and improve cloud infrastructure in AWS, Azure, or GCP environments.

o    Manage and optimize Kubernetes clusters — deployment, scaling, patching, and upgrades.

o    Ensure system availability, scalability, and performance through proactive monitoring and optimization.

o    Maintain infrastructure-as-code (IaC) for consistent and repeatable deployments.

        Automation & Continuous Improvement

o    Identify opportunities for operational automation to eliminate manual processes (“reduce toil”).

o    Build and maintain automated pipelines for deployments, configuration, and remediation.

o    Develop self-healing mechanisms to automatically detect and resolve common service issues.

o    Participate in continuous improvement initiatives around reliability, performance, and efficiency.

        Reliability Engineering

o    Implement SRE principles: define and track SLIs, SLOs, and error budgets.

o    Perform incident analysis and postmortems to identify root causes and prevent recurrence.

o    Design proactive monitoring, alerting, and observability dashboards (Dynatrace, DataDog).

o    Collaborate with DevOps and development teams to build reliable, observable, and resilient systems.

        CI/CD and Release Operations

o    Manage and optimize CI/CD pipelines to ensure reliable and consistent delivery.

o    Support deployment strategies (blue/green, canary, rolling) to reduce downtime risk.

o    Collaborate with Product and DevOps teams on release readiness and rollback automation.

        Incident Response & Troubleshooting

o    Monitor, troubleshoot, and resolve infrastructure and application issues.

o    Respond to production incidents and ensure rapid mitigation and resolution.

o    Troubleshoot complex cloud, container, and networking issues across distributed systems.

o    Drive a culture of proactive monitoring, data-driven analysis, and preventive action.

Required Qualifications

        Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).

        5+ years of experience in Cloud Engineering, DevOps, or Site Reliability roles.

        Hands-on experience with cloud platforms (OCI, AWS, Azure, or GCP).

        Strong knowledge of Kubernetes deployment, management, and troubleshooting.

        Solid understanding of observability and monitoring (e.g., Dynatrace, DataDog) and incident management platforms.

        Proficiency in scripting and automation (e.g., Python, Bash, Terraform, Ansible).

        Strong troubleshooting and analytical skills across infrastructure and applications.

        Experience with incident response, RCA, and postmortem processes.

        A mindset of continuous improvement, reliability, and self-healing automation.

        Understanding of SRE principles, SLAs/SLOs/SLIs, and chaos engineering practices.

Preferred Skills

        Experience in conducting resilience assessments and recovery drills.

        Familiarity with ServiceNow and Dynatrace or other observability and ITSM tools.

        Experience with chaos engineering or resiliency testing frameworks.

        Background in networking, load balancing, and performance tuning.

        Strong communication and stakeholder management skills.

Soft Skills & Mindset

        Strong collaboration skills — comfortable working with developers, ops, and management.

        Clear communicator; able to translate technical issues into business impact.

        Self-starter with a problem-solving and automation-first mentality.

        Resilient under pressure — thrives in a dynamic, fast-paced environment.

        Passionate about operational excellence and continuous learning.

Key Success Metrics

        SLA/SLO compliance for critical services

        Reduction in MTTR (Mean Time to Recover)

        Increase in automated incident resolution rates

        Reduction in customer-impacting incidents

        Frequency and outcomes of resilience testing exercises

        Service uptime / availability

Why join us?   

At Infios, we're not just looking for employees; we're looking for partners in innovation, growth, and purpose. Meeting you where you are to create the future you need is at the core of who we are and what we do. Whether you're at the beginning of your career or a seasoned expert, we meet you on your journey, equipping you with the tools and opportunities to build the future you envision. Together, we will relentlessly work toward one common goal - making supply chains better.  

 

 

We believe the future is better when supply chains work better. 

We are an equal-opportunity employer and committed to inclusion in the workplace.  

At Infios, we believe that inclusion is a fundamental cornerstone of our success. We are committed to creating a safe and welcoming environment where every individual’s unique experiences and perspectives are valued—whether they look, think, move, believe, or love differently.  

  

All qualified applicants will receive consideration for employment without regard to race, color, ethnicity, national origin, sex, sexual orientation, gender identity, marital status, pregnancy, religion, age, disability, veteran status, genetic information, or any other characteristic protected by law.  
 
Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions of this role. If you require assistance or accommodation due to a disability during the recruiting process, please let us know at [email protected] 
 
Disclaimer: This job advertisement is not designed to cover a comprehensive listing of all duties or responsibilities that are required for this job. Please note that any salary information is a general guideline only. Individual compensation will be determined by various factors such as the scope and responsibilities of the position, experience, education, skills, location, and market and business considerations. Applications must be submitted via our career site. 

The Company
1,488 Employees

What We Do

We relentlessly make supply chains better. For everyone. No matter your business size. Whatever your goal. No matter the challenge. No matter your starting point. We will meet you where you are to create the future you need. With a portfolio of adaptable solutions, we empower businesses of all sizes to simplify operations, optimize efficiency and drive measurable impact. Infios serves more than 5,000 customers across 70 countries, delivering adaptable and innovative technologies that evolve with changing business needs.
