Intermedia Cloud Communications

Site Reliability Engineer - IDP

Posted 25 Days Ago

Hiring Remotely in Portugal

Remote

Mid level

Cloud • Information Technology • Consulting

The Role

Design, implement, and maintain reliability, observability, and automation for production services. Define SLIs/SLOs, automate deployments and recovery, improve CI/CD, run incident response and root cause analysis, conduct chaos exercises, and collaborate with platform, dev, and security teams to increase resilience and operational maturity.

Summary Generated by Built In

About Intermedia:
Intermedia has established itself as a leading provider of cloud communications and collaboration tech that allows companies to connect better. We have a strong track record of growth, profitability, and creating an environment where everyone matters. Everyone. While we are fast-paced and admittedly a bit intense, we promise that you won’t be bored. You will find Intermedia is a place where you can indulge your passion for creating and supporting great cloud technology. What’s more, we always look to promote from within and have many employees who have been with us 10, 15, and 20+ years!
Are you looking for a company where YOUR VOICE is heard? Where you can MAKE A DIFFERENCE? Do you THRIVE in a FAST-PACED work environment? Do you wake every morning EXCITED to work with GREAT PEOPLE and create SUCCESS TOGETHER? Then Intermedia is the place for you.

Culture at Intermedia is built on teamwork and transparency. We hold each other accountable and always have each other’s back!

Are you ready to make your mark?

About the Role

Intermedia’s Site Reliability Engineers (SREs) play a critical role in ensuring the reliability, availability, scalability, and performance of our most important applications and services.

As an SRE at Intermedia, you will focus on improving application reliability and operational excellence, working closely with software engineering, platform, and DevOps teams to design, monitor, automate, and continuously enhance service stability. You will apply engineering principles to operations, reduce manual effort, strengthen observability, and minimize downtime, ensuring our services are resilient and ready to support our customers at all times.

This role is a strong fit for someone who is hands-on, highly analytical, and comfortable working across application, platform, and operational boundaries to improve production reliability at scale.
While primarily remote, this role requires occasional visits to the office in Coimbra. We plan to open offices in Aveiro and Porto in the future. This approach gives team members the flexibility to work remotely while also coming together in the office for collaboration and teamwork.

What you will be doing:

Ensure the availability, performance, and reliability of critical applications and services by designing and implementing robust monitoring, alerting, and optimization strategies.
Define, measure, and maintain SLIs, SLOs, and error budgets to support service reliability goals.
Partner with development teams to improve performance, reduce latency, and increase the resilience of applications in production.
Work closely with platform and DevOps teams to ensure smooth alignment between infrastructure and application reliability.
Define reliability standards and operational guardrails for platform capabilities and golden paths.
Partner with platform engineering teams to design resilient self-service capabilities.
Automate operational tasks such as deployments, rollbacks, scaling, failover, and recovery processes.
Continuously improve CI/CD pipelines to reduce manual intervention and support safe, progressive delivery practices.
Integrate automated validation, reliability checks, and operational guardrails into development and deployment workflows.
Implement and maintain observability capabilities across production systems, including metrics, logs, traces, and dashboards.
Develop dashboards, alerts, and operational views that provide real-time visibility into system health and application behavior.
Act as a key responder during incidents, collaborating across teams to troubleshoot, mitigate, and resolve production issues.
Conduct root cause analysis for incidents and drive long-term corrective actions to prevent recurrence.
Run fire drills, game days, and chaos engineering exercises to validate system resilience under failure conditions.
Monitor resource usage, capacity trends, and scaling behavior to support future growth and performance needs.
Partner with security teams to ensure services align with security best practices, including secure communication, access controls, and data protection.
Lead or contribute to regular production readiness and operational review meetings to assess system health, review incidents, and prepare for releases.
Promote reliability engineering best practices across teams and help strengthen the overall operational maturity of the organization.

What you will bring to the role:

Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
Proven experience in Site Reliability Engineering, Platform Engineering, or Infrastructure/DevOps roles with strong operational ownership.
Strong expertise in application monitoring, observability platforms, incident response, and troubleshooting in production environments.
Strong understanding of reliability engineering concepts such as SLIs, SLOs, error budgets, alerting quality, and incident management.
Proficiency in scripting and automation using tools and languages such as Python, Bash, Terraform, Ansible, or similar.
Experience with cloud platforms such as AWS, Azure, or Google Cloud.
Strong knowledge of CI/CD pipelines, deployment automation, and progressive delivery practices.
Strong knowledge of infrastructure as code and configuration management approaches.
Experience with containerization and orchestration, such as Docker and Kubernetes.
Strong problem-solving skills, operational judgment, and attention to detail.
Excellent communication and collaboration skills, with the ability to work effectively across engineering, platform, and security teams.

Bonus Skills

Experience with chaos engineering practices and tools.
Experience supporting internal platforms or platform engineering teams.
Familiarity with developer portals, golden paths, service catalogs, or self-service platform patterns.
Understanding of developer experience metrics and operational maturity for internal platforms.
Familiarity with microservices architectures and multi-tenant environments.
Experience with modern observability stacks and telemetry standards.
Understanding of UCaaS and CCaaS platforms, especially voice and communication service flows.
Experience leading reliability initiatives, incident reviews, or production improvement programs.
Familiarity with capacity planning, resilience testing, and operational readiness practices.

Diversity, Inclusion, and Equal Opportunity

We hire, promote, and compensate employees based on their ability to perform their job responsibilities, without regard to race, color, creed, religion, sex, gender, marital status, national origin, ancestry, age, citizenship, physical or mental disability, sexual orientation, or any other basis protected by applicable law (collectively referred to in our Code of Conduct as “Protected Classes”). We do not tolerate employment discrimination in the workplace, and we are committed to making reasonable accommodations for identified disabilities or other limitations as required by all applicable laws. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

About
To explore other opportunities, check out our careers page: https://www.intermedia.com/about-us/careers

The Company

HQ: Sunnyvale, CA

1,169 Employees

Year Founded: 1993

What We Do

Intermedia is the cloud communications company that helps over 135,000 businesses connect better – through voice, video conferencing, chat, contact center, business email and productivity, file sharing and backup, security, archiving, and more – from wherever, whenever. We strive to eliminate the need for multiple communications service providers with a seamlessly integrated portfolio of communications and collaboration solutions – all delivered through one highly reliable and secure platform.

With month-to-month contract options, one monthly bill, one intuitive point of administrative control, and having been certified by J.D. Power seven times for providing “An Outstanding Customer Service Experience,” Intermedia is committed to providing enterprise-grade products to businesses of all sizes through a simple, Worry-Free Experience.

As a partner-first company, Intermedia goes to work for over 7,500 channel partners by providing a comprehensive set of programs, resources, and support to help them grow their revenue and maximize their success. Programs include our Customer Ownership Reseller (CORE™) model – which enables partners to resell, package, and manage Intermedia's solutions as if they were their own, while benefiting from highly attractive economic terms and maintaining ownership of their customer relationships – as well as agent models. Intermedia is also proud to be the exclusive cloud communications platform provider for NEC, a leader in global market share for unified communications with an estimated 80+ million business phone users worldwide.

Recent Awards:
• J.D. Power Certified Assisted Technical Support Program – 2023, 2021, 2020, 2019, 2018, 2017, 2016
• Inc. Magazine’s Best Workplaces of 2021
• PCMag Editor’s Choice – Intermedia Unite
• PCMag Editor’s Choice – Intermedia AnyMeeting
• PCMag Editor’s Choice – Intermedia Hosted Exchange
• CRN - 5-Star Partner Program - 2023, 2022
• CRN’s Cloud Computing Product of 2022, 2021 – Intermedia Unite