Platform Reliability Analyst - BOT

Posted 10 Days Ago
Hiring Remotely in India
Remote
1-3 Years Experience
Agency • Big Data • eCommerce • Professional Services • Analytics • Consulting
We co‑innovate with the world’s most ambitious brands to create transformative digital experiences.
The Role
The Platform Reliability Analyst ensures the health of cloud infrastructure by monitoring performance, managing incident responses, and collaborating with teams to resolve issues. Responsibilities include proactive system monitoring, coordinating outages, and conducting root cause analyses. The ideal candidate has strong communication skills and an understanding of cloud technologies to support system optimization.
Summary Generated by Built In

The Platform Reliability Analyst is responsible for ensuring the continuous monitoring and overall health of our cloud infrastructure hosted on platforms such as AWS, Rackspace, Expedient, and Heroku. This role involves proactive monitoring of system performance, coordinating incident response efforts, and collaborating with development, cloud, and operations teams to address issues before they impact the business. The ideal candidate will have a process-oriented mindset, strong communication skills, and a foundational understanding of cloud technologies to facilitate rapid resolution of incidents and optimize system performance.

Key Responsibilities:

  • Proactive System Monitoring: Oversee system performance and availability through continuous monitoring of alerts from various monitoring and APM tools (New Relic, CloudWatch, etc.). Provide feedback on alert tuning, identify patterns in incidents, and pinpoint optimization opportunities (e.g., identifying idle systems that could be shut down).
  • Production Support: Build and maintain a comprehensive understanding of all software systems and their variations. Ensure readiness to support production systems by identifying potential issues before they affect customers.
  • Outage Management: Lead the incident command center during outages with a focus on rapid resolution. Coordinate incident response by:
      • Recording incident start/end times and affected systems.
      • Notifying internal stakeholders and support teams of the incident status.
      • Coordinating the involvement of the correct teams and ensuring all relevant details are shared.
      • Providing and executing runbooks, or coordinating with cloud teams for execution.
      • Running incident bridges, ensuring systems, logs, and traffic are monitored and relevant experts are involved.
      • Documenting facts versus theories in real time during incident resolution.
  • Incident Communication: Notify the company about incidents and coordinate with support to inform customers. Once a status page is introduced, manage its updates to keep system status transparent.
  • Incident Prevention and Follow-Up: Be the first line of defense by proactively identifying system issues before customers are impacted. Conduct root cause analysis (RCA) after incidents to determine underlying issues and implement preventative measures. Create and update runbooks as needed.
  • Collaboration and Coordination: Regularly set up meetings with cloud and development teams to address and resolve recurring issues. Communicate proactively with leadership about any potential cost increases or system inefficiencies.
  • System Health Metrics: Monitor traffic, system health, security perimeter, and overall performance. Track key metrics such as the percentage of issues identified proactively versus reactively.

Key Skills and Qualifications:

  • Strong Communication Skills: Ability to convey the status of incidents and performance issues in clear, concise English to both technical and non-technical stakeholders.
  • Process-Oriented Mindset: Ability to follow, document, and improve processes to ensure smooth incident management and resolution.
  • Attention to Detail: Ability to record key details about system health, performance, and incident facts versus theories in real time.
  • Familiarity with Monitoring Tools: Experience using monitoring and alerting tools such as New Relic, CloudWatch, or Datadog, and familiarity with logs, traffic monitoring, and system health metrics.
  • Coordination and Leadership Skills: Ability to lead incident response teams, coordinate with various technical experts, and manage communication effectively during outages.
  • Basic Technical Understanding: Although this is not an engineering role, some technical familiarity with cloud environments, system alerts, and security practices is important. Entry-level engineers with an interest in coordination roles are encouraged to apply.
  • Collaboration: Ability to work cross-functionally with development, cloud, and support teams to ensure smooth operations and proactive issue resolution.

Top Skills

AWS

The Company
HQ: Chicago, IL
1,800 Employees
Remote Workplace
Year Founded: 2003

What We Do

Founded in 2003 in Chicago, we’re a leading global digital experience consultancy that elevates brand experiences and drives superior client outcomes with services in Strategy & Insights, Experience Design, Technology, Analytics, and Marketing. Our services are designed around a single, unifying purpose: to help brands compete and win through a continuous and collaborative partnership we call co-innovation. Our people, our clients, and our partners are what put the “us” in Bounteous.

Why Work With Us

Our company's success is a product of our people. We hire the best and brightest professionals to co-innovate with brands on their digital futures. With state-of-the-art collaboration centers worldwide, we’re able to co-innovate with clients and build lasting connections with team members to foster career growth.

Bounteous Offices

Remote Workspace

Employees work remotely.

Our remote-first teams of highly talented individuals collaborate and co-innovate across the globe. We know that productivity can happen anywhere. You’re encouraged to work how and where you perform best.

Typical time on-site: None
HQ: Chicago, IL
Atlanta, GA
Chennai, IN
Delhi, IN
Gurugram, IN
Hyderabad, IN
Pittsburgh, PA