Sr. SRE / DevOps

México | Remote
Sorry, this job was removed at 10:36 a.m. (CST) on Monday, June 24, 2024

The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, scalability, and performance of production systems. The role focuses on monitoring, alerting, and dashboard creation with a strong emphasis on SRE tools like Grafana, Prometheus, and Datadog. The ideal candidate should have hands-on experience with Python scripting and be able to collaborate effectively with cross-functional teams to address service issues and improve system reliability.

Requirements

  • 4+ years of experience in similar roles
  • Fluent English
  • Experience with creating and modifying Grafana dashboards for system monitoring.
  • Knowledge of Prometheus for setting up and maintaining monitoring systems.
  • Experience with Datadog for user and system monitoring.
  • Hands-on experience with Python scripting for automation and other tasks.
  • Understanding of SRE practices, including monitoring, alerting, and incident response.
  • Ability to create and enhance runbooks for incident response and remediation.
  • Experience with DevOps practices, such as CI/CD and infrastructure automation, is a secondary desired skill set.
  • Strong communication skills to collaborate with cross-functional teams and stakeholders.
  • Ability to proactively identify and address service issues.
  • Familiarity with ITIL processes, including Service Management, Knowledge Management, and Incident Management.
  • Experience with user and system monitoring, remediation, and implementation to maintain service stability.

Responsibilities

  • Create and modify Grafana dashboards to monitor system performance and user experience.
  • Set up and maintain monitoring and alerting systems using Prometheus and Datadog.
  • Collaborate with cross-functional teams to improve service reliability and respond to incidents.
  • Develop and enhance runbooks for incident response and remediation.
  • Proactively work with alerting to ensure timely detection of issues and minimize downtime.
  • Implement monitoring, remediation, and other operational practices to maintain high service levels.
More Information on NTD Software
NTD Software operates in the Consulting industry. The company is located in San Francisco, California. NTD Software was founded in 2021. It has 21 total employees.