Sr. SRE / DevOps

NTD Software

México | Remote

Sorry, this job was removed at 10:36 a.m. (CST) on Monday, June 24, 2024

The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, scalability, and performance of production systems. The role focuses on monitoring, alerting, and dashboard creation with a strong emphasis on SRE tools like Grafana, Prometheus, and Datadog. The ideal candidate should have hands-on experience with Python scripting and be able to collaborate effectively with cross-functional teams to address service issues and improve system reliability.

Requirements

4+ years of experience in similar roles
Fluent English
Experience with creating and modifying Grafana dashboards for system monitoring.
Knowledge of Prometheus for setting up and maintaining monitoring systems.
Experience with Datadog for user and system monitoring.
Hands-on experience with Python scripting for automation and other tasks.
Understanding of SRE practices, including monitoring, alerting, and incident response.
Ability to create and enhance runbooks for incident response and remediation.
Experience with DevOps practices, such as CI/CD and infrastructure automation, is a secondary desired skill set.
Strong communication skills to collaborate with cross-functional teams and stakeholders.
Ability to proactively identify and address service issues.
Familiarity with ITIL processes, including Service Management, Knowledge Management, and Incident Management.
Experience with user and system monitoring, remediation, and implementation to maintain service stability.

Responsibilities

Create and modify Grafana dashboards to monitor system performance and user experience.
Set up and maintain monitoring and alerting systems using Prometheus and Datadog.
Collaborate with cross-functional teams to improve service reliability and respond to incidents.
Develop and enhance runbooks for incident response and remediation.
Proactively tune and maintain alerting to ensure timely detection of issues and minimize downtime.
Implement monitoring, remediation, and other operational practices to maintain high service levels.

More Information on NTD Software

NTD Software operates in the Consulting industry. The company is located in San Francisco, California, and was founded in 2021. It has 21 total employees and 12 open jobs.
