IT Infrastructure Support Site Reliability Engineer II

Posted 19 Days Ago
Atlanta, GA, USA
In-Office
66K-104K Annually
Senior level
Information Technology
The Role
As a Site Reliability Engineer, you will ensure the reliability and performance of infrastructure systems, automate processes, troubleshoot incidents, and maintain compliance with SLIs and SLOs, while participating in on-call rotations.
Summary Generated by Built In

About the Job

We are seeking an experienced Site Reliability Engineer to join our IT Infrastructure Support team,
responsible for ensuring the reliability, scalability, and performance of critical physical security
infrastructure and supporting systems. In this role, you will combine software engineering expertise with
operations knowledge to build and maintain automation tools, monitoring systems, and processes that
support enterprise-grade server, network, and security device management. You will work closely with
cross-functional teams to define and enforce service level objectives, reduce operational toil through
automation, and drive continuous improvement in system resilience. This position requires 24x5
availability with an on-call rotation to ensure uninterrupted support for mission-critical infrastructure.

Key Responsibilities

  • Partner with leadership to establish, monitor, and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for infrastructure tooling, including configuration compliance rates, patch success rates, and deployment latency metrics.
  • Provide Level 3 expertise for tooling-specific incidents, focusing on automating incident remediation workflows and reducing Mean Time To Repair (MTTR) through intelligent automation and runbook development.
  • Identify and automate repetitive manual tasks across managed infrastructure, targeting measurable reductions in operational overhead (e.g., 50% reduction in manual server build time) through scripting and workflow automation.
  • Conduct thorough root cause analysis and lead blameless postmortems for all major service-impacting incidents, driving systemic improvements in tooling reliability and infrastructure resilience.
  • Engineer and maintain automated processes and scripts to populate, update, and synchronize asset management platforms (e.g., NetBox), configuration management databases, and monitoring systems for internal and external stakeholders.
  • Design, develop, and deploy full-stack applications, custom plugins, and automation scripts to extend the functionality of management and monitoring systems, enabling direct device interaction for configuration management.
  • Develop and maintain fully automated Infrastructure-as-Code configurations for Windows and Linux server roles using tools such as Ansible, Terraform, or Puppet, including drift detection and auto-remediation capabilities.
  • Build end-to-end automation pipelines for vulnerability patching, security baseline enforcement (CIS benchmarks), and continuous compliance auditing against internal and regulatory standards for physical security devices.
  • Develop API-driven tools for network configuration management, automated firmware updates, pre/post-change validation, and real-time network health monitoring across the device fleet.
  • Deploy and standardize monitoring agents, centralized log collection systems, and custom dashboards with alerts based on critical SLIs (latency, error rate, saturation, traffic) for servers and edge devices.
  • Build automation scripts for intelligent ticket handling, problem validation, and escalation workflows within enterprise ticketing systems, ensuring 2-hour initial response SLAs are consistently met.
  • Participate in a 24x5 on-call rotation to provide timely support for infrastructure systems, security devices, and related tooling, ensuring service continuity and rapid incident response.

Required Skills

  • 6+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
  • Strong proficiency in Python, Bash, and PowerShell for automation scripting, with experience in Go for building high-performance backend services and APIs.
  • Hands-on experience with Infrastructure-as-Code tools (Terraform, Ansible, Chef, or Puppet) and configuration management practices, including drift detection, version control, and automated remediation.
  • Advanced knowledge of Linux and Windows server environments, including Tier 3 troubleshooting capabilities, system hardening, and enterprise-scale server management.
  • Solid understanding of enterprise networking concepts, Cisco device administration, network automation protocols (NETCONF/RESTCONF), and experience with network monitoring and flow analysis tools.
  • Experience implementing and managing monitoring solutions (Prometheus, Grafana, Datadog) and centralized logging platforms (ELK Stack), with the ability to create custom dashboards and alerting rules.
  • Proficiency in implementing CI/CD pipelines, automated testing frameworks, and deployment strategies using modern DevOps tooling, with strong emphasis on code quality, security, and maintainability.

Salary Range

$66,120.00 - $104,400.00 USD (Salary)
  • Please note that the salary information provided herein is base pay only (gross). It does not include other forms of compensation that may or may not apply to this specific position, namely performance-based bonuses, benefits-related payments, or other general incentives. None of these are guaranteed; they may be subject to specific eligibility requirements and are wholly within the discretion of Astreya to remit.
  • Further, the salary information noted above is a range that consists of a minimum and maximum rate of pay for this specific position. Where an applicant or employee is placed on this range will depend on objective, documented work-related considerations such as education, experience, certifications, licenses, and preferred qualifications, among other factors.

Astreya offers comprehensive benefits to all Regular, Full-Time Employees, including:

  • Medical provided through UHC (PPO, HSA, Surest options) / Medical provided through Kaiser (HMO option only) for California employees only

  • Dental provided through UHC

  • Nationwide Vision provided by UHC

  • Flexible Spending Account for Health & Dependent Care

  • Pre-Tax Account for Commuter Benefit/Parking & Transit (location-specific)

  • Continuing Education and Professional Development via various integrated platforms, e.g. Udemy and Coursera

  • Corporate Wellness Program provided by Goomi Group

  • Employee Assistance Program

  • Wellness Days

  • 401k Plan

  • Basic and Supplemental Life Insurance

  • Short Term & Long Term Disability

  • Critical Illness, Critical Hospital, and Voluntary Accident Insurance

  • Tuition Reimbursement (available 6 months after start date, capped)

  • Paid Time Off (accrued and prorated, maximum of 120 hours annually)

  • Paid Holidays

  • Any other statutory leaves, paid time, or other ancillary benefits required under state and federal law

Top Skills

Ansible
Bash
Chef
CI/CD
Datadog
ELK Stack
Go
Grafana
NETCONF
PowerShell
Prometheus
Puppet
Python
RESTCONF
Terraform

The Company
HQ: San Francisco, California
1,958 Employees
Year Founded: 2001

What We Do

Astreya is the leading IT solutions provider for some of the world's most recognizable and innovative organizations. Our journey started in 2001 in the heart of Silicon Valley and now reaches thirty-three countries with more than 2,200 IT professionals. We enable businesses to make better decisions, achieve operational efficiency, and gain a competitive edge. The Astreya advantage is centered around focus and clear vision, world-class talent, and innovative technology. Creativity is in our DNA. Our dedicated Software and Service Innovation teams bring best-in-class technology and tools to bear for our clients.
