Site Reliability Engineer

Posted 5 Days Ago
Cedar Rapids, IA
Hybrid
124K-163K Annually
Senior level
Insurance
The Role
The Site Reliability Engineer is responsible for the reliability and performance of critical production systems, guiding teams in improving uptime and service quality through automation, monitoring tools, and best practices for system resilience.
Summary Generated by Built In

UFG is currently hiring a Site Reliability Engineer to serve as the senior-most engineer on the Production Management team, responsible for ensuring the reliability, performance, scalability, and efficiency of critical production systems and services. This role combines software engineering, systems engineering, solutions architecture, and deep knowledge of how technology functions to troubleshoot, operate, and enhance highly reliable distributed systems. Drawing on deep knowledge of the entire tech stack, the engineer will provide guidance and support to technology teams across Business Enablement and lead triage and resolution of the most challenging problems. The ideal candidate is proactive, automation-driven, and passionate about implementing solutions that enhance uptime, service quality, and developer productivity.

Essential Duties & Responsibilities: 

  • Implement tooling to monitor system health, capacity, and performance at all levels, from hardware through the VMs and all the way to the end-user interface.
  • Work with the production management team to troubleshoot incidents, restore service, and identify root causes.
  • Recommend architectural and implementation changes to products delivered by development teams based on their performance in test, performance, and production environments.
  • Support continuous improvement of ITIL processes through automation, data-driven insights, and proactive problem identification.
  • Document and integrate SRE practices into the ITIL framework, including incident, change, and problem management workflows.
  • Develop automation for system provisioning, monitoring, deployment, and recovery to reduce manual effort and human error.
  • Develop and maintain comprehensive runbooks, standard operating procedures (SOPs), and knowledge base articles for recurring operational tasks and incident response actions.
  • Collaborate with development teams to design resilient architecture and implement best practices for reliability and observability.
  • Enhance observability by developing and maintaining dashboards, alerts, and performance analytics.
  • Contribute to capacity planning, performance tuning, and resilience testing to ensure system health.
  • Develop and update problem management documentation, ensuring known errors and workarounds are captured within the ITSM system.
  • Manage incident response and participate in on-call rotations to ensure service reliability.
  • Define, document, and track key reliability metrics (SLIs, SLOs, SLAs) and implement continuous improvement initiatives.
  • Drive post-incident reviews (PIRs) and develop actionable insights to prevent future occurrences.
  • Partner with security teams to ensure systems meet compliance, security, and governance standards.
  • Evaluate and recommend new tools, technologies, and frameworks to improve operational efficiency.
  • Monitor network systems, servers, and applications.
  • Use all necessary tools to investigate performance and reliability of systems in testing environments. Provide detailed and specific guidance on ways to eliminate bottlenecks, improve resilience, and optimize speed and reliability.
  • Provide mentorship and technical support to other members of Production Management.

Job Specifications: 

Education: 

  • Bachelor’s degree in Information Technology, Computer Science, or a related field, or equivalent experience
  • Master’s or other advanced degree preferred

Experience: 

  • 10+ years of experience in progressively more demanding enterprise-scale technology roles
  • 3+ years of experience as a Site Reliability Engineer or Senior DevOps Engineer
  • 3+ years in software development, architecture, or related engineering discipline

Knowledge, skills & abilities: 

  • Advanced experience with multiple enterprise monitoring and observability tools, including Dynatrace, PRTG, DTrace, SolarWinds, and similar.
  • Complete Windows fluency mandatory; similar strengths in Linux and Unisys Mainframe environments helpful.
  • Excellent problem-solving and communication skills, with the ability to collaborate across cross-functional teams.
  • Unparalleled understanding of:
    • advanced networking concepts, with complete expertise in the entire TCP/IP stack
    • VM (VMware and Hyper-V) and physical compute performance and tuning, including networking and storage performance
    • VM (Java, Python, browser, and similar VM environments) threading, garbage collection, and general performance
    • SQL Server, including troubleshooting queries, indexes, and general performance
    • unstructured database performance
    • LLM/SLM implementations and GPU implementations (general understanding)
  • Proficiency in automation and scripting languages
  • Good understanding of ITIL processes (Incident, Change, Problem, and Service Level Management).

Working Conditions: 

  • General office environment.
  • This position will require off-hour escalations for incidents that occur outside of normal working hours.

Pay Transparency Statement:

UFG Insurance is committed to fair and equitable compensation practices. The base salary range for this position is $123,865 - $163,368 annually, which represents the typical range for new hires in this role. Individual pay within this range will be determined based on a variety of factors, including relevant experience, education, certifications, skills, internal equity, geography and market data. 

In addition to base salary, UFG Insurance offers a comprehensive total rewards package that includes:

  • Annual incentive compensation
  • Medical, dental, vision & life insurance
  • Accident, critical illness & short-term disability insurance
  • Retirement plans with employer contributions
  • Generous time-off program
  • Programs designed to support employee well-being and financial security

This pay range disclosure is provided in accordance with applicable state and local pay transparency laws.

Equal Opportunity Employer
This employer is required to notify all applicants of their rights pursuant to federal employment laws. For further information, please review the Know Your Rights notice from the Department of Labor.

Top Skills

DTrace
Dynatrace
Hyper-V
Java
Linux
PRTG
Python
SolarWinds
SQL Server
Unisys Mainframe
VMware
Windows

The Company
HQ: Cedar Rapids, Iowa
1,006 Employees
Year Founded: 1946

What We Do

Founded in 1946 as United Fire & Casualty Company, UFG Insurance (Nasdaq: UFCS) is engaged in the business of writing property and casualty insurance through its insurance company subsidiaries. Headquartered in Cedar Rapids, Iowa, UFG is licensed as a property and casualty insurer in 50 states, plus the District of Columbia, and is represented by approximately 1,000 independent agencies. A.M. Best Company assigns a financial strength rating of “A-” (Excellent) for the members of United Fire & Casualty Group, with a stable outlook, reflecting long-term balance sheet strength.
