SITE RELIABILITY ENGINEERING MANAGER

Posted 13 Days Ago
Camden, NJ
Hybrid
160K-190K Annually
Senior level
Information Technology • Logistics • Transportation • Analytics • Business Intelligence • 3PL: Third Party Logistics • Industrial
Best in Cold — Redefining Cold Storage Through Technology
The Role
The Site Reliability Engineering Manager will establish and lead the SRE practice, focusing on system reliability, monitoring implementation, and collaboration with engineering teams to enhance operational processes.

Site Reliability Engineering Manager 

Ensure Reliability of Systems that Move the Nation's Food Supply 

Who We Are:

United States Cold Storage owns and operates one of the most complex temperature-controlled logistics networks in North America. Every day, our systems coordinate the storage and movement of food at national scale across a network of state-of-the-art distribution centers, including multiple highly automated warehouse facilities.

We continue to advance our core warehouse and logistics platforms, with a current focus on modular, event-driven, API-first, and cloud architectures. We are also strengthening our SRE and AI practices to improve reliability and accelerate engineering productivity. This is a large investment in innovation to drive operational excellence at our facilities.

If you want to build durable systems that operate in the physical world at scale, this is that opportunity.

The Role:

The Site Reliability Engineering Manager will design and implement the company’s SRE framework from the ground up.

You will define what reliability means at US Cold.
You will establish SLIs and SLOs.
You will modernize monitoring and incident response.
You will build the playbook others will follow.

This is both a hands-on technical role and a practice-building leadership position.

You will report to the Director of IT Operations and collaborate across Software Engineering, Customer Integration Technology, Data Engineering, Infrastructure, and Security.

What You Will Own:

  • Establish the company’s first SRE practice, including principles, standards, tooling, and operational processes.
  • Define SLIs, SLOs, and error budgets across SaaS, on-prem, and custom services.
  • Build reliability dashboards and executive-level reporting.
  • Implement and evolve observability across logs, metrics, and distributed tracing.
  • Mature incident response, outage management, and post-incident review processes.
  • Partner with engineering to design resilient systems and reduce operational toil.
  • Strengthen CI/CD reliability using safe deploy strategies such as canary and blue/green patterns.
  • Implement cost visibility and cloud governance in partnership with Finance.
  • Build runbooks, playbooks, and operational standards.
  • Establish on-call structures and escalation clarity.
  • Assist in hiring, mentoring, and developing future SRE team members.

This is foundational work. The systems and practices you design will shape how engineering operates for years.

Technical Environment:

  • Azure cloud infrastructure
  • Infrastructure as Code using Bicep, Terraform, or ARM
  • GitHub Actions for CI/CD orchestration
  • Safe deployment patterns including gated releases, canary, and blue/green
  • Observability across logging, metrics, and distributed tracing
  • Python scripting for automation and reliability tooling
  • SaaS integrations, on-prem infrastructure, and custom-built services

What We’re Looking For:

  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
  • 5–7+ years in SRE, DevOps, Infrastructure, or Production Engineering.
  • Hands-on ownership of production services.
  • Proven experience implementing SLIs, SLOs, observability, and automation.
  • Leadership in major incident response and post-incident reviews.
  • Deep CI/CD expertise, particularly GitHub Actions.
  • Strong Python scripting for automation and operational tooling.
  • Practical knowledge of cloud cost optimization and FinOps principles.
  • Ability to influence cross-functional teams.

Why This Role Is Different:

This is not an inherited SRE function.
There is no existing framework to simply maintain.

You will:

  • Define the reliability bar.
  • Build the operating model.
  • Influence architectural decisions.
  • Establish executive-level visibility into system health.
  • Create a culture where reliability is engineered, not reactive.

This is an opportunity to build something durable inside a company modernizing its core technology platform.

Compensation & Structure:

  • Salary Range: $160,000.00 - $190,000.00/yr.
  • Bonus Eligible
  • Full-time, exempt
  • Reports to: Director of IT Operations
  • Travel: less than 10%
  • Location: Hybrid, Camden, NJ

Operational Context:

This role is primarily technical and office-based, with occasional interaction in operational environments depending on system needs.

Benefits Include:

If annual hours are attained, these benefits may apply: Medical, Dental, Vision, Prescription, Legal Insurance, Pet Discount, Critical Illness, Accident Insurance, Hospital Indemnity, Long Term Care + Permanent Life Insurance, Identity Theft Protection, Short Term Disability Insurance, Long Term Disability Insurance, Supplemental Disability Insurance, Basic Life Insurance, Accidental Death and Dismemberment Insurance, Supplemental Life Insurance, Supplemental Spouse Life Insurance, Child Life Insurance, Loan Solution, Health Flexible Spending Account, Dependent Flexible Spending Account, Telemedicine, Virtual Primary Care, Prescription Savings Plan, Prescription Specialty Copay Assistance Program, Weight Management Program, Chronic Condition Management, Care Navigator Program, 24/7 Nurse Line, Expert Medical Opinion, Precious Additions Maternity Program, Health Advocacy, Employee Assistance Program, Digital Cognitive Behavioral Therapy, Digital Physical Therapy, Behavioral and Mental Health Platforms, Auto and Home Discount Program, Secure Travel Protection, Discount Programs, 401(k) Plan, Education Assistance, Paid Time Off, Referral Program, and Commuter Benefit (NJ only).

Physical & Operational Context:

May require physical effort associated with using the computer to access information, or occasional standing, walking, and lifting needed to carry out everyday activities. Effective communication, vision, and hearing are essential for safety and productivity. Operate scanners, tablets, radios, phones, computers, and other essential equipment as required. Additional work hours may be requested by management to help manage employee production, projects, and/or special events. Engage in frequent personal interaction and communication. Attend in-person meetings and/or training on a regular basis. Possess strong arithmetic and reading skills. Follow verbal instructions, written instructions, and company policies. Work independently and coordinate with others. Work in a fast-paced environment while managing stress and meeting productivity standards.

Additional Information:

Job functions may vary based on the area of operation. This description outlines the most common tasks required for the job. Reasonable accommodation may be provided to enable individuals with disabilities to perform essential duties. This job description may not encompass all tasks necessary to complete the role.

Equal Opportunity Employer
This employer is required to notify all applicants of their rights pursuant to federal employment laws. For further information, please review the Know Your Rights notice from the Department of Labor.

Top Skills

ARM
Azure
Bicep
GitHub Actions
Python
Terraform

The Company
HQ: Camden, NJ
4,300 Employees
Year Founded: 1889

What We Do

For more than 125 years, United States Cold Storage (USCS) has protected America’s food supply through a national network of 39 world-class, temperature-controlled facilities across 14 states. Today, that mission is driven as much by software as it is by steel and concrete.

Behind every truck, pallet, and movement is a modern, in-house engineering organization building the mission-critical systems that run our warehouses. Our teams develop and operate the technology that tracks inventory in real time, optimizes labor and throughput, integrates automation and robotics, and connects seamlessly with our partners’ supply chains. We work across multiple clouds and databases, leverage modern data platforms like Snowflake, and solve complex computer science problems, from optimization to routing at scale.

With more than 4,300 employees and a 200+ person IT organization operating like a modern product and engineering group, USCS offers the rare combination of stability, scale, and complex technical challenges. We do whatever it takes to help our partners profitably reach their goals, whether that’s primary cold storage or fully integrated third-party logistics, because when the temperature matters, getting it right matters even more.

Our facilities are located in: California, Delaware, Florida, Georgia, Illinois, Indiana, Nebraska, North Carolina, Pennsylvania, Tennessee, Texas, Utah, and Virginia.

Why Work With Us

USCS is a family built on trust, teamwork, and pride. From freezer floors to engineering, everyone helps build the next generation of cold storage with practical tech that makes operations safer and smarter, because here you don’t just keep food moving; you build how it moves next.
