Sr. Site Reliability Engineer - SRE

QAD

Posted 6 Hours Ago

Hiring Remotely in Barcelona, Cataluña, ESP

In-Office or Remote

Senior level

Software

The Role

Lead SRE efforts to design, implement, and maintain highly available, scalable systems. Own Datadog-based observability, build reliability-focused software tooling, automate infrastructure (Terraform, GitHub Actions, GitOps), run on-call/incident response (OpsGenie), define SLIs/SLOs, drive toil reduction, and mentor teams on reliability best practices.

Summary Generated by Built In

Company Description

Redzone is the #1 Connected Workforce Solution for manufacturers big and small. We work to improve efficiency in plants, provide coaching for best practices, and enable the front-line worker to improve the quality of their work and their work life by providing them with tools, processes, and collaboration tools to keep their manufacturing lines running smoothly and efficiently.

At Redzone we focus on the customer experience, listening to the customer, and providing solutions that create great outcomes. We are a combination of great leadership, years of manufacturing experience, and an incredible technology team that all work together to create great products.

This role is fully remote.

Job Description

We are expanding our Site Reliability Engineering (SRE) team and seeking a highly skilled and passionate Senior SRE to join us. As a member of our growing SRE function, you will play a critical role in ensuring the reliability, scalability, and performance of our mission-critical services that power our customer experience. This is an exciting opportunity to shape our SRE practices, drive automation, and significantly impact our product's operational excellence.

What You'll Do:

Drive Operational Excellence: Design, implement, and maintain highly available, scalable, and resilient systems that deliver exceptional customer experience.
Datadog Expert: Be one of the go-to experts for Datadog. You will be responsible for defining, implementing, and enforcing best practices for monitoring, alerting, logging, tracing, and synthetic testing across our entire AWS environment. This includes deep hands-on configuration, dashboarding, troubleshooting, and optimization within Datadog.
Software Development for Reliability: Develop robust, well-tested, and maintainable software and tooling to automate operational tasks, create self-service capabilities for engineering teams, and enhance system reliability. This will involve building applications, not just scripts.
Toil Reduction Champion: Identify and eliminate toil through automation, process improvements, and systematic problem-solving. Work proactively to shift our operational focus from reactive firefighting to proactive engineering.
Incident Management & Post-Mortems: Contribute to and evolve our incident response framework, participating in on-call rotations (using OpsGenie). Lead blameless post-mortems, extracting actionable insights and driving systemic improvements to prevent recurrence.
Reliability Metrics & Goals: Collaborate with engineering teams to define, implement, and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets. Use these metrics to drive continuous improvement and make data-driven decisions about reliability investments.
Infrastructure as Code: Leverage and contribute to our infrastructure as code (IaC) efforts, moving towards a fully automated environment using Terraform and GitHub Actions.
System Design & Architecture: Provide SRE expertise in system design reviews, influencing architectural decisions to build reliability, observability, and scalability into our services from the ground up.
Knowledge Sharing & Mentorship: Document processes, build runbooks, and share your expertise with both the SRE team and broader engineering organization. Help foster an SRE culture of shared ownership and continuous learning.
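
For candidates less familiar with the reliability metrics named above, the arithmetic behind an error budget can be sketched in a few lines of Python. This is an illustrative sketch only, not a description of how QAD or Datadog computes these values; the `SloWindow` class, its field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO evaluation window (hypothetical shape)."""

    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.999 for a 99.9% availability objective

    def sli(self) -> float:
        """Availability SLI: the fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0  # no traffic, so nothing has failed
        return 1 - self.failed_requests / self.total_requests

    def allowed_failures(self) -> float:
        """Error budget: how many failures the SLO tolerates in this window."""
        return self.total_requests * (1 - self.slo_target)

    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed = self.allowed_failures()
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1 - self.failed_requests / allowed


# Assumed sample data: one million requests, 250 failures, 99.9% target.
window = SloWindow(total_requests=1_000_000, failed_requests=250, slo_target=0.999)
print(f"SLI: {window.sli():.5f}")
print(f"Budget remaining: {window.budget_remaining():.0%}")
```

With these assumed numbers the SLO tolerates about 1,000 failed requests, so 250 failures leave roughly three quarters of the budget unspent. A negative `budget_remaining` would signal that the budget is exhausted, which is the kind of signal teams use when weighing feature work against reliability work.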

Qualifications

What You'll Bring

Core SRE Capabilities

Demonstrated experience operating and improving production systems at scale in an SRE, Production Engineering, or Platform Engineering role.
Proven ability to rapidly build accurate mental models of complex distributed systems across infrastructure, applications, networking, identity, and observability domains.
Strong troubleshooting skills with a methodical, evidence-driven approach to incident response and root cause analysis.
Experience defining, implementing, and using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to guide reliability decisions.
Excellent written and verbal communication skills, with the ability to explain complex technical issues clearly to both technical and non-technical audiences.

Technical Domains

Experience across several of the following areas:

Kubernetes platforms, including Amazon EKS, and service mesh technologies such as Istio.
Cloud infrastructure and services within AWS.
Identity and access management systems, including Auth0 and AWS IAM.
Networking fundamentals, including DNS, load balancing, routing, TLS, and connectivity troubleshooting.
GitOps workflows and infrastructure automation using tools such as Flux and Terraform.
Observability platforms and practices, including metrics, logs, traces, alerting, dashboards, and synthetic monitoring.
CI/CD systems and engineering workflows.
Application logging and distributed system debugging.

Engineering Mindset

A strong SRE:

Prioritizes service stability and customer impact during incidents.
Slows down under pressure, gathers facts, and communicates clearly.
Reduces operational complexity through automation and simplification.
Identifies and eliminates toil through self-service tooling and process improvement.
Demonstrates strong scripting and automation instincts.
Brings a systems-thinking approach to problem-solving.
Balances short-term remediation with long-term reliability improvements.

Software Engineering for Reliability

Demonstrated ability to build and maintain automation, tooling, and self-service capabilities using one or more programming or scripting languages such as Python, Go, or Bash.
Focuses on applying software engineering practices to improve reliability, reduce toil, and enhance developer productivity.

Behavioral Expectations

Calm and effective during high-severity incidents.
Skilled at managing complex situations involving multiple teams and competing priorities.
Able to lead blameless post-mortems and drive meaningful follow-up actions.
Passionate about continuous improvement and fostering a culture of shared ownership.

Bonus Points (Nice to Have):

Experience defining and working with SLOs, SLIs, and Error Budgets.
Familiarity with other observability tools or concepts beyond Datadog.
Experience with feature flagging platforms like LaunchDarkly.

Additional Information

QAD Inc. is a leading provider of adaptive, cloud-based enterprise software and services for global manufacturing companies. Global manufacturers face ever-increasing disruption caused by technology-driven innovation and changing consumer preferences. In order to survive and thrive, manufacturers must be able to innovate and change business models at unprecedented rates of speed. QAD calls these companies Adaptive Manufacturing Enterprises. QAD solutions help customers in the automotive, life sciences, packaging, consumer products, food and beverage, high tech and industrial manufacturing industries rapidly adapt to change and innovate for competitive advantage.

QAD is committed to ensuring that every employee feels they work in an environment that values their contributions, respects their unique perspectives and provides opportunities for growth regardless of background. QAD’s DEI program is driving higher levels of diversity, equity and inclusion so that employees can bring their whole self to work.

We are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, veteran status, disability, or any other federal, state, or local protected class.

About QAD and QAD Redzone:

QAD Redzone helps to enable QAD’s vision for the Adaptive Enterprise. Labor productivity improvements directly impact efficiency. Productive and empowered employees increase the effective capacity of your plant and accelerate time to productivity for new employees. This gives manufacturers the agility to increase production beyond what was previously possible without having to invest in production equipment or new plants, and to reduce the amount and impact of employee attrition. Empowered employees with a growth mindset take extreme ownership of challenges that impact their production goals, creating resilience in the face of disruption.

#LI-Remote

Skills Required

Operate and improve production systems at scale (SRE/Platform/Production Engineering)
Deep, hands-on expertise with Datadog (monitoring, alerting, dashboards, logging, tracing, synthetic tests)
Experience with AWS cloud infrastructure and services
Kubernetes platform experience, including Amazon EKS, and service mesh technologies such as Istio
Infrastructure as Code experience using Terraform and CI automation with GitHub Actions
GitOps workflows and tooling such as Flux
Define and implement SLIs, SLOs, and error budgets
Incident management and on-call experience, including post-mortems (OpsGenie)
Strong troubleshooting, root cause analysis, and distributed systems debugging skills
Software engineering for reliability: build automation and tooling using Python, Go, or Bash
Observability practices across metrics, logs, traces, alerting, dashboards, and synthetic monitoring
Familiarity with networking fundamentals: DNS, load balancing, routing, TLS, connectivity troubleshooting
Experience with CI/CD systems and engineering workflows
Experience with identity and access management systems such as Auth0 and AWS IAM
Experience with feature flagging platforms (e.g., LaunchDarkly)
Familiarity with other observability tools or concepts beyond Datadog

QAD Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about QAD and has not been reviewed or approved by QAD.

Leave & Time Off Breadth — Time off includes vacation, sick time, and paid parental leave, and is described positively across materials. Some U.S. roles reference “unlimited” PTO, indicating breadth beyond standard allocations.
Wellbeing & Lifestyle Benefits — A virtual-first model with a home-office setup allowance and a monthly work-from-home stipend supports flexible, remote work. These lifestyle perks strengthen work–life balance and overall benefits appeal.
Retirement Support — A 401(k) with company match is part of the core package and is regarded favorably. Retirement benefits are positioned as competitive within the offering.

The Company

HQ: Santa Barbara, CA

1,678 Employees

Year Founded: 1979

What We Do

QAD Inc. is a leading provider of next-generation manufacturing and supply chain solutions in the cloud. To succeed in a turbulent world, facing disruptions in supply and fluctuations in demand, manufacturers and supply chains must rapidly respond to change and seamlessly optimize agility, efficiency, and resilience for effective customer service. QAD delivers Adaptive Applications to enable these Adaptive Enterprises. Founded in Santa Barbara, California, QAD has customers in 84 countries around the world. Thousands of companies have deployed QAD enterprise solutions including enterprise resource planning (ERP), digital commerce (DC), supplier relationship management (SRM), digital supply chain planning (DSCP), global trade and transportation execution (GTTE), enterprise quality management system (EQMS), connected workforce and process intelligence.

To learn more, visit www.qad.com, call +1 (805) 566-6100 or email [email protected].

Follow us on Twitter: https://twitter.com/QAD_Community
Like our page on Facebook: https://www.facebook.com/QADerp
Follow us on Instagram: https://www.instagram.com/qad_community