Barbaricum

Senior Reliability Engineer

Reposted 21 Days Ago

Washington, DC, USA

In-Office

Senior level

Other

The Role

Ensure reliability, availability, and performance of on-premises, cloud, and hybrid systems. Implement monitoring, automated alerting, incident response, automation, chaos engineering, and rollback/recovery. Build dashboards, runbooks, and scripts; analyze capacity and performance; respond to outages and conduct post-incident reviews. Collaborate with development, cloud, and cybersecurity teams and implement security best practices across operations.

Summary Generated by Built In

Barbaricum is a rapidly growing government contractor providing leading-edge support to federal customers, with a particular focus on Defense and National Security mission sets. We leverage more than 17 years of support to stakeholders across the federal government, with established and growing capabilities across Intelligence, Analytics, Engineering, Mission Support, and Communications disciplines. Founded in 2008, our mission is to transform the way our customers approach constantly changing and complex problem sets by bringing to bear the latest in technology and the highest caliber of talent.

Headquartered in Washington, DC's historic Dupont Circle neighborhood, Barbaricum also has a corporate presence in Tampa, FL, Bedford, IN, and Dayton, OH, with team members across the United States and around the world. As a leader in our space, we partner with firms in the private sector, academic institutions, and industry associations with a goal of continually building our expertise and capabilities for the benefit of our employees and the customers we support. Through all of this, we have built a vibrant corporate culture diverse in expertise and perspectives with a focus on collaboration and innovation. Our teams are at the frontier of the Nation's most complex and rewarding challenges. Join our team.

Barbaricum is seeking an experienced Senior Site Reliability Engineer to support the reliability, availability, automation, and operational performance of IT and cloud systems under the Military Community and Family Policy (MC&FP) Outreach and Digital Enterprise Services (MODES) contract. You will help ensure MC&FP systems are reliable, scalable, resilient, and efficiently managed through proactive monitoring, automated incident response, performance optimization, and operational dashboards that support rapid decision-making.

Responsibilities:

Monitor and maintain system reliability, availability, and performance across on-premises, cloud, and hybrid IT environments supporting MC&FP mission requirements.
Implement proactive performance monitoring, automated alerting, incident response workflows, and resilience engineering practices to reduce downtime and improve operational visibility.
Develop, maintain, and improve scalable automated infrastructure solutions that support reliable system operations and repeatable service delivery.
Implement rollback strategies, recovery approaches, and chaos engineering practices to validate resilience, reduce operational risk, and improve system stability.
Analyze usage patterns, capacity trends, and performance indicators to support dynamic scaling, resource optimization, and system improvement decisions.
Develop and maintain real-time operational dashboards, reports, and metrics that enable rapid decision-making, leadership awareness, and system optimization.
Respond to and resolve system outages, impairments, and service disruptions while coordinating with technical teams to minimize mission impact.
Conduct post-incident reviews to identify root causes, document lessons learned, and implement preventative measures that reduce recurrence.
Collaborate with software developers, cloud engineers, cybersecurity personnel, and operations teams to improve services, reliability patterns, deployment practices, and operational standards.
Create and maintain system documentation, configuration standards, operational runbooks, monitoring procedures, and service reliability guidance.
Automate common operations tasks to reduce manual workloads, improve consistency, and increase system efficiency.
Implement security best practices across operational activities, infrastructure automation, monitoring, incident response, and system administration functions.

Required Skills:

Expert knowledge of site reliability engineering practices, system monitoring, incident management, automation, performance tuning, and operational resilience.
Strong understanding of Windows and Linux administration, infrastructure operations, system configuration, service management, and troubleshooting practices.
Experience with automation platforms and configuration management tools such as Ansible, Puppet, Chef, or similar technologies.
Proficiency with scripting languages such as Python, Shell, PowerShell, or similar tools used to automate operational and infrastructure tasks.
Knowledge of cloud services and infrastructure across AWS, Microsoft Azure, Google Cloud, or comparable cloud environments.
Strong understanding of network troubleshooting, configuration, connectivity analysis, system dependencies, and performance bottleneck identification.
Ability to design, interpret, and maintain dashboards, alerts, metrics, logs, and operational reporting that support service health and decision-making.
Ability to conduct root cause analysis, post-incident reviews, and corrective action planning in complex technical environments.
Strong problem-solving skills and the ability to work under pressure during outages, impairments, and time-sensitive operational issues.
Excellent written and verbal communication skills, with the ability to explain technical findings, incident impacts, and reliability recommendations to technical and non-technical stakeholders.

Required Qualifications:

Bachelor's degree in Computer Science, Information Technology, Systems Engineering, Cybersecurity, or a related field; Master's degree preferred.
Certifications related to cloud computing, system administration, site reliability engineering, DevSecOps, or automation are beneficial.
10+ years of experience in site reliability engineering, systems administration, infrastructure operations, cloud operations, DevSecOps, or a similar technical role, particularly in a government, federal, defense, or secure IT setting.
Demonstrated experience maintaining reliable, scalable, and efficiently managed IT systems across on-premises, cloud, or hybrid environments.
Experience developing automated infrastructure, operational scripts, monitoring solutions, dashboards, runbooks, and configuration standards.
Experience supporting incident response, system outage resolution, post-incident reviews, root cause analysis, and operational improvement initiatives.
Experience collaborating with development, infrastructure, cloud, cybersecurity, and program teams to improve reliability, security, and service performance.
DoD Secret Security Clearance.

EEO Commitment

All qualified applicants will receive consideration for employment without regard to sex, race, ethnicity, age, national origin, citizenship, religion, physical or mental disability, medical condition, genetic information, pregnancy, family structure, marital status, ancestry, domestic partner status, sexual orientation, gender identity or expression, veteran or military status, or any other basis prohibited by law.

The Company

HQ: Washington, DC

170 Employees

Year Founded: 2008

What We Do

Headquartered in Washington, D.C., Barbaricum is a Service-Disabled Veteran-Owned small business. At our core, you’ll find people who love to explore and innovate. Our team has a uniquely complementary skill set. Together we’ve built a hands-on, all-inclusive contracting firm that develops innovative strategies and uses the best of emerging technologies to support our clients’ long-term goals. Our growth has been fueled by repeat business and long-term partnerships with key clients. We are an ISO 9001:2015-certified and CMMI Level 3-appraised company that supports a host of government clients with Integrated Communications, Mission Support, Research and Analysis, Cyber Security/Intelligence, and Technology-Enabled Services. Our mission is to transform U.S. Government approaches to problem sets of increasing complexity by delivering innovative solutions, especially in support of National Security missions. Barbaricum is one of the fastest-growing companies in our market. The company is routinely recognized by institutions like Inc. Magazine, GovCon, AMEC, PRSA, and SmartCEO for corporate growth, capabilities, and award-winning client work. Our team is dynamic and agile, providing global support to current missions across five continents. We are also focused on developing and maintaining our vibrant corporate culture, having most recently been named a Best Workplace for 2017 by Inc. Magazine.