Lead Site Reliability Engineer

Posted 10 Days Ago
Washington, DC
In-Office
Mid level
Aerospace • Defense • Manufacturing
Accelerating defense innovation.
The Role
As Lead Site Reliability Engineer, you'll ensure reliability and performance of AI infrastructure, manage deployments, and mentor junior engineers.
Summary Generated by Built In

About Bridge Defense

Bridge Defense is redefining how modern defense technology is delivered. Based in Washington, D.C., we are built for the dynamic mission environment facing the Department of Defense, the Intelligence Community, and federal law enforcement agencies. We provide full-spectrum national security solutions that combine secure infrastructure, cleared talent, and mission-ready software to meet evolving defense challenges. Our services include secure software development in classified environments and the design and implementation of advanced IT and cybersecurity capabilities ranging from secure cloud architectures and enterprise infrastructure to data center operations, scientific analysis, and cutting-edge cyber defense.


We are led by technologists and veterans with firsthand mission experience, which enables us to understand both the operational realities and the innovation needed to succeed. Our approach is agile and outcome-based, delivering results in weeks rather than months whenever possible.


At Bridge Defense we value people, integrity, and excellence. We foster an environment where innovation thrives in support of traditional mission requirements. Our team members receive competitive compensation, robust benefits, professional development and certification opportunities, and clear paths for growth while working on the nation’s most critical projects.


Core Values:

  • Innovation & Responsiveness: We push beyond legacy models with efficient, tech-led solutions built to scale and evolve.
  • Trusted Performance: Security, compliance, and deep experience in delivering to demanding environments guide all we do.
  • Mission-Focused Expertise: From veteran leadership to cleared engineers, our people understand both the technology and the mission.

About the Role 

As the Lead Site Reliability Engineer for our ComputeBridge Engagement, you’ll be responsible for the reliability, scalability, and performance of one of the largest hardware and AI infrastructure efforts in the U.S. defense sector. You will lead the deployment, management, and automation of a high-performance computing mesh across multiple secure environments, ensuring operational excellence and mission continuity for a 9-figure government program. 

 
This is a hands-on engineering leadership role that bridges physical infrastructure and modern DevOps automation, ideal for someone who thrives at the intersection of hardware systems, distributed computing, and AI/ML workflows. 

 

What You’ll Do 

  • Lead infrastructure design, deployment, and operations for ComputeBridge hardware clusters across secure and distributed environments 
  • Install and configure physical systems, including high-density GPU servers, networking gear, and storage arrays 
  • Build and deploy secure Linux images and containerized workloads using OpenShift and other orchestration platforms 
  • Develop and manage automation pipelines for provisioning, configuration management, and monitoring using modern DevOps toolchains (Ansible, Terraform, etc.) 
  • Operate and maintain distributed networking meshes across multiple classified and unclassified domains 
  • Implement and manage out-of-band management tools (IPMI, iDRAC, BMC, etc.) for remote troubleshooting and control
  • Integrate and optimize NVIDIA GPU infrastructure for AI/ML training and inference workloads 
  • Collaborate with mission engineers, software teams, and government operators to ensure system readiness and performance 
  • Provide on-site technical leadership for deployments, troubleshooting, and continuous improvement 
  • Mentor junior engineers and establish operational best practices across the ComputeBridge program as the contract grows 

 

What You’ll Bring 

  • 3+ years of experience in site reliability, systems engineering, or hardware operations roles 
  • Deep expertise with physical infrastructure: server racking, cabling, diagnostics, and troubleshooting 
  • Strong experience with Linux systems administration, imaging, and automated deployment 
  • Hands-on experience managing large-scale clusters or distributed systems in OpenShift or Kubernetes environments 
  • Familiarity with DevOps automation (Ansible, Terraform, CI/CD pipelines) 
  • Experience configuring and managing networking and mesh architectures 
  • Direct experience with NVIDIA GPUs, CUDA, and related AI/ML frameworks 
  • Proficiency with out-of-band management and IPMI/iDRAC tooling
  • Certifications: Linux+ and Security+ (required or in-progress) 
  • Excellent communication, documentation, and problem-solving skills 
  • Clearance: Active TS/SCI clearance, or the ability to obtain one, is required

 

Bonus Points For 

  • Experience operating in secure DoD or intelligence environments 
  • Familiarity with Palantir platforms or other government data systems 
  • Prior experience supporting AI/ML infrastructure in production or tactical settings 
  • Experience with performance tuning and monitoring of HPC or GPU-accelerated clusters 

General Factors:

  • Depending on project requirements, may be required to work within a compressed schedule; overtime should be expected when schedules demand it.
  • Willing to travel, if needed.
  • No Relocation.


Why Bridge Defense 

  • Shape how advanced computing supports national security missions at scale 
  • Lead engineering for a major government program with direct mission impact 
  • Competitive compensation, benefits, and growth opportunities in a mission-driven environment 

 

Bridge Defense is committed to building a collaborative and mission-focused team. Bridge Defense reserves the right to modify job duties or requirements at any time. Employment with Bridge Defense is at-will. Candidates must be eligible to work in the United States and complete any required background checks or security clearance processes as a condition of employment. 

Top Skills

Ansible
BMC
CI/CD
CUDA
iDRAC
IPMI
Kubernetes
Linux
NVIDIA GPUs
OpenShift
Terraform

The Company
HQ: Washington, DC
14 Employees

What We Do

Today's dynamic mission environment demands a novel approach to contracting. We support complex and demanding missions with technology, comprehensive integration, cleared talent, and innovative delivery.
