Senior Site Reliability Engineer - OPS00023

Posted 17 Days Ago
3 Locations
In-Office or Remote
Senior level
Information Technology • Software
The Role
The Senior Site Reliability Engineer will ensure GPU clusters and AI infrastructure are reliable, automated, and scalable, focusing on optimizing workloads and observability.

We are a US-based outsourcing software development company that has delivered exceptional software experiences to our clients since 2011, helping technology companies become industry leaders.

Over the past few years, we’ve hired specialists all over the world, while our main development centers were in Ukraine. Now we continue to expand and are growing our centers in other parts of the world. Dev.Pro is open to hiring specialists from other countries, as well as Ukrainians who now live outside Ukraine. We stand with Ukraine and keep supporting our people by offering a friendly remote environment while adhering to the values of democracy, human rights, and state sovereignty.

About this opportunity

We invite a skilled and experienced Senior Site Reliability Engineer to join our fully remote, international team. In this role, you’ll ensure our GPU clusters and supporting AI infrastructure are reliable, resilient, automated, and observable at scale. You’ll work with NVIDIA, Slurm, and Kubernetes to turn bare-metal GPU clusters into high-performance AI infrastructure.


What's in it for you:

• Join a fast-scaling company shaping the future of AI infrastructure in Europe

• Scale, optimize, and automate bare-metal GPU clusters for some of the most compute-intensive AI workloads

• Collaborate with a top-tier international team and grow through global AI and cloud events


Is that you?

• 5+ years as an SRE, DevOps, or HPC engineer in large-scale compute environments

• Expertise in HPC workload managers (Slurm, PBS Pro, LSF)

• Strong Python or Go skills for automation and observability

• Infrastructure-as-code experience (Terraform, Ansible, Helm)

• Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)

• GPU resource management knowledge (MIG, NCCL, CUDA, containers)

• Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)

• Linux systems engineering, CI/CD, and configuration management skills

• Strategic thinking with strong technical and business communication

• Organization, autonomy, adaptability

• Advanced English level


Desirable:

• Exposure to BlueField DPU, NVSwitch, or Slurm-on-Kubernetes hybrid orchestration


Key responsibilities and your contribution

In this role, you’ll apply your expertise to ensure our GPU clusters and AI infrastructure run reliably, efficiently, and at scale.


• Automate deployment, scaling, and lifecycle management of GPU clusters

• Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity

• Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers

• Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation

• Collaborate with teams to optimize performance, resources, and fault recovery at petascale

Top Skills

Ansible
CUDA
DDN
Go
GPFS
Helm
Kubernetes
Lustre
MIG
NCCL
NVIDIA
Python
Slurm
Terraform
VAST
WEKA

The Company
Charlotte, North Carolina
848 Employees
Year Founded: 2011

What We Do

Dev.Pro helps innovative technology companies scale their business by leveraging our software engineering expertise to support them every step of the way.

Dev.Pro was founded by entrepreneurs and technologists to help technology-driven companies develop innovative software products and grow their businesses.

We started as an American company in 2011 and now have offices in several locations. Some of our development centers are located in Ukraine, and we support Ukrainian specialists by providing them with career opportunities around the world. Over the past few years, we have also been hiring specialists from many different countries, and we continue to do so as the company expands and globalizes.

True to our roots, we remain creative and nimble, tailoring our engagement with clients to meet their specific needs. Some come to us for our engineering expertise, some for rapid delivery, and some for cost efficiency. But what truly sets us apart is the alliance we forge with our clients over time, aligning our success with theirs.
