Senior/Staff Infrastructure Engineer

Posted 13 Days Ago
Hiring Remotely in Turkey
Remote
Senior level
Cloud • Digital Media • Information Technology
Generative media platform for developers.
The Role
The Senior/Staff Infrastructure Engineer designs and maintains tools for managing GPU servers, automating health checks, and optimizing system performance while driving resolutions for technical issues.
Summary Generated by Built In

You are a hands-on engineer who builds the software and processes that keep a large fleet of GPU servers healthy and productive. You write systems and tooling for managing thousands of servers, covering provisioning, health monitoring, error detection, and recovery. When something breaks that automation can’t fix, you drive resolution with partners.

Key responsibilities
  • Build and maintain a Python fleet-tracking system that manages the full server lifecycle, including contracting and procurement, target use, pricing, availability, health, and RMAs
  • Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery, and alerting
  • Create and maintain metrics, dashboards, and alerting for hardware health across the fleet (GPU errors, disk failures, network issues, thermals)
  • Use AI extensively to build tools and automate alerting and recovery
  • Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation
  • Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage
  • Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes)
  • Develop a suite of automated error detection and recovery processes
  • Work with partners to solve technical issues
Requirements
  • 5+ years of experience managing bare-metal and VM server fleets at scale (100+ nodes)
  • Strong software engineering skills in Python; you write production tooling, not scripts
  • Deep Linux systems knowledge: boot process, kernel tuning, networking, storage, systemd, cgroups, namespaces, performance profiling
  • Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init
  • Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning
  • Familiarity with hardware diagnostics and failure modes (GPUs, NVMe, NICs, memory)
  • Experience building internal tools or dashboards for infrastructure visibility
  • Excellent communication and ability to drive technical decisions across teams
  • Self-starter who executes quickly, takes ownership, and constantly seeks improvement
Nice to have
  • Familiarity with network configuration and diagnostics (VLAN, VXLAN, ECMP, BGP, tcpdump)
  • Experience with NVIDIA GPU infrastructure: driver management, health monitoring, DCGM, NVLink/NVSwitch diagnostics, RDMA, InfiniBand/RoCEv2
  • Experience with AMD GPUs
  • Experience with bare-metal and VM provisioning (PXE/iPXE, Kickstart, libvirt, QEMU/KVM)
  • Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001)
Location
  • Turkey

What we offer at fal
  • Interesting and challenging work
  • A lot of learning and growth opportunities
  • Regular team events and offsites

Top Skills

Ansible
CUDA
Docker
GPFS
Lustre
NFS
NVMe
Python
Terraform

The Company
73 Employees

What We Do

Generative Media Cloud
