Sr. SRE Platform Architect

Posted 8 Days Ago
San Jose, CA, USA
In-Office
Senior level
Software
The Role
Lead the architecture, design, and evolution of a global multi-region cloud SRE platform for GPU/AI compute. Author and maintain the platform architecture, enforce design invariants, review framework changes, run the plugin framework, decide tier placements, coordinate with cloud teams and Security, produce pre-flight designs, and shepherd implementations through engineering squads.
Summary Generated by Built In

About Bitdeer Technologies Group

Bitdeer is a world-leading technology company for AI and Bitcoin mining infrastructure.

Bitdeer is committed to providing comprehensive Bitcoin mining solutions for its customers and building AI computational infrastructure to support the AI revolution. Bitdeer handles complex processes involved in computing such as equipment procurement, transport logistics, data center design and construction, equipment management, and daily operations. Bitdeer also offers advanced cloud capabilities to customers with high demand for artificial intelligence.

Headquartered in Singapore, Bitdeer has deployed data centers across multiple countries, including the United States, Norway, Bhutan, and Ethiopia.

Position Overview

Bitdeer is seeking a visionary and hands-on Cloud SRE Architect to lead the design, development, and evolution of our next-generation public cloud platform. This role will oversee the end-to-end architecture across CPU, GPU, RDS, storage, networking, serverless, and AI services, ensuring global scalability, reliability, and performance. The ideal candidate is a strategic thinker with deep technical expertise in cloud infrastructure, platform engineering, and AI systems, capable of bridging architecture vision with real-world engineering execution. You will collaborate closely with cross-functional teams and global partners to define our cloud technology roadmap, optimize multi-region deployments, and deliver world-class infrastructure and platform solutions that power large-scale AI and enterprise workloads.

Key Responsibilities

Own the end-to-end architecture of the NeoCloud SRE platform — the substrate that observes, protects, and operates a multi-region GPU rental fleet across self-built and OEM-rented data centers. You are the single point of architectural accountability across the platform's ~57 bounded contexts, ~12 frameworks, and three operational tiers (Edge DC → Regional Controller → Global Hub).

This role is for someone who writes the design, defends it under review, and shepherds it through the engineering squads that build it.

What You'll Do

  1. Write and maintain the platform architecture document — keep the design coherent across all sections, frameworks, and tiers. The current document is your starting point.
  2. Review every framework-level change — new bounded context, new plugin kind, tier-deployment shift, schema change, naming change, cross-context contract change. Architecture changes ride GitOps PRs like any other artifact.
  3. Set design invariants — residency rules (raw data stays in Region), Tier 2 self-sufficiency budget (≥ 24 h), survival-uplink contracts, naming conventions, SLO catalogues, redaction-at-boundary rules.
  4. Run the plugin framework — every extension uses one uniform contract (Common + Domain manifest, lifecycle, observability). You author and evolve this contract.
  5. Decide tier placement — what runs at Edge DC vs Regional Controller vs Global Hub, with data-residency / compliance / availability tradeoffs explicit.
  6. Coordinate with cloud-service teams and tenants — they author plugins, SDKs, dashboards, agent recipes that ride the platform. You set the contracts they consume.
  7. Coordinate with Security — joint ownership of vulnerability management, exposure management, joint operations. Security owns policy and risk acceptance; you own the operational mechanisms they ride.
  8. Pre-flight roadmap items — for any new capability, produce a one-page design that fits the existing layered model (L1–L6), tier topology, naming conventions, and extension contracts before implementation starts.
  9. Defend the design under review — say no to scope creep, special-case workarounds, and one-off integrations that don't fit the framework model. Say yes when a new plugin kind is genuinely needed.

Qualifications

  • 10+ years of production SRE / platform-engineering / infra-architecture experience, including ≥ 3 years at architect level.
  •  Hands-on with GPU / AI-compute infrastructure — NVIDIA GPU ops (DCGM, MIG, vGPU, NVLink/NVSwitch, XID semantics, NCCL), InfiniBand or RoCE fabrics (subnet manager, fabric partitioning, optical health), HPC storage (Lustre, NetApp/Pure/DDN/VAST, NVMe-oF).
  • Multi-region observability at scale — metrics / logs / traces / profiles / analytics-lake substrate; recording rules, MWMBR burn-rate alerting, SLI/SLO discipline.
  • Cluster platforms — first-hand experience with Kubernetes (control plane + GPU Operator + topology-aware scheduling) AND at least one of Slurm / Volcano / Kueue / Ray / KubeRay.
  • Data-center operations — ZTP, BMC/IPMI/Redfish, BIOS/firmware lifecycle, RMA, multi-vendor OEM management (self-built + leased DC mix).
  • Strong DDD instincts — bounded contexts, public contracts, no shared databases, one-context-one-repo discipline.
  • Plugin framework design — you have built (or substantively contributed to) a real extension framework with a uniform manifest + lifecycle.
  • Writing fluency — you can author and maintain a multi-thousand-line architecture document under review without it drifting; you can also write a one-pager an executive will read.
  • Cross-team operating tempo — design reviews, runbook authorship, on-call shadowing, post-mortem facilitation
  • Hyperscale or NeoCloud experience
  • BS/MS in Computer Science or similar

--------------------------------------------------------------------

Bitdeer is committed to providing equal employment opportunities in accordance with country, state, and local laws. Bitdeer does not discriminate against employees or applicants based on conditions such as race, color, gender identity and/or expression, sexual orientation, marital and/or parental status, religion, political opinion, nationality, ethnic background or social origin, social status, disability, age, indigenous status, or union membership.

Skills Required

  • 10+ years production SRE / platform-engineering / infra-architecture experience, including ≥3 years at architect level
  • Hands-on NVIDIA GPU / AI-compute infrastructure (DCGM, MIG, vGPU, NVLink/NVSwitch, XID semantics, NCCL)
  • InfiniBand or RoCE fabric experience (subnet manager, fabric partitioning, optical health)
  • HPC storage experience (Lustre, NetApp, Pure, DDN, VAST, NVMe-oF)
  • Multi-region observability at scale (metrics, logs, traces, profiles, SLI/SLO discipline)
  • Cluster platform experience: Kubernetes (control plane + GPU Operator + topology-aware scheduling) and at least one of Slurm, Volcano, Kueue, Ray, KubeRay
  • Data-center operations experience (ZTP, BMC/IPMI/Redfish, BIOS/firmware lifecycle, RMA, multi-vendor OEM management)
  • Strong Domain-Driven Design instincts (bounded contexts, public contracts, one-context-one-repo)
  • Plugin framework design or substantial contribution (uniform manifest + lifecycle)
  • Writing fluency for long-form architecture docs and concise one-pagers
  • Cross-team operating tempo: design reviews, runbooks, on-call shadowing, post-mortems
  • Hyperscale or NeoCloud platform experience
  • BS/MS in Computer Science or similar

The Company
HQ: Singapore
214 Employees

What We Do

Bitdeer Technologies Group (Nasdaq: BTDR) is a leader in the blockchain and high-performance computing industry. It is one of the world’s largest holders of proprietary hash rate and suppliers of hash rate. Bitdeer is committed to providing comprehensive computing solutions for its customers. The company was founded by Jihan Wu, an early advocate and pioneer in cryptocurrency who cofounded multiple leading companies serving the blockchain economy. Mr. Wu leads the company as Founder, Chairman, and CEO. Linghui Kong serves as Bitdeer’s CBO and provides leadership through deep industry knowledge and technology expertise. Headquartered in Singapore, Bitdeer has deployed mining datacenters in the United States, Norway, and Bhutan. It offers specialized mining infrastructure, high-quality hash rate sharing products, and reliable hosting services to global users. The company also offers advanced cloud capabilities for customers with high demands for artificial intelligence. Dedication, authenticity, and trustworthiness are foundational to our mission of becoming the world’s most reliable provider of full-spectrum blockchain and high-performance computing solutions. We welcome global talent to join us in shaping the future.
