Bitdeer Group

Sr. SRE Platform Software Engineer

San Jose, CA, USA

In-Office

Senior level

Software

The Role

Lead the design and implementation of a global public cloud SRE platform for AI and compute workloads. Own architecture and production engineering for observability, cluster health, remediation, lifecycle, secrets, CI/CD, backup/DR, and automation. Collaborate with cross-functional teams to build scalable, reliable multi-region services and run them in production, including on-call duty.

About Bitdeer Technologies Group

Bitdeer is a world-leading technology company for AI and Bitcoin mining infrastructure.

Bitdeer is committed to providing comprehensive Bitcoin mining solutions for its customers and building AI computational infrastructure to support the AI revolution. Bitdeer handles the complex processes involved in computing, such as equipment procurement, transport logistics, data center design and construction, equipment management, and daily operations. Bitdeer also offers advanced cloud capabilities to customers with high demand for artificial intelligence.

Headquartered in Singapore, Bitdeer has deployed data centers across multiple countries, including the United States, Norway, Bhutan, and Ethiopia.
To learn more, visit https://ir.bitdeer.com/

Position Overview

Build and operate one or more bounded contexts of the NeoCloud SRE platform, the multi-region substrate that observes, protects, and operates a GPU rental fleet across self-built and OEM-rented data centers. You take an architect-approved design and turn it into production code that ships through GitOps and the CI/CD release pipeline, follows the Plugin Framework conventions, meets declared SLOs, and stays drift-free.

This is a build-and-run role. You don't only ship code; you ship a service that other squads, cloud-service teams, and tenants depend on. You carry the on-call pager for what you build.

Key Responsibilities
You will own 1-2 of these:

Collection & Storage: collection-agent, customer-sdk-gateway, metrics-store, logs-store, traces-store, profiles-store, analytics-lake, enrichment-service, collection-monitor.
Alert, Correlation & SLO: alert-engine-framework, alert-correlation, slo-framework, default M-series alert rules.
Topology, Cluster-Health & Cluster Platform Services: topology-service, cluster-health-rollup, OSS-SRE-tool collection plugins for K8s, Slurm, Ray, Volcano, Kueue, and KubeRay.
Fault-Prediction: prediction-engine-framework and built-in predictors (GPU, Link, Disk, XPA, Straggler, SDC, Stranded GPU).
Remediation, Workflow, Inspection & Jobs: remediation-actuator, orchestration-substrate (workflow engine), inspection-orchestrator, job-scheduler, NCCL-baseline inspection probe.
Hardware Lifecycle & DC Ops: hardware-lifecycle, dc-operations, boot-provisioning, rolling-upgrade, bare-metal-bmc-service, auto-discovery, ZTP D0–D5 pipeline, IPMI bare-metal management.
Identity, Secrets, Tenant-Config & CMDB: iam-service, secrets-service, tenant-sre-config, cmdb-cache, schema registry.
Customer-Bridge, Ticketing & SRE Platform Portal: customer-bridge, customer-ticketing, sre-operation-system, Customer Console BFF, SRE Console BFF.
Backup, DR & Meta-Monitor: backup-orchestrator, meta-monitor, external-watcher integration (Datadog or equivalent).
CI/CD, GitOps, Plugin Framework & SRE Image Registry: cicd-pipeline, gitops-sync, plugin-registry, sre-image-registry.
Self-Improving Agent: agent-control-plane, agent-discovery, agent-codegen, agent-sandbox, per-Region LLM gateway.
Global SRE Management: maintenance-window-orchestrator, change-management, capacity-planner, cost-optimizer, gpu-efficiency-dashboard, network-stability-dashboard, patching-orchestrator, artifact-management, compat-matrix-service, security-platform.

Qualifications

Software Engineering Experience: 7+ years of production software engineering experience, including 2 or more years operating what you built (real on-call experience, not just shipping code).
Programming Languages: Production-depth mastery of at least one systems-grade language: Go (preferred), Rust, or Java. Proficiency in Python for tooling and SDK work.
Distributed Systems Fundamentals: Strong grasp of at-least-once vs. exactly-once trade-offs, idempotency, back-pressure, leader election, consistent hashing, gossip, and fan-out. Ability to evaluate CRDT vs. Raft vs. Paxos and select the right tool for the job.
Multi-Region Observability Stack: Experience at production scale with Prometheus, VictoriaMetrics, Mimir, Thanos, Loki, Elasticsearch, Tempo, Jaeger, or OpenTelemetry. Must have built or substantively contributed to the ingest, query, or storage paths of these systems.
GitOps & CI/CD: Hands-on experience with Argo, Flux, Helm, Kustomize, Cosign signing, signed-bundle promotion, and blast-radius-aware rollouts.
Kubernetes Operator Pattern: Proven experience writing a controller or CRD handling real production traffic, with a deep understanding of watch-cache mechanics, leader election, and reconcile loops.
mTLS & Secrets Management: Experience executing end-to-end mTLS bootstrap with certificate rotation. Hands-on experience with HashiCorp Vault or cloud KMS (AWS KMS / GCP KMS).
SQL & Time-Series Data: Ability to read a Prometheus query plan, build a recording-rule strategy, and write SQL that joins per-tenant telemetry against analytics-lake tables.
Testing Discipline: Rigorous approach to unit, integration, contract, chaos, and soak testing. Experience writing and maintaining your own comprehensive tests.
Technical Writing Fluency: Ability to author clear design docs that align with existing platform architecture, create runbooks optimized for 3 AM on-call responses, and write intent-driven PR descriptions.

Preferred Qualifications (GPU / AI-Infra Context)
Experience in at least one of the following areas is a strong plus:

NVIDIA Internals: Deep understanding of DCGM and NVIDIA driver internals, including XID semantics and MIG / vGPU partitioning.
Networking & Fabrics: Experience with InfiniBand or RoCE fabrics, including subnet managers, partitioning, optical health, and NCCL collective tracing.
HPC Storage: Experience managing Lustre, NetApp, Pure, DDN, VAST, or NVMe-oF under multi-tenant loads.
Hardware Management: Hands-on experience with BMC, IPMI, and Redfish at OEM scale (Supermicro, Dell, HPE, Lenovo).
Cluster Platform Internals: Familiarity with Kubernetes GPU Operator, Slurm controller, or Ray GCS.
BS/MS in Computer Science or similar
Hyperscale or NeoCloud experience

--------------------------------------------------------------------

Bitdeer is committed to providing equal employment opportunities in accordance with country, state, and local laws. Bitdeer does not discriminate against employees or applicants based on conditions such as race, color, gender identity and/or expression, sexual orientation, marital and/or parental status, religion, political opinion, nationality, ethnic background or social origin, social status, disability, age, indigenous status, and union membership.

Skills Required

7+ years production software engineering experience with 2+ years operating what you built (on-call)
Production-depth mastery of at least one systems language (Go preferred; Rust or Java acceptable)
Proficiency in Python for tooling and SDK work
Strong distributed systems fundamentals (idempotency, leader election, consistent hashing, CRDT vs Raft vs Paxos knowledge)
Production experience with observability/ingest/query/storage paths for systems like Prometheus, VictoriaMetrics, Thanos, Loki, Elasticsearch, Tempo, Jaeger, or OpenTelemetry
Hands-on GitOps and CI/CD with Argo or Flux, Helm, Kustomize, and Cosign (signed bundles, rollout strategies)
Proven experience writing a Kubernetes controller/CRD handling production traffic (operator pattern, watch-cache, reconcile loops)
Experience executing end-to-end mTLS bootstrap with certificate rotation and secrets management (HashiCorp Vault or cloud KMS such as AWS KMS/GCP KMS)
Ability to read Prometheus query plans, design recording-rule strategies, and write SQL joining telemetry to analytics tables
Rigorous testing discipline across unit, integration, contract, chaos, and soak tests
Strong technical writing: design docs, on-call runbooks, intent-driven PRs
Experience contributing to or owning CI/CD, GitOps, and image/plugin registries in production
Experience with hardware lifecycle, bare-metal provisioning, BMC/IPMI/Redfish at scale
Familiarity with GPU/AI infra internals (NCCL, DCGM, NVIDIA driver XID, MIG/vGPU)
Experience with HPC fabrics (InfiniBand, RoCE) and storage under multi-tenant load (Lustre, NetApp, Pure, DDN, VAST, NVMe-oF)
Familiarity with Slurm, Ray, Kubernetes GPU Operator, or other cluster controllers
BS/MS in Computer Science or similar
Hyperscale or NeoCloud experience

The Company

HQ: Singapore

214 Employees

What We Do

Bitdeer Technologies Group (Nasdaq: BTDR) is a leader in the blockchain and high-performance computing industry. It is one of the world’s largest holders of proprietary hash rate and suppliers of hash rate. Bitdeer is committed to providing comprehensive computing solutions for its customers. The company was founded by Jihan Wu, an early advocate and pioneer in cryptocurrency who cofounded multiple leading companies serving the blockchain economy. Mr. Wu leads the company as Founder, Chairman, and CEO. Linghui Kong serves as Bitdeer’s CBO and provides leadership through deep industry knowledge and technology expertise. Headquartered in Singapore, Bitdeer has deployed mining datacenters in the United States, Norway, and Bhutan. It offers specialized mining infrastructure, high-quality hash rate sharing products, and reliable hosting services to global users. The company also offers advanced cloud capabilities for customers with high demands for artificial intelligence. Dedication, authenticity, and trustworthiness are foundational to our mission of becoming the world’s most reliable provider of full-spectrum blockchain and high-performance computing solutions. We welcome global talent to join us in shaping the future.