Infrastructure Engineer (Observability)

Reposted 25 Days Ago
2 Locations
In-Office or Remote
140K-180K Annually
Senior level
Artificial Intelligence • Cloud • Hardware • Machine Learning • Other • Software • Infrastructure as a Service (IaaS)
We build infrastructure for machine learning
The Role
The Infrastructure Engineer focuses on observability, designing platforms for metrics and alerting, creating dashboards, deploying telemetry, and collaborating across teams to enhance reliability and transparency.
Summary Generated by Built In

Voltage Park is seeking an Infrastructure Engineer with a focus on Observability to join our Infrastructure Engineering team. Our engineers design and operate the systems that manage thousands of bare-metal servers, GPUs, and high-performance networks across multiple data centers.

This role combines the breadth of a core infrastructure engineer with a specialty in observability and telemetry. You’ll design and operate metrics, logs, traces, and alerting pipelines that provide actionable insights for both internal teams and external customers, helping to ensure reliability and transparency at scale.

This is a fully remote position, although candidates must be based in the continental United States. Unfortunately, we are unable to provide sponsorship for this role.

Responsibilities
  • Design, build, and maintain observability platforms spanning metrics, logs, traces, and events.

  • Create dashboards and alerts for internal stakeholders (InfraOps, Engineering, Customer Success) and provide scoped visibility for external customers.

  • Ingest and correlate telemetry from GPUs, CPUs, networking (Ethernet & InfiniBand), containers, APIs, and BMC/Redfish.

  • Implement noise-resistant alerting pipelines that improve detection and reduce operational load.

  • Collaborate with infrastructure, platform, and customer-facing teams to embed observability into workflows.

  • Contribute to broader infrastructure engineering projects beyond observability.

Qualifications
  • 8+ years in infrastructure engineering, SRE, or observability roles.

  • Strong experience with monitoring systems (Prometheus, Grafana, ELK, VictoriaMetrics, or similar).

  • Proficiency in Python, Go, or Bash for automation and data integration.

  • Familiarity with container/Kubernetes observability.

  • Understanding of streaming telemetry pipelines (Kafka, OTEL, Promtail, or equivalent).

  • Strong written and verbal communication skills.

Ideal Experiences
  • Experience with GPU observability, particularly NVIDIA DCGM.

  • Designing multi-tenant observability solutions with RBAC and scoped queries.

  • Prior work with correlation engines for RCA, forecasting, or predictive alerting.

  • Broader exposure to infrastructure domains (networking, storage, provisioning).

Culture
  • You enjoy working with a small, highly motivated team.

  • You’re comfortable balancing autonomy with company-wide priorities.

  • You value clarity, documentation, and actionable insights in observability systems.

  • You’re excited to specialize in observability while contributing as a core infrastructure engineer.

Voltage Park is an equal opportunity employer and makes employment decisions on the basis of merit. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected under federal, state, or local law. If you require an accommodation during the job application process, please notify your recruiter.

Compensation Range: $140K - $180K


Top Skills

Bash
BMC/Redfish
ELK
Go
Grafana
Kafka
Kubernetes
OTel
Prometheus
Promtail
Python
VictoriaMetrics

The Company
HQ: San Francisco, CA
51 Employees
Year Founded: 2023

What We Do

The market for cutting-edge ML compute is broken. Startups, researchers and even big AI labs are scrambling to buy or rent access to the latest chips for ML training. But demand far outstrips supply, and what’s available is only accessible to the well-resourced, placing an artificial damper on innovation.

To solve this challenge, we've launched Voltage Park, and we’re on a mission to make machine learning infrastructure accessible to all, from large enterprises and research universities to seed-stage startups and nonprofits.

With around 24,000 NVIDIA H100 GPUs, the Voltage Park cloud is one of the most powerful collections of cutting-edge ML compute in the world. Our clusters consist of 80GB H100 SXM5 GPUs fully interconnected with 3.2T InfiniBand.

Why Work With Us

You’ll play a pivotal role as a member of the founding team that will change the face of machine learning infrastructure. As an early hire, you’ll have outsize influence in defining the company’s culture and ensuring mission success.

Voltage Park Offices

Hybrid Workspace

Employees engage in a combination of remote and on-site work.

Typical time on-site: Flexible
HQ: San Francisco, CA
