Observability Engineer

Las Vegas, NV, USA
In-Office
Mid level
Artificial Intelligence • Cloud • Software
The Role
The Observability Engineer will own the observability stack, manage dashboards and alerts, and partner with teams to improve system metrics, ensuring reliable AI infrastructure.
Summary Generated by Built In

Our mission at TensorWave Cloud is to build seamless, secure, reliable, and resilient AI infrastructure at scale, eliminating barriers and challenging the status quo to empower builders and support AI innovation.

About the role

We are looking for an Observability Engineer who is deeply obsessed with Grafana, Prometheus, and modern observability practices. This role exists to ensure our systems are measurable, understandable, and debuggable at all times.

You will own the observability stack end-to-end — from instrumentation standards to dashboards, alerts, and signal quality — and work closely with infrastructure, platform, and application teams to make sure nothing important fails silently.

If you think about metrics before features, believe bad alerts are worse than no alerts, and treat Grafana dashboards as first-class products, this role is for you.

Responsibilities

  • Own and evolve our observability and monitoring platform, with Grafana and Prometheus at its core

  • Design, build, and maintain high-quality metrics pipelines using Prometheus and related tooling

  • Create clear, actionable Grafana dashboards that tell a story — not just charts

  • Define and maintain alerts that are meaningful, actionable, and low-noise

  • Establish and enforce observability standards across services (metrics, logs, traces)

  • Partner with engineering teams to instrument applications correctly

  • Lead improvements to alerting strategies, SLOs, and SLIs

  • Support incident response by helping teams quickly understand what broke and why

  • Continuously evaluate and improve signal quality, cardinality, and cost

  • Identify observability gaps and eliminate blind spots before they become outages

You Are Obsessed With:

  • Grafana dashboards that instantly explain system health

  • Prometheus metrics that are intentionally designed, not accidental

  • Alerts that wake people up only when action is required

  • Monitoring that scales with system complexity

  • Observability as a product, not an afterthought

Required Experience

  • Strong hands-on experience with Grafana and Prometheus

  • Deep understanding of metrics-based observability

  • Experience designing monitoring and alerting systems at scale

  • Strong knowledge of alerting best practices (burn rates, SLO-based alerts, noise reduction)

  • Experience working with distributed systems and cloud or Kubernetes environments

  • Ability to reason about system behavior using telemetry

  • Comfortable working across teams to improve instrumentation and visibility

Preferred Experience

  • Experience with OpenTelemetry

  • Familiarity with logs and traces (Loki, Tempo, Jaeger, etc.)

  • Kubernetes observability experience

  • Experience operating observability systems in high-scale or production-critical environments

  • Infrastructure-as-Code experience (Terraform, Helm, etc.)

What We Bring

  • Mission-driven company

  • Competitive Salary

  • Stock Options

  • 100% paid Medical, Dental, and Vision insurance

  • Life and Voluntary Supplemental Insurance

  • Short Term Disability Insurance

  • Flexible Spending Account

  • 401(k)

  • Flexible PTO

  • Paid Holidays

  • Parental Leave

  • Mental Health Benefits through Spring Health

We’re looking for resilient, adaptable people to join our team: people who believe in the mission and think at massive scale. The solutions that worked on a handful of devices will not work at exascale. Be prepared to be pushed daily, to learn a lot, and to literally build the future.

TensorWave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, national origin, or veteran status.

The Company
HQ: Las Vegas, Nevada
56 Employees

What We Do

TensorWave is a cutting-edge cloud platform designed specifically for AI workloads. Offering AMD MI300X accelerators and a best-in-class inference engine, TensorWave is a top choice for training, fine-tuning, and inference. Visit tensorwave.com to learn more. Send us a message to try it for free.
