Senior SRE Engineer (Observability Focus)

3 Locations
In-Office or Remote
Senior level
Software
The Role
Design, deploy, and operate end-to-end observability at scale across hybrid AWS and on-prem environments. Own metrics (VictoriaMetrics/Prometheus), logs (OpenSearch), and traces (OpenTelemetry); run telemetry pipelines (OTEL Collector -> Kafka -> backends), manage Kafka and log shipping, build Grafana dashboards and alerts, tune sampling/batching, contribute to incident response and post-mortems, and mentor engineering teams on observability practices and structured logging.

We are a leading trading platform that is ambitiously expanding to the four corners of the globe. Our top-rated products have won prestigious industry awards for their cutting-edge technology and seamless client experience. We deliver only the best, so we are always in search of the best people to join our ever-growing talented team.

We're building out our observability practice and need a senior engineer who can own it end to end. This is a hands-on role. You'll design and operate the telemetry stack that gives our engineering teams real visibility into production — across a hybrid AWS and on-premise environment, at scale.

Responsibilities:

  • Own the full observability stack: metrics (VictoriaMetrics), logs (OpenSearch), and traces (OpenTelemetry) — from pipeline design to day-2 operations.
  • Architect and run VictoriaMetrics cluster topology (vmstorage/vminsert/vmselect), including vmagent scraping, remote write configuration, vmalert rules, and cardinality control.
  • Operate OpenSearch clusters: index lifecycle management (ISM), hot-warm-cold architecture, shard tuning, and ingest pipelines via Data Prepper.
  • Build and maintain OTEL Collector pipelines — receivers, processors, exporters — and instrument services across Java, Python, and JS/TS stacks (auto and manual).
  • Run Kafka as the telemetry transport layer (OTEL Collector → Kafka → backends), including topic design, partition strategy, consumer group lag monitoring, and throughput tuning for high-volume telemetry.
  • Manage log shipping infrastructure using Fluent Bit, Vector, or Fluentd; define structured logging standards and field normalization across services.
  • Build Grafana dashboards and alerting that engineers actually use — clear, actionable, with well-structured variables and thresholds.
  • Work with platform and application teams to improve sampling strategies (head/tail), batching, and context propagation across distributed services.
  • Contribute to incident response, post-mortems, and reliability improvements driven by observability signals.
  • Mentor engineers on observability practices, tooling, and structured logging standards.

Requirements:

  • 6+ years in a DevOps, SRE, or platform engineering role, with at least 2 years focused on observability tooling at production scale.
  • Deep hands-on experience with VictoriaMetrics (or Prometheus) — MetricsQL/PromQL, exporters, service discovery, remote write, downsampling, and retention management.
  • Solid OpenSearch or Elasticsearch skills: cluster operations, Query DSL, ISM policies, and ingest pipeline design.
  • Production experience with OpenTelemetry: Collector configuration, OTLP, context propagation, and instrumentation across multiple languages.
  • Strong Kafka skills — producer/consumer patterns, consumer group management, Kafka Connect, Schema Registry, and JMX-based monitoring. Strimzi experience a plus if you've run Kafka on Kubernetes.
  • Proficiency with log shippers (Fluent Bit, Vector, Fluentd) and structured log parsing/normalization.
  • Working knowledge of Kubernetes (operators, Helm), Argo CD/GitOps, and Terraform/Ansible.
  • Comfortable in a hybrid AWS + on-prem environment; solid understanding of networking as it applies to scraping and shipping pipelines.
  • Scripting ability in Bash or Python for automation and tooling.
  • Strong communication skills — you can explain observability tradeoffs clearly to engineers and non-engineers alike.
  • English proficiency.


The Company
18 Employees

What We Do

Capital provides software that enables founders to raise, hold, spend, and send funds all in one place. Capital has evolved its flagship fundraising tool (formerly known as Party Round) to provide founders with banking solutions that streamline their startups.
