Judgment Labs

Senior Data Infrastructure Engineer

Reposted 10 Days Ago

San Francisco, CA, USA

In-Office

Senior level

Artificial Intelligence • Software

The Role

Build and scale real-time data pipelines processing 100k+ traces/sec, run LLM-based scoring and clustering in near-real time, optimize LLM serving and ClickHouse OLAP performance, and own the infrastructure roadmap from ingestion through analytics.

Summary Generated by Built In

Judgment Labs builds infrastructure for Agent Behavior Monitoring (ABM). While traditional observability focuses on logging exceptions and latency, our ABM surfaces behavioral anomalies such as instruction drifts and context retrieval loss in scaled production environments.

Hundreds of teams building autonomous agents rely on Judgment to understand how their systems behave post-deployment. Instead of reactive incident triage, they cluster patterns across conversations and workflows, correlate regressions to specific interaction types, and pinpoint where reliability breaks down.

We’ve raised $30M+ across two rounds in the past five months. Our investors include Lightspeed, SV Angel, Valor Equity Partners, Nova Global, Chris Manning, Michael Ovitz, Michael Abbott, Cory Levy, Kevin Hartz, and others.

The Role:

We are looking for a Senior Data Infrastructure Engineer to build and scale the real-time data pipelines that power agent behavior analysis at production scale. This role is crucial for processing hundreds of thousands of traces per second, running LLM-based scoring and clustering in near-real time, and delivering the low-latency query performance that enables teams to understand agent behavior as it happens. We need someone who has built petabyte-scale data systems, knows how to squeeze performance out of OLAP databases, and can own the data infrastructure from ingestion through analytics.

What You'll Do:

Design and automate large-scale, high-performance streaming and batch data processing systems to power Judgment's behavioral analysis products.
Partner closely with infrastructure and backend teams to improve scalability, data governance, and efficiency.
Evangelize high-quality software engineering practices for data infrastructure at scale.
Advocate for a high bar on data and engineering quality: reliable, efficient, well-documented, testable, and maintainable.
Design data models for optimal storage and access, with thoughtful data flows to power critical product requirements.
Optimize OLAP database performance through schema design, partitioning strategies, storage tiering, and access pattern analysis.

What We're Looking For:

6+ years of relevant industry experience building and operating high-throughput, petabyte-scale data pipelines in production.
Experience collaborating with infrastructure, backend, and product partners to align on data flow and system design.
Experience designing and deploying high-performance systems with reliable monitoring and observability practices.
Deep expertise with streaming and batch processing systems (Kafka, Spark, Flink, or Ray) operating at petabyte scale.
Hands-on OLAP database engineering experience, including with columnar databases (ClickHouse or similar) and distributed query engines (Presto or similar).
Excellent communication skills, both written and verbal.

Nice to have:

Experience building pipelines that call LLM APIs at scale: request batching, rate limit management, cost optimization.
Familiarity with ML workflow orchestration (Airflow, Dagster, Prefect).
Experience with embedding generation pipelines or vector search infrastructure.
Background in observability, log processing, or event stream platforms (Datadog, Honeycomb, Sentry).
Experience with data quality monitoring and anomaly detection within pipelines.

Why Judgment?

Agents can’t work without this. Today’s agents hallucinate, drift, and break in production. We’re building the infrastructure that fixes this: the monitoring layer that makes agents self-improving.
We’re wired to win. We're a team of fewer than 20, but we ship like 50+ on the daily. You'll be working with olympiad medalists, debate champions, and competitive athletes who bring that same intensity to company building.
Fast track to founding. Our engineers interface directly with customers, ship code into their environments, and use their feedback to dictate what’s next on the roadmap. Everyone on the team is either an ex-founder or a founder-to-be.
We make sure our people do their best work. If you deserve a spot on the team, money will never get in the way of it. Full benefits, Equinox, and a private chef to take care of you. We sprint hard but we play hard; ask us about our Smash/Mario Kart tournaments.
We work in person in San Francisco.

Skills Required

Experience building and tuning high-throughput petabyte-scale data pipelines
Deep knowledge of data infrastructure (Apache Spark, Ray, dbt, Airflow/Dagster)
Experience with OLAP database engineering (ClickHouse)
Comfortable with cloud infrastructure and batch + streaming pipelines
Design streaming pipelines to score and cluster 100k+ traces/s using LLM APIs
Senior-level ownership of infrastructure roadmap, architecture design, and shipping fixes
Ability to analyze LLM serving bottlenecks (flamegraphs) and improve RPS via batching and concurrency
On-site work in San Francisco
Experience with LLM inference and serving optimizations (speculative decoding, continuous/dynamic batching, KV cache management)
Familiarity with quantization techniques (INT8, INT4), multi-GPU serving, and tensor parallelism
Background from observability companies, trading, RecSys/ML big tech, or AI labs

The Company

HQ: San Francisco, California

20 Employees

Year Founded: 2025

What We Do

Judgment Labs builds agent behavior monitoring (ABM) infrastructure. Judgment provides a toolkit to track and judge agent behavior in online and offline setups, enabling you to convert high-signal interaction data from production/test environments into more reliable agents.