Traversal

AI Engineer - Data Platform

Reposted 13 Days Ago

New York, NY, USA

In-Office

150K-300K Annually

Mid level

Software

Our AI SRE agent is on call, so you don’t have to be.

The Role

Design, build, and maintain AI-driven backend systems for observability, focusing on infrastructure development and optimization across cloud and on-premises environments.

Summary Generated by Built In

About Traversal

Traversal is the AI Site Reliability Engineer (SRE) for the enterprise—already trusted by some of the largest companies in the world to troubleshoot, remediate, and even prevent the most complex production incidents. Our mission is to free engineers from endless firefighting and enable them to focus on creative, high-impact work.

Our roots remain deeply embedded in AI research, and we’re channeling that scientific rigor and creativity into building the premier AI agent lab for the enterprise. What we’re proudest of is assembling the most talented yet nicest group of individuals, from researchers at MIT, Harvard, and Berkeley to world-class engineers from industry (Citadel Securities, Cockroach Labs, Datadog, DE Shaw, ServiceNow, Glean, Perplexity, Pinecone, and more), to take on one of the hardest problems for AI to solve. Without the entire team, none of this would be possible.

The Role

As an Infrastructure Engineer on the Data Platform team at Traversal, you’ll design, build, and maintain the backend systems that power our AI-driven observability platform. You’ll work across both cloud and on-prem deployments, ensuring our systems are highly reliable, performant, and capable of supporting large-scale AI operations. This hands-on role blends distributed systems engineering, low-level system design, performance optimization, observability, and AI integration—collaborating closely with engineers across the company to deliver resilient infrastructure that enables our AI agents to diagnose and remediate production incidents in real time.

Responsibilities

Architecture & Implementation: Contribute to the design and implementation of scalable, resilient infrastructure systems that power AI-driven root cause analysis and observability workflows across diverse on-premises environments.
Low-Level System Design: Work on the foundational building blocks of our infrastructure, ensuring efficient use of resources and high performance at scale.
Performance Optimization: Profile and tune backend systems to improve throughput, reduce latency, and minimize bottlenecks across the stack.
Observability Systems: Help build and maintain the internal observability stack—logs, metrics, and traces—used by our agents to understand and act on production issues.
Hybrid Infrastructure: Support architectures for both cloud-hosted (SaaS) and on-prem deployments to serve enterprise customers.
Data Infrastructure: Develop and maintain low-latency, high-throughput pipelines using tools like Kafka, Postgres, and S3 for real-time telemetry workflows.
Tooling & Automation: Contribute to infrastructure-as-code, CI/CD tooling, and deployment systems to increase platform velocity and stability.
Cross-Team Collaboration: Work with AI, platform, and product teams to ensure smooth integration and shared reliability goals.
Using Traversal Internally: Help ensure our own observability tooling supports how we debug, monitor, and operate our systems.

Requirements

Professional experience with Rust (our primary language for infrastructure), or strong systems-level programming experience in OCaml, C++, C, or Zig.
Experience building distributed systems using a variety of application-appropriate datastores (e.g., Postgres, object storage).
Strength in debugging across cloud infrastructure, networking layers, and production systems (instrumentation, provisioning, bug fixes, reliability improvements).
Experience with performance profiling and optimization in backend systems.
Exposure to low-level system design concepts (e.g., concurrency models, storage internals, and OS- and DB-level tuning).

Nice to Have

Experience making complex software systems observable using logs, metrics, and traces.
Familiarity with Python-based ecosystems.
Background in large-scale, complex, data-driven applications, and familiarity with event streaming platforms such as Kafka.
Experience provisioning and managing infrastructure using Terraform, Pulumi, or other IaC tools.
Familiarity with AI or LLM-powered products.

Compensation

We offer competitive compensation, startup equity, health insurance, and additional benefits. The U.S. base salary range for this full-time, in-person role in New York is $150,000–$300,000, plus equity and benefits. Our salary ranges are based on location, level, and role. Individual compensation is determined by experience, skills, and job-related knowledge.

Why You Should Join Us

We’ll make sure you’re fully supported with health insurance, a great tech setup, flexible time off, and plenty of in-office snacks. We offer competitive salary and equity packages, and we give thoughtful consideration to every hire on our small, high-impact team.

Traversal is fully in-office, 5 days a week, based in New York near Madison Square Park. We have a collaborative, hard-working culture and are energized by building the future of AI-powered software maintenance.

Working here means owning meaningful parts of the product, having the flexibility to move fast, and learning constantly. This is a place to grow your career, make a real impact, and help define a new category of infrastructure software.

Top Skills

C++

Kafka

OCaml

Postgres

Pulumi

Python

Rust

Terraform

Zig

The Company

HQ: New York, New York

0 Employees

What We Do

Traversal is building an AI site reliability engineer that troubleshoots, remediates, and even prevents production issues in complex software systems, always on call so engineers don’t have to be. Already deployed in some of the world’s largest enterprises, Traversal improves the resilience of mission-critical systems, reducing MTTD and MTTR by up to 90% and supporting services that reach millions globally.