AI Engineer - Data Platform

Reposted 4 Days Ago
New York, NY
In-Office
150K-300K
Mid level
Software
Our AI SRE agent is on call, so you don’t have to be.
The Role
Design, build, and maintain AI-driven backend systems for observability, focusing on infrastructure development and optimization across cloud and on-premise environments.

About Traversal

Traversal is the AI Site Reliability Engineer (SRE) for the enterprise—already trusted by some of the largest companies in the world to troubleshoot, remediate, and even prevent the most complex production incidents. Our mission is to free engineers from endless firefighting and enable them to focus on creative, high-impact work. 

Our roots remain deeply embedded in AI research, and we’re channeling that scientific rigor and creativity into building the premier AI agent lab for the enterprise. What we’re proudest of is assembling the most talented yet nicest group of individuals, from researchers at MIT, Harvard, and Berkeley to world-class engineers from industry (Citadel Securities, Cockroach Labs, Datadog, DE Shaw, ServiceNow, Glean, Perplexity, Pinecone, and more), to take on one of the hardest problems for AI to solve. Without the entire team, none of this would be possible.

The Role

As an Infrastructure Engineer on the Data Platform team at Traversal, you’ll design, build, and maintain the backend systems that power our AI-driven observability platform. You’ll work across both cloud and on-prem deployments, ensuring our systems are highly reliable, performant, and capable of supporting large-scale AI operations. This hands-on role blends distributed systems engineering, low-level system design, performance optimization, observability, and AI integration—collaborating closely with engineers across the company to deliver resilient infrastructure that enables our AI agents to diagnose and remediate production incidents in real time.

Responsibilities

  • Architecture & Implementation: Contribute to the design and implementation of scalable, resilient infrastructure systems to power AI-driven root cause analysis and observability workflows. These systems must work across a variety of environments, including on-premises deployments.
  • Low-Level System Design: Work on the foundational building blocks of our infrastructure, ensuring efficient use of resources and high performance at scale.
  • Performance Optimization: Profile and tune backend systems to improve throughput, reduce latency, and minimize bottlenecks across the stack.
  • Observability Systems: Help build and maintain the internal observability stack—logs, metrics, and traces—used by our agents to understand and act on production issues.
  • Hybrid Infrastructure: Support cloud and on-prem architecture to serve both SaaS and enterprise customers.
  • Data Infrastructure: Develop and maintain low-latency, high-throughput pipelines using tools like Kafka, Postgres, and S3 for real-time telemetry workflows.
  • Tooling & Automation: Contribute to infrastructure-as-code, CI/CD tooling, and deployment systems to increase platform velocity and stability.
  • Cross-Team Collaboration: Work with AI, platform, and product teams to ensure smooth integration and shared reliability goals.
  • Using Traversal Internally: Help ensure our own observability tooling supports how we debug, monitor, and operate our systems.

Requirements

  • Professional experience with Rust (our primary language for infrastructure), or strong systems-level programming experience in OCaml, C++, C, or Zig.
  • Experience building distributed systems using a variety of application-appropriate datastores (e.g., Postgres, object storage, etc.).
  • Strength in debugging across cloud infrastructure, networking layers, and production systems (instrumentation, provisioning, bug fixes, reliability improvements).
  • Experience with performance profiling and optimization in backend systems.
  • Exposure to low-level system design concepts (e.g., concurrency models, storage internals, and OS- and DB-level tuning).

Nice to Have

  • Experience making complex software systems observable using logs, metrics, and traces.
  • Familiarity with Python-based ecosystems.
  • Background in large-scale, complex, data-driven applications, and familiarity with event streaming platforms such as Kafka.
  • Experience provisioning and managing infrastructure using Terraform, Pulumi, or other IaC tools.
  • Familiarity with AI or LLM-powered products.

Compensation

We offer competitive compensation, startup equity, health insurance, and additional benefits. The U.S. base salary range for this full-time, in-person role in New York is $150,000–$300,000, plus equity and benefits. Our salary ranges are based on location, level, and role. Individual compensation is determined by experience, skills, and job-related knowledge.

Why You Should Join Us

We’ll make sure you’re fully supported with health insurance, a great tech setup, flexible time off, and plenty of in-office snacks. We offer competitive salary and equity packages, and give thoughtful consideration to every hire on our small, high-impact team.

Traversal is fully in-office, 5 days a week, based in New York near Madison Square Park. We have a collaborative, hard-working culture and are energized by building the future of AI-powered software maintenance.

Working here means owning meaningful parts of the product, having the flexibility to move fast, and learning constantly. This is a place to grow your career, make a real impact, and help define a new category of infrastructure software.

Top Skills

C
C++
Kafka
OCaml
Postgres
Pulumi
Python
Rust
S3
Terraform
Zig

The Company
HQ: New York, New York
0 Employees

What We Do

Traversal is building an AI site reliability engineer that troubleshoots, remediates, and even prevents production issues in complex software systems – always on call, so engineers don’t have to be.

Already deployed in some of the world’s largest enterprises, Traversal improves the resilience of mission-critical systems — reducing MTTD and MTTR by up to 90% and supporting services that reach millions globally.
