Senior Site Reliability and Infrastructure Engineer

160K-220K Annually
Senior level
Machine Learning • Robotics • Software
The Role
First full-time SRE/infrastructure engineer responsible for scaling and hardening a platform that runs Airflow DAGs on Astronomer across AWS and Kubernetes. Build observability, monitoring, alerting, SLOs, CI/CD guardrails, and operational tooling for ML inference and data pipelines. Partner with data and engineering teams, create runbooks, reduce toil, and improve operational readiness for customer-facing high-volume processing.

In the face of rising threats like severe storms and wildfires, increasing pressure on affordability, and unprecedented demands for system expansion, Treeswift empowers energy companies to modernize their field work to meet the growth and challenges ahead.

To accomplish our mission, we deploy our sensors into our customers' field operations, typically on backpacks or vehicles. The resulting trove of LiDAR and imagery data is processed through our AI models to deliver actionable analytics through our web platform. To date, our technology has enabled utilities to reduce wildfire, regulatory and outage risk from vegetation, avoid delays and cost overruns in new construction, and accelerate recovery from severe storms.

Since our first utility pilot in June 2024, we have grown rapidly: we now work with three of the five largest utilities in the United States and are expanding to new customers and use cases.

To tackle this challenge, we are bringing together a team of mission-driven experts with deep industry experience in robotics (Penn, Caltech, CMU) and enterprise software development (Palantir, Stripe, Oracle, MongoDB). We have raised funding from leading investors including Penny Pritzker’s Inspired Capital.

Treeswift is headquartered in Lower Manhattan and maintains an office in Philadelphia. We also have some customer-facing team members based closer to our customer sites (e.g., the Bay Area). We strongly encourage our employees (including software engineers) to visit customer sites. Ask us about this!

We hope you’ll join us on this journey.

About the role

  • You’ll be our first full-time SRE/infrastructure engineer, so we’ll look to you for leadership on how to improve and scale our infrastructure to support each part of the platform. Our data pipeline, machine learning training platform, and web app could all benefit from further productionization.

  • Help us scale and harden the platform that schedules our pipelines, runs machine learning training, and hosts our web app. We run Apache Airflow on Astronomer with DAGs that orchestrate high-volume processing across AWS and Kubernetes, including machine learning inference inside pipeline tasks. You will build the observability and reliability foundations that let us run this system confidently as customer data volume grows: monitoring, alerting, performance/cost visibility, and clear operational practices.

  • Stay curious, collaborative, and cross-functional while also taking ownership of problems. We translate complex, real-world requirements from a critical industry into high-quality data products, so understanding the business holistically is key. We take pride in managing complexity and providing high-fidelity data that our customers can use to make better-informed decisions.

Responsibilities
  • Partner with the data platform and engineering teams to understand how changes propagate across pipeline execution (Astronomer-hosted Airflow DAGs), containerized workers (Kubernetes), and AWS services (S3, SQS, Lambda, Step Functions, ECS).

  • Design and implement reliability and observability for high-volume pipeline operations, including:

    • actionable monitoring/alerting for DAG/task failures and reruns

    • visibility into operational workflows like flight orchestration (including DLQ/failed-message alerting and notification pathways)

    • dashboards and SLO/SLI definitions focused on correctness, throughput, and pipeline health

  • Own CI/CD guardrails for production changes: build/deploy validation and safe rollout mechanics for Astronomer deployments (image builds pushed to ECR and Airflow configuration changes applied through Astronomer CLI variable updates).

  • Make machine learning inference operations more reliable and observable:

    • instrument inference runs executed inside pipeline runners (model checkpoint resolution, S3 sync behavior, thresholds and fallback behavior, and output correctness)

    • add operational visibility for inference outcomes (e.g., unknown classification rates, fallback usage, and failure modes)

  • Create operational tooling and continuously improve systems (‘leave it better than you found it’), including:

    • runbooks, incident learnings, and engineering standards for debugging at scale

    • automation that removes toil from deployment and operations workflows as we learn what hurts most

On-call / incident response

There is currently no established on-call rotation for this platform, and the pipelines do not require real-time processing. That said, you’ll still help lead reliability improvements and operational readiness, so that the team has faster diagnosis, better alerts, and safer releases when issues do occur.

What we’re looking for
  • You are an experienced software engineer whose last 7-10 years have involved significant work in observability, systems/infrastructure engineering, SRE, or DevOps (ideally in a cloud environment).

  • Ability to reason about architecture end-to-end and articulate your thoughts with product impact in mind (data movement, execution, failure handling, and operational visibility).

  • Hands-on experience with infrastructure-as-code (Terraform or similar) and using it to deliver reliable environments.

  • Experience with container orchestration and debugging in practice (Kubernetes and/or ECS/container-based deployments).

  • Strong Linux debugging skills and demonstrated ability to investigate production issues with logs/metrics and clear hypotheses.

  • Empathy and communication: you can collaborate effectively with engineers across teams (especially the data platform team) and explain tradeoffs clearly.

Nice-to-haves
  • Experience working in early-stage or fast-moving environments where ownership and processes evolve quickly.

  • Experience with Apache Airflow and/or Astronomer.

  • Experience with AWS, although other cloud providers are fine. (DuploCloud experience is also helpful.)

  • Experience with geospatial/imagery/LiDAR/point-cloud domains.

  • ML Ops skills (model deployment/inference reliability, packaging, CI/CD for model artifacts, and operational observability for inference pipelines).

Work location

This is a full-time, hybrid role based out of our Lower Manhattan, NYC office (2 days per week in person, currently pinned to Tuesdays and Wednesdays).

Salary

The estimated salary range for this position is $160,000 - $220,000 USD. Total compensation for this position is determined by skills, qualifications, relevant work experience, location, and other factors. This salary estimate excludes the value of any potential bonuses, the value of any benefits offered, and the potential future value of any long-term incentives. This information is provided per the New York City Human Rights Law. Please note that the range provided is applicable only to New York City-based applicants. Base compensation may vary if the work location is outside of New York City.

Treeswift is proud to be an equal opportunity employer. We provide employment opportunities without regard to age, race, color, ancestry, national origin, religion, disability, sex, gender identity or expression, sexual orientation, veteran status, or any other protected status in accordance with applicable law.

If you require any accommodations during the recruitment process, whether it be alternate forms of material, accessible meeting rooms, etc., please let us know and we will work with you to meet your needs. 

Skills Required

  • 7-10 years of experience with observability, systems/infrastructure engineering, SRE, or DevOps
  • Ability to reason about end-to-end architecture and communicate product impact
  • Hands-on experience with infrastructure-as-code (Terraform or similar)
  • Experience with container orchestration and debugging (Kubernetes and/or ECS)
  • Strong Linux debugging skills and production incident investigation experience
  • Experience designing observability, monitoring, alerting, dashboards, and SLO/SLI definitions
  • Empathy and effective cross-team communication
  • Experience with Apache Airflow and/or Astronomer
  • Experience with AWS (S3, SQS, Lambda, Step Functions, ECS, ECR)
  • Experience in early-stage or fast-moving environments
  • Experience with geospatial/imagery/LiDAR/point-cloud domains
  • ML Ops skills for model deployment, inference reliability, and CI/CD for model artifacts
  • Familiarity with DuploCloud

The Company
HQ: Philadelphia, PA
27 Employees
Year Founded: 2020

What We Do

Treeswift is building the next generation of forest monitoring systems. We provide forest stakeholders with precision data and analyses that are easily accessible and flexible. Our services are used in carbon capture estimation, timber value estimation, deforestation monitoring, advanced growth forecasting, and forest management. We use state-of-the-art robotics and machine learning technology to build the forestry tools of the future.
