L2 Support Engineer

Posted 5 Days Ago
Cambridge, MA, USA
In-Office
100K-140K Annually
Mid level
Artificial Intelligence • Software • Generative AI • Automation
The Role
Provide L2 support for an AI software platform: deploy/install into customer environments, triage and resolve issues across Kubernetes, cloud providers, networking, storage, and services; build runbooks, dashboards, and alerts; reproduce issues safely; write evidence-backed escalations and post-incident notes; participate in rotating on-call coverage aligned to US time zones.
Summary Generated by Built In

About Blitzy

Blitzy is a Cambridge, MA-based AI software development platform on a mission to revolutionize the software development life cycle by autonomously building custom software to unlock the next industrial revolution. We're transforming how enterprises build software, turning enterprise requirements into production-ready code with an agentic software development platform that can autonomously execute 80% of software development work. We're backed by multiple tier 1 investors, and our founders have proven success building previous start-ups.

Location: Cambridge, MA (On-site)

Compensation: $100,000 - $140,000 salary + equity

 

The Role

This role supports our clients and ensures a stable environment across the full lifecycle: installation, ongoing upgrades, and day-to-day operation. The L2 Support Engineer works alongside L1 to triage and resolve issues, and escalates unresolved defects to engineering. The work spans Kubernetes, Docker, and the major cloud providers.

 

What Success Looks Like

  • Customers' issues are resolved faster and escalated more cleanly.

  • Recurring problems turn into runbooks, dashboards, and alerts, not repeat tickets.

  • Engineering trusts your escalations because they come with proof, not guesses.

  • Customers trust your communication because it's clear, honest, and on time.

 

Areas of Ownership

  • Deploy and install the platform into customer environments, and troubleshoot installation issues.

  • Support ongoing upgrades and day-to-day operation, keeping customer environments stable.

  • Work alongside L1 to triage and resolve customer-reported issues, driving them to resolution or escalation.

  • Diagnose failures across the stack: compute, networking, storage, and the services running on it.

  • Reproduce issues safely against live (often multi-tenant) environments using read-only diagnostics first.

  • Build and maintain dashboards, monitors, and runbooks so recurring issues get faster to fix, or stop recurring.

  • Write up clear, evidence-backed escalations and post-incident notes.

  • Communicate status and resolution to customers clearly and on time.

 

Required Experience

  • Distributed-systems debugging. Reason about a request crossing multiple services, queues, and network hops, and isolate which hop failed. You debug by forming a hypothesis and confirming it with evidence (logs, pod state, queue depth, DB rows), not by guessing.

  • Kubernetes & Docker.

  • Major cloud providers: GCP, AWS, and Azure. Hands-on with at least one deeply and able to work across the others: managed Kubernetes (GKE/AKS), cloud logging, IAM/auth basics, and cloud disk/storage behavior.

  • Strong monitoring & observability practice. Fluent with an APM/observability stack (Datadog or equivalent): log queries, correlating across services by request/trace IDs, reading traces, and building dashboards and alerts. You reach for the data before theorizing.

Additional Skills & Experience

  • Python and Redis literacy.

  • Basic message queueing. Command transport runs over a message queue (Redis/rq). Comfort inspecting queue depth, backlogs, and stuck/failed jobs; concepts transfer from any broker.

  • Networking & WebSockets. Many of our hardest issues are connection problems: WebSocket/Socket.IO drops, NAT/idle/LB timeouts, half-open sockets, DNS-vs-routing, TLS. Tell a transport fault from an application fault.

  • SQL / PostgreSQL. Query operational tables to confirm what the system recorded.

  • Source-control platforms. GitHub (incl. GitHub Enterprise Server), Azure DevOps, and/or GitLab: clone/push/pull, access tokens, app credentials, and their failure modes.

  • CI/CD, Helm & deploy integrity. Many "sudden regressions" are a bad or partial deploy: check what version is actually running before chasing architecture theories. Helm and container deploy pipelines expected. ArgoCD is a plus.

  • Secrets management. Comfort handling secrets, credentials, and certificates safely, ideally with Vault (strongly preferred).

  • Linux and Windows. Workloads run on both; comfort triaging on each OS (process inspection, filesystem, basic networking).

  • Methodical, evidence-first temperament. Hold several candidate causes at once, run the cheapest disconfirming check first, and never claim a root cause or fix you haven't proven.

  • Multi-tenant safety mindset. Environments are shared and customer-owned: default to read-only diagnostics and understand blast radius before changing anything.

  • Incident management & ticketing workflows: Jira or similar (a plus).

  • Prior customer-facing support or SRE/on-call experience (a plus).

 

Hours & On-Call: please read

This is a customer support role, and the hours can be unconventional. Customers operate primarily in US time zones, so coverage is anchored to US business hours (roughly ET–PT). If you're based outside the US, expect your working day to shift accordingly.

Incidents don't keep office hours. Expect a rotating on-call schedule and occasional evening, early-morning, or weekend escalations outside a standard 9–5. We structure for it: rotations are shared fairly, on-call is compensated or covered by time off in lieu per policy, and we protect recovery time after heavy incidents.

If you're not comfortable with US-aligned hours and periodic off-hours on-call, this likely isn't the right role, and that's completely fine.

 

Our Culture

Who we are:

Led by two pioneering co-founders, we are one of the fastest growing companies in the U.S., creating our own category of enterprise autonomous software development. We automate thousands of hours of software development for our customers, with strong representation within the Fortune 500.

How we work:

We move Blitzy Fast: Time is both our company's and our clients' most precious asset. We move quickly and decisively to innovate internally and deliver exceptional software externally.

Championship Mindset: We operate like a professional sports team. We win as a team by holding ourselves and each other to high standards, collaborating in-person, and remaining focused on the mission.

Passion for Invention: We're pushing the frontier of what's possible, requiring constant innovation and iteration.

We Work for the Customer: We focus on delivering outsized value to the customers we work with and expanding those relationships into deep, meaningful partnerships.

We believe in being 'everyday athletes': taking care of ourselves so we can bring our best minds to work. We promote great sleep, movement, and restorative activities for optimal mental performance. It makes for a happier and more productive team.

Blitzy is an equal opportunity employer committed to building a diverse and inclusive team. We believe different perspectives make us stronger.

Skills Required

  • Distributed-systems debugging (hypothesis-driven, evidence-based triage across services, queues, and network hops)
  • Kubernetes
  • Docker
  • Experience with major cloud providers (GCP, AWS, Azure) and managed Kubernetes (GKE/AKS); cloud logging and IAM/auth basics
  • Strong monitoring and observability practice (APM/Datadog or equivalent; log queries, traces, dashboards, alerts)
  • Willingness to work US-aligned hours and participate in a rotating on-call schedule (evenings, early mornings, weekends as needed)
  • Python literacy
  • Redis literacy (including Redis/rq concepts, queue depth/backlog inspection)
  • Basic message queueing concepts and troubleshooting
  • Networking and WebSockets troubleshooting (Socket.IO, NAT/idle/LB timeouts, TLS, DNS/routing)
  • SQL / PostgreSQL operational queries
  • Familiarity with source-control platforms (GitHub, GitHub Enterprise Server, Azure DevOps, GitLab) and access token/app credential failure modes
  • CI/CD and deployment integrity experience; Helm; familiarity with container deploy pipelines (ArgoCD a plus)
  • Secrets management experience, ideally HashiCorp Vault (strongly preferred)
  • Comfort triaging on Linux and Windows (process inspection, filesystem, basic networking)
  • Incident management and ticketing workflows (Jira or similar)
  • Prior customer-facing support or SRE/on-call experience

The Company
Year Founded: 2023

What We Do

Blitzy is an autonomous software development platform that enables development teams to transform six-month software projects into six-day turnarounds. By leveraging an agentic platform with thousands of specialized AI agents and 'System 2 Thinking,' Blitzy automates over 80% of the software development lifecycle for enterprise codebases, delivering high-quality, production-ready code with precision and speed.
