Blitzy

Senior Site Reliability Engineer

Posted 21 Days Ago

Cambridge, MA, USA

In-Office

160K-180K Annually

Senior level

Artificial Intelligence • Software • Generative AI • Automation

The Role

Lead the design, build, and operation of scalable, fault-tolerant cloud infrastructure. Define SLOs/SLAs, improve observability and incident response, and own CI/CD and deployment automation. Partner with engineering teams on reliability, capacity planning, performance benchmarking, cost optimization, and security for an AI platform.

About Blitzy

Blitzy is a Cambridge, MA-based AI software development platform on a mission to revolutionize the software development life cycle by autonomously building custom software to unlock the next industrial revolution. We're transforming how enterprises build software, turning enterprise requirements into production-ready code with an agentic software development platform that can autonomously execute 80% of software development work. We're backed by multiple tier 1 investors, and our founders have proven success at previous start-ups.

Location: Cambridge, MA (In-Office)

Compensation: $160,000 - $180,000 + equity eligibility based on performance

The Role

As a Senior Site Reliability Engineer at Blitzy's Cambridge headquarters, you will be the backbone of our platform's reliability, scalability, and operational excellence. You'll work at the intersection of software engineering and infrastructure, ensuring our AI-powered development platform remains highly available and performant as we scale rapidly. This is a high-impact, hands-on role for an engineer who thrives in a fast-moving environment and takes deep ownership of the systems they build.

What Success Looks Like

In 30 days: You have a deep understanding of Blitzy's infrastructure architecture, have identified key reliability risks, and are actively contributing to on-call rotations.
In 90 days: You have shipped meaningful improvements to observability, incident response workflows, and deployment pipelines that measurably reduce MTTR and increase system uptime.
In 6 months: You have driven at least one major reliability initiative from inception to production, established SLO/SLA frameworks for critical services, and are a trusted technical voice shaping our infrastructure roadmap.

Areas of Ownership

Design, build, and operate scalable, fault-tolerant infrastructure across cloud environments (AWS, GCP, or Azure).
Define and enforce SLOs, SLAs, and error budgets; lead blameless postmortems and drive systemic improvements.
Build and maintain robust CI/CD pipelines, release automation, and deployment infrastructure.
Own observability: design and maintain logging, metrics, tracing, and alerting stacks (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).
Partner closely with software engineering teams to embed reliability practices into the development lifecycle.
Drive capacity planning, performance benchmarking, and cost optimization across our infrastructure.
Champion security best practices within the infrastructure and deployment layers.

Required Experience

5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
Strong proficiency in at least one major cloud platform (AWS preferred); experience with Kubernetes and container orchestration at scale.
Hands-on experience with infrastructure-as-code tools (Terraform, Pulumi, or equivalent).
Proven track record designing and maintaining high-availability, distributed systems.
Deep expertise in observability tooling, incident management, and on-call practices.
Strong scripting and automation skills (Python, Go, Bash, or similar).
Excellent communication skills with the ability to collaborate across engineering teams and present technical findings to leadership.

What Makes You Stand Out

Experience supporting AI/ML workloads or GPU-accelerated infrastructure.
Prior experience in a high-growth startup environment where you wore multiple hats.
Familiarity with eBPF, service mesh technologies (Istio, Linkerd), or advanced networking.
Contributions to open-source SRE/DevOps tooling or communities.
Experience building global, multi-region infrastructure with strict latency and availability requirements.

What Makes This Role Different

You won't be maintaining legacy systems or fighting fires in a sprawling monolith. At Blitzy, you're building reliability into a greenfield AI platform that is redefining how the world creates software. You'll have direct influence over architectural decisions, work side-by-side with world-class engineers, and see the tangible impact of your work as we scale to serve Fortune 500 customers. As a founding member of the Cambridge SRE team, you'll help shape the culture and technical standards of a team that will grow with the company.

Our Culture

Who we are:

Led by two pioneering co-founders, we are one of the fastest growing companies in the U.S., creating our own category of enterprise autonomous software development. We automate thousands of hours of software development for our customers, which include strong representation within the Fortune 500.

How we work:

We move Blitzy Fast: Time is both our company's and our clients' most precious asset. We move quickly and decisively to innovate internally and deliver exceptional software externally.

Championship Mindset: We operate like a professional sports team. We win as a team by holding ourselves and each other to high standards, collaborating in-person, and remaining focused on the mission.

Passion for Invention: We're pushing the frontier of what's possible, requiring constant innovation and iteration.

We Work for the Customer: We focus on delivering outsized value to the customers we work with and expanding those relationships into deep, meaningful partnerships.

We believe in being 'everyday athletes'—taking care of ourselves so we can bring our best minds to work. We promote great sleep, movement, and restorative activities for optimal mental performance. It makes for a happier and more productive team.

Blitzy is an equal opportunity employer committed to building a diverse and inclusive team. We believe different perspectives make us stronger.

The Company

Year Founded: 2023

What We Do

Blitzy is an autonomous software development platform that enables development teams to transform six-month software projects into six-day turnarounds. By leveraging an agentic platform with thousands of specialized AI agents and 'System 2 Thinking,' Blitzy automates over 80% of the software development lifecycle for enterprise codebases, delivering high-quality, production-ready code with precision and speed.