Edgescale AI

Principal Core Engineer — Infra / SRE

Denver, CO, USA

Hybrid

$190K–$215K Annually

Expert/Leader

Artificial Intelligence • Information Technology • Software • Infrastructure as a Service (IaaS)

The Role

Own fleet-scale reliability, upgradeability, and operational excellence for an edge platform. Design and operate automated, secure lifecycle systems, observability, and incident response. Lead cross-domain, high-severity incident response, set production standards (SLOs/SLIs, change management), mentor engineers, and apply AI to accelerate diagnostics and operational workflows.

Summary Generated by Built In

The Opportunity

We’re looking for a Core Engineer at the Principal Infra / SRE level to own the reliability, scalability, upgradeability, and operational excellence of our edge platform at fleet scale.

In this role, you’ll be the technical authority for designing and operating compound capabilities that span software, infrastructure, networking, security, data, and hardware—ensuring we can reliably deploy, upgrade, and manage fleets of thousands of devices with the highest technical rigor. You will set and enforce production standards, and you have the authority to stop changes that would put fleet safety or reliability at risk. During high-severity incidents, you are the technical owner—leading root-cause analysis and driving fixes across teams.

This is a hands-on role for someone who thrives in a high-ownership setting and wants to build the infrastructure that makes real-world AI possible. You’ll operate in an AI-native way, using AI to assist diagnostics and operations while ensuring all production changes remain governed, reviewed, and auditable.

What You’ll Do

Own platform-wide reliability and scalability architecture across the fleet, including upgradeability, rollback safety, resilience, observability, and incident response.
Lead the design and delivery of compound capabilities that span multiple specialist domains (hardware, networking, security, data, infrastructure, and AI runtime).
Set and enforce production-grade standards for operational excellence, including SLOs/SLIs, error budgets, on-call readiness, change management, incident management, and postmortem practices, with the authority to stop changes that introduce unacceptable risk.
Serve as the technical owner during high-severity incidents, leading diagnosis, root-cause analysis, and coordinated remediation across teams.
Design and operate secure, automated fleet lifecycle systems for deployment, updates, configuration management, and health management at scale.
Drive the evolution of observability and telemetry systems (metrics, logs, traces, audit, fleet state) so issues are detectable, diagnosable, and preventable.
Partner with engineering and commercial teams to translate real-world constraints into platform-level requirements and prioritization decisions.
Operate in an AI-native way: develop and use AI systems to accelerate diagnostics, automate operational workflows, and increase engineering velocity, while ensuring all production changes remain governed, reviewed, and auditable.
Mentor senior engineers across domains, review technical designs, and raise the quality bar for architecture and reliability across the organization.

What Success Looks Like

In your first 3 months, you will have:

Taken full ownership of a platform-wide reliability, upgradeability, or incident reduction initiative and delivered measurable improvements in fleet stability, deployment safety, and operational clarity.
Established or strengthened production standards that reduce risk and improve consistency across releases and fleet operations.
Demonstrated strong incident ownership by leading at least one high-severity investigation through root cause and durable remediation.

In your first year, you will be:

Owning the fleet-scale operational architecture end-to-end, with clear accountability for reliability, upgradeability, scalability, and security posture across thousands of deployed systems.
Delivering step-function improvements in platform resilience and operational excellence through durable systems (automated lifecycle management, observability, incident reduction, reliability standards).
Raising engineering rigor across the organization by enforcing standards, mentoring technical leaders, and driving cross-domain architectural decisions that compound over time.

Who You Are

10+ years building and operating production infrastructure and distributed systems, including reliability engineering at scale across complex, multi-tenant or fleet environments.
Deep experience with SRE practices: SLOs/SLIs, error budgets, observability, incident response, postmortems, and operational automation (e.g., Kubernetes-based platforms, Linux systems, and automation through infrastructure-as-code).
Strong systems thinking across software, infrastructure, networking, and security, with the ability to drive outcomes across multiple domains and enforce production standards.
Proven ability to lead ambiguous, high-impact initiatives end-to-end with strong technical judgment, crisp execution, and disciplined change management.
Clear communicator and trusted technical partner to engineering leadership, with the ability to lead high-severity incident response and drive cross-team alignment.
Ownership mindset: outcomes over tasks.

Unique Experiences We Value

Designing and operating fleet management and upgrade systems at scale, including safe rollout/rollback, configuration management, and health monitoring (e.g., canary deployments, staged rollouts, and verifiable rollback mechanisms).
Building observability platforms that make complex systems diagnosable and measurable across large distributed deployments (e.g., metrics/logs/tracing pipelines, alerting, and dashboards that drive action).
Security-first operations experience (secure boot, signed updates, audit logging, default-deny posture) and working in compliance-sensitive environments with governed production changes.
Experience operating systems under real-world edge constraints (limited connectivity, bandwidth limits, variable environments, high reliability requirements) and building automation that reduces operational variance.
Applying AI to operations and engineering workflows (automated diagnostics, agentic triage, runbook generation, anomaly detection) to increase rigor and speed while keeping production pathways reviewed and auditable.

Benefits

We work in a high-ownership, real-world startup environment where you’ll move fast, build new systems, and see your impact immediately—what you ship runs in the field and drives measurable customer outcomes.
We work alongside AI every day. Writing static code, docs, or plans “by hand” is no longer accepted—here you’ll use the latest AI tools to iterate and ship faster and to apply AI with our customers at scale.
You’ll take on elite technical challenges at the frontier of infrastructure, including next-generation cloud and IoT, hardware/software/networking in real-world edge environments, the foundation for data and AI inference, and industry-leading secure systems in demanding operational (OT) settings.
You’ll learn fast by working with exceptional teammates and collaborating directly with industry leaders as partners in software, AI, and infrastructure.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. This role has a base salary range of $190,000–$215,000.
Total compensation for this role includes equity. You are eligible for meaningful equity through stock options in an early-stage, high-growth company.
You are eligible to participate in company benefit plans, which may include health, dental, and vision coverage, a 401(k) with company match, flexible PTO, paid parental leave, commuter benefits, and relocation and visa support for eligible roles.

Edgescale AI

At Edgescale AI, we’re deploying AI in the real world—helping customers apply this technology to unlock transformative productivity gains. Our work sits at the intersection of infrastructure, security, networking, and AI, where reliability and performance are non-negotiable and where solutions demand deep, distributed systems thinking.

We’re intensely AI-native. We build with AI, we ship AI, and we use it every day to accelerate how we design, test, deploy, and operate complex systems. If you want to help pave the way for applying AI in the real world, at global scale, we want to hear from you.

Edgescale AI is building an inclusive, merit-based organization. We are an equal opportunity employer and do not discriminate on the basis of any legally protected status. We value diversity, inclusion, and a shared passion for creating real-world impact.

Skills Required

10+ years building and operating production infrastructure and distributed systems
Deep experience with SRE practices: SLOs/SLIs, error budgets, observability, incident response, postmortems, operational automation
Experience with Kubernetes-based platforms, Linux systems, and automation through infrastructure-as-code
Strong systems thinking across software, infrastructure, networking, and security
Proven ability to lead ambiguous, high-impact initiatives end-to-end with disciplined change management
Clear communication and ability to lead high-severity incident response and cross-team alignment
Designing and operating fleet management and upgrade systems at scale (safe rollout/rollback, staged rollouts, canary deployments)
Building observability platforms (metrics, logs, tracing pipelines, alerting) and security-first operations (signed updates, audit logging, secure boot)
Experience operating systems in constrained edge environments and applying AI to operations/workflows

The Company

20 Employees

What We Do

Edgescale AI provides high-performance edge AI infrastructure designed to bridge the gap between the cloud and the physical edge. The company enables real-time intelligence and operational productivity gains across sectors like manufacturing, utilities, and transportation by deploying secure, on-site AI systems-in-a-box that ensure data sovereignty and automate the connection between physical devices and AI systems.