Stord

Staff Site Reliability Engineer

Posted 7 Days Ago

Hiring Remotely in United States

Remote

Senior level

Logistics • Software

The Role

Seeking a Staff Site Reliability Engineer to enhance infrastructure reliability and performance using advanced engineering principles, primarily on Google Cloud Platform.

Stord is The Consumer Experience Company, powering seamless checkout through delivery for today's leading brands. Stord is growing rapidly and is on track to double its revenue in the next 18 months. To meet and exceed this target, Stord is strategically scaling teams across the entire company and seeking energetic experts to help us achieve our mission.

By combining comprehensive commerce-enablement technology with high-volume fulfillment services, Stord provides brands a platform to compete with retail giants. Stord manages over $10 billion of commerce annually through its fulfillment, warehousing, transportation, and operator-built software suite including OMS, Pre- and Post-Purchase, and WMS platforms. Stord is leveling the playing field for all brands to deliver the best consumer experience at scale.

With Stord, brands can increase cart conversion, improve unit economics, and drive sustained customer loyalty. Stord’s end-to-end commerce solutions combine best-in-class omnichannel fulfillment and shipping with leading technology to ensure fast shipping, reliable delivery promises, easy access to more channels, and improved margins on every order.

Hundreds of leading DTC and B2B companies like AG1, True Classic, Native, Seed Health, quip, goodr, Sundays for Dogs, and more trust Stord to deliver industry-leading consumer experiences on every order. Stord is headquartered in Atlanta with facilities across the United States, Canada, and Europe. Stord is backed by top-tier investors including Kleiner Perkins, Franklin Templeton, Founders Fund, Strike Capital, Baillie Gifford, and Salesforce Ventures.

We are seeking a scrappy, high-ownership Staff Site Reliability Engineer (SRE) to join our small, fast-moving SRE team. This role requires someone who can hit the ground running and make an immediate impact on the reliability, scalability, and performance of our production systems. You'll be a key technical leader bridging development and operations, applying advanced software engineering principles to complex infrastructure challenges with minimal hand-holding. In this high-autonomy environment, you'll build and run high-availability services, architect automation solutions, establish robust monitoring systems, and mentor team members while taking full ownership of critical infrastructure decisions.

What You’ll Do:

Infrastructure & Platform Management

Lead architecture decisions to deliver scalable and reliable infrastructure, primarily on Google Cloud Platform (GCP)
Implement Infrastructure as Code (IaC) using Terraform, CloudFormation, Pulumi, or similar
Manage containerized environments with Docker and Kubernetes
Drive system performance tuning, capacity planning, and resource optimization

Reliability & Monitoring

Define and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
Build robust monitoring, alerting, and observability solutions using Prometheus, Grafana, Datadog, or New Relic
Develop and maintain disaster recovery and business continuity strategies

Automation & DevOps

Design and maintain CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions, etc.)
Automate operational workflows and infrastructure provisioning
Implement configuration management with Ansible, Chef, Puppet, or similar tools
Develop custom tooling and scripts to enhance operational efficiency

Collaboration & Support

Partner with engineering teams to improve deployment practices and application reliability
Provide escalation support for production incidents and lead post-incident reviews
Conduct technical design reviews and offer architectural guidance
Mentor junior engineers on SRE and infrastructure best practices
Participate in on-call rotations for critical systems

What You’ll Need:

Technical Skills

8+ years of experience in site reliability, platform engineering, or infrastructure roles with leadership exposure
Proficiency in at least one programming language (Python, Go, Java, etc.)
Strong hands-on experience with GCP and its core services
Expertise in containerization (Docker) and orchestration (Kubernetes)
Deep knowledge of Infrastructure as Code (Terraform, CloudFormation, etc.)
Skilled in monitoring/observability (Prometheus, Grafana, ELK, etc.)
Solid understanding of networking, load balancing, and distributed systems
Experience with Git and collaborative development workflows

Core Competencies

Exceptional troubleshooting and problem-solving abilities
Strong grasp of system design principles and scalability patterns
Experience with incident management and post-mortem practices
Familiarity with security best practices and compliance standards
Excellent communication skills and ability to work cross-functionally

Preferred Qualifications:

Database administration experience (PostgreSQL, MySQL, Redis, etc.)
Familiarity with event-driven systems and platforms (Kafka, Pub/Sub, etc.)
Experience with log aggregation tools (ELK, Splunk, Fluentd)
Exposure to chaos engineering and resilience testing
Performance testing and optimization experience
Relevant GCP certifications (Cloud Architect, Cloud DevOps Engineer)
Knowledge of GCP-specific services (Cloud Run, GKE, Cloud Functions, BigQuery, etc.)
Experience with multi-cloud or hybrid architectures
Background in functional programming (Elixir, Haskell, F#, Clojure, etc.)
Strong DevOps background and mindset

Top Skills

Ansible

Chef

CloudFormation

Datadog

Docker

ELK

Fluentd

GitHub Actions

GitLab CI

Google Cloud Platform

Grafana

Java

Jenkins

Kafka

Kubernetes

MySQL

New Relic

Postgres

Prometheus

Pub/Sub

Pulumi

Puppet

Python

Redis

Splunk

Terraform

The Company

Atlanta, GA

222 Employees

Year Founded: 2015

What We Do

Stord is on a mission to migrate supply chains to the cloud—empowering brands to build sophisticated, agile, and integrated supply chains.

Founded in 2015 and headquartered in the heart of Atlanta's vibrant tech community, Stord is pioneering the world's first Cloud Supply Chain. The Cloud Supply Chain is the convergence of the digital and physical elements of logistics. With Stord's Cloud Supply Chain, businesses can build, expand, and optimize their physical supply chain operations across freight, warehousing, and fulfillment, with the speed, flexibility, and ease of modern cloud software.