Site Reliability / Infrastructure Engineer

New York City, NY, USA
In-Office
150K-275K Annually
Mid level
Gaming • Mobile • Software
The Role
The SRE role involves ensuring reliability and scalability of the infrastructure, managing incident responses, architecting database strategies, and handling observability while collaborating with engineering teams and overseeing GCP services.
The Company

Medal

Medal is the world’s largest and fastest-growing platform for gaming clips, where millions of gamers capture, share, and relive their best moments. Every year, our players record billions of clips, each representing a unique, action-packed highlight. We’re building the next generation of gaming communities: social, monetized, and creator-powered. Our mission is to design products that make sharing, discovering, and connecting around gaming moments seamless and fun.

We raised a seed round of $133M from General Catalyst and Khosla to discover the next generation of intelligence.

The Role

Medal's infrastructure handles billions of clips, video ingestion pipelines, and social features at a massive scale most engineers never get to touch. We're looking for an SRE who cares deeply about reliability and scalability.

The work centers on reliability, incident response, scaling, and making sure our infrastructure keeps up with our growth. You'll own the on-call rotation, drive postmortems, and work directly with engineering teams to meet their infra needs.

The right person probably came through startups and scale-ups. You've been in the room when things broke at 2am, you've scaled databases under pressure, and you know the difference between a durable fix and a patch that buys you a week.

Key Responsibilities
  • Own reliability across our GCP infrastructure: Kubernetes clusters, managed services, and data pipelines, driving measurable improvements to availability and latency

  • Lead incident response end-to-end: on-call rotations, runbooks, postmortems, and the follow-through that makes sure the same thing doesn't happen twice

  • Architect and execute database scaling strategies (sharding, replication, query optimization, and capacity planning) across MySQL and Postgres at meaningful scale

  • Partner with product engineering to translate feature requirements into infrastructure designs that hold up as we grow

  • Manage and evolve our Terraform-managed GCP environment and Kubernetes cluster configurations

  • Own our Elasticsearch cluster end-to-end: capacity planning, sharding strategy, index lifecycle management, version upgrades, and performance tuning at production scale

  • Build and maintain observability across the stack: metrics, dashboards, alerting, and tracing

  • Continuously improve the reliability of our CI/CD and delivery pipelines on GitHub Actions

  • Harden IAM, secrets management, and network segmentation as part of normal infra hygiene

About You
  • You’ve worked at startups and are comfortable in an environment of rapid growth where scaling up is a priority

  • You have great judgment: you know the difference between a durable, sustainable fix and a patch that buys you a week

  • You have deep, hands-on experience scaling and sharding relational databases in production environments

  • You know GCP maybe a little too well: Kubernetes, VPC, IAM, Cloud Logging, and the managed services ecosystem

  • You are fluent in Terraform and have owned real infrastructure-as-code at scale

  • You've operated Elasticsearch in production and know how to keep a cluster healthy

  • You have strong incident response instincts: you can work a P0 calmly, communicate clearly under pressure, and run a postmortem that prevents recurrence

  • You’ve worked with GitHub Actions in a production CI/CD environment

  • You have excellent communication skills (this is crucial!): you can flag issues clearly and quickly during incidents, and you can lead and write actionable postmortems

Our Stack

Google Cloud Platform

Terraform, Salt, GitHub Actions

Java, Redis, RabbitMQ, Elasticsearch, BigQuery, and Kubernetes for the backend

Electron+React

C# and C++ for native Windows recording and more

Swift for iOS, Kotlin for Android

Benefits
  • Competitive salary and meaningful equity

  • Comprehensive medical, dental, and vision coverage

  • 401(k)

  • Wellness and fitness perks including a Wellhub membership and mental health resources

  • Paid parental leave, fertility and maternal health benefits

  • Generous PTO policy

  • Daily meals and commuter benefits at our NYC HQ in Flatiron

  • Learning and development stipend

Benefits vary by country and employment type.

Skills Required

  • Deep hands-on experience scaling and sharding relational databases
  • Experience with GCP, Kubernetes, IAM, and managed services ecosystem
  • Fluency in Terraform and infrastructure-as-code
  • Strong incident response instincts and calmness under pressure
  • Experience with GitHub Actions in CI/CD environments
  • Excellent communication skills for incident management
  • Experience operating Elasticsearch clusters at production scale

The Company
HQ: New York, New York
57 Employees
Year Founded: 2015

What We Do

About Us: Create gaming memories while apart. Medal enables you to reliably capture and meaningfully share online memories with friends (memories that would otherwise be lost to time).
