Senior Software Engineer, Infrastructure

Reposted 8 Days Ago
Hiring Remotely in US
Remote
Mid level
Digital Media • Edtech
The Role
Develop and maintain core backend systems and data infrastructure; collaborate with teams on scalable solutions and ensure data quality and reliability.
Summary Generated by Built In
About Us

Epic is the leading digital reading platform for kids ages 12 and under, used by millions of children, families, and educators around the world. With a vast library of high-quality books and learning resources from 250+ of the world’s top publishers, Epic empowers kids to explore their interests, build literacy skills, and develop a lifelong love of reading.

Through personalized recommendations and built-in progress tracking, Epic helps children build confidence and curiosity—while giving parents and educators meaningful insight into each child’s learning journey. As Epic continues to grow, we are reimagining what reading can be through thoughtful technology, data, and global collaboration to make learning more engaging, accessible, and impactful.

Position Summary

The Senior Software Engineer, Infrastructure will play a key role in driving the stability, observability, and overall reliability of Epic's platform as we grow. You are an experienced engineer who works independently on complex infrastructure problems, makes sound technical decisions, and helps raise the bar for the engineers around you. You will own meaningful pieces of our GCP infrastructure, container platform, CI/CD pipelines, and observability stack—setting reliability standards, hardening the systems behind them, and making sure issues are caught early and resolved fast. You will partner closely with both our product engineering and data engineering teams to keep the platforms that power their applications and workflows running reliably.

This is a fully remote, US-based role working closely with a global, bilingual (English–Chinese) engineering team.

Key Responsibilities
  • Drive the stability and reliability of Epic's GCP infrastructure—setting and tracking SLOs/SLIs, reducing toil, and engineering out recurring sources of instability
  • Build and operate Epic's GCP infrastructure for high availability, scalability, and cost efficiency
  • Manage and harden our Docker and GKE container platform, including workload scheduling, autoscaling, networking, and graceful failure handling
  • Maintain and improve CI/CD pipelines that enable fast, safe, low-risk delivery across engineering teams
  • Own and evolve the observability stack—metrics, logs, traces, dashboards, and alerts—so that signals are actionable, noise is low, and on-call has the context to resolve issues quickly
  • Write and maintain Terraform to codify infrastructure across the organization, with a focus on consistency, change safety, and reproducibility
  • Contribute to capacity planning, cost optimization, and architectural reviews, with reliability as a first-class consideration
  • Champion platform security best practices, including secrets management, IAM policies, and network segmentation
  • Support compliance-aware infrastructure practices—vulnerability management, access reviews, audit-evidence flows, and incident-response readiness—as we mature our SOC 2 and student-data compliance programs
  • Partner with data engineering to operate the orchestration platform and supporting infrastructure—deployment, scaling, reliability, and observability
  • Collaborate with backend and data engineers to troubleshoot service and platform issues
  • Lead by example in a frequent on-call rotation; drive incident response, blameless post-mortems, and the follow-through that turns one-time outages into systemic, lasting reliability improvements
  • Provide guidance to developers on infrastructure concerns and best practices
Required Qualifications
  • Bachelor's degree or higher in Computer Science, Software Engineering, or a related field
  • 5+ years of experience in infrastructure, platform, DevOps, or a related engineering role
  • Hands-on experience with GCP (GCE, GCS, VPC, IAM, Cloud Monitoring, and related services)
  • Experience with Docker and Kubernetes (GKE)—containerizing workloads, deploying to GKE, Helm, and cluster fundamentals
  • Experience with CI/CD pipelines (GitHub Actions, ArgoCD, Jenkins, or similar)
  • Experience with an observability platform such as New Relic (metrics, logging, alerting, dashboards)
  • Proficiency in Terraform for managing infrastructure as code
  • Scripting/programming skills in Python, Bash, or similar
  • Comfort participating in a frequent production on-call rotation
  • Track record of measurably improving reliability of production systems—e.g., defining SLOs, reducing incident frequency or MTTR, eliminating recurring failure modes
  • Strong problem-solving skills, sense of ownership, and ability to work effectively in evolving systems
  • Fluency in English for daily collaboration and technical documentation
  • Proficiency in Mandarin Chinese to collaborate effectively with global engineering and business partners
Preferred Skills
  • Experience operating workflow orchestration platforms (e.g., Dagster, Airflow) as a service for data or platform teams
  • Familiarity with the operational footprint of data platforms (warehouse infrastructure, job schedulers, batch workloads)
  • Experience in distributed or global engineering teams
  • Working knowledge of compliance frameworks (e.g., SOC 2, FERPA, COPPA) and GRC tools

Salary: 160K to 200K (bonus included)

Skills Required

  • Bachelor's degree in Computer Science or related field
  • Strong experience working with databases and advanced SQL skills
  • Proficiency in at least one programming language (Python, Scala, or Java)
  • Working knowledge of big data technologies (Hadoop, HDFS, Hive, Spark)
  • Solid understanding of enterprise data warehouse design principles

The Company
HQ: Redwood City, CA
120 Employees
Year Founded: 2014

What We Do

Designed for unlimited discovery and unmatched safety, Epic is the leading digital reading platform for kids. Reaching more than 50 million children in homes and classrooms, Epic fuels curiosity and reading confidence, letting kids explore their interests and learn in a fun, safe, kid-friendly environment. Our award-winning service is built on a collection of 40,000+ popular, high-quality books, audiobooks, and videos from 250+ of the world’s best publishers, including HarperCollins, Macmillan, Sesame Street, National Geographic Kids, and Smithsonian. Epic provides free access to educators and is used by more than 2 million teachers in the classroom. The company was founded by Suren Markosian, founder of several successful technology startups, and Kevin Donahue, a former YouTube, Google, and Disney executive, with the support of top-tier investors and veterans of the children’s publishing industry. It is part of the BYJU’S family of brands, working together to unlock a love of learning around the world. To learn more, visit www.getepic.com, or follow Epic! on Facebook and Twitter.
