At GoGuardian, we're helping shape the future of digital learning by providing educators, students, and schools with tools to create engaging and equitable learning environments. Together, we build innovative solutions to empower students, deliver insights, and encourage experimentation. With employees around the globe, we're committed to building a culture of inclusivity, curiosity, and courage. GoGuardian's growth is fast and ever-evolving, and our teams are growing along with it - always ready to experiment and learn.
We're here for the cause, but also for the culture. We celebrate our successes and wins together, and make time to appreciate our teammates every day. Take a peek into your future at GoGuardian: our Slack channels include #gardenclub, #boardgametime, #bookrecs, and #petphotos. There's always something fun going on, including concerts, classes, book clubs, and more! From virtual trivia to local meet-ups, Guardians are always finding ways to connect.
The Tech Foundation team's mission is to enable GoGuardian engineers to deliver high-quality products at speed, where quality includes but is not limited to functionality, reliability, availability, security, and privacy. SRE is one of the most critical pillars for ensuring that best practices are employed throughout the engineering development process.
The Senior Site Reliability Engineer is a critical role at GoGuardian as we expand our services to multiple platforms (Windows, macOS, iOS, Android, etc.) across the globe, impacting millions of students and educators every day. The reliability, availability, and performance of our products are critical to our success. You'll join a team of motivated and empowered engineers and work closely with them to drive the adoption of modern reliability practices like SLOs, error budget policies, actionable alerts, incident retrospectives, chaos testing, and end-to-end ownership.
What You'll Do
- Engineering solutions to reduce toil
- Monitoring and metrics: For example, detecting response latency, error or unanswered query rate, and peak utilization of resources
- Emergency response: Running on-call rotations, traffic-dip detection, primary/secondary/escalation, writing playbooks, running Wheels of Misfortune
- Capacity planning: Doing quarterly projections, handling a sudden sustained load spike, running utilization-improvement projects
- Service turn-up and turn-down: For services that run in many locations (e.g., to reduce end-user latency and increase fault tolerance through redundancy), planning location turn-up/down schedules, automating the process to reduce risks and operational load, and managing cloud infrastructure
- Change management: Release/deployment tooling, canarying, 1% experiments, rolling upgrades, quick-fail rollbacks, and measuring error budgets
- Performance: Stress and load testing, resource-usage efficiency monitoring, and optimization
- Data integrity/reliability: Ensuring that non-reconstructible data is stored resiliently and remains highly available for reads, including the ability to rapidly restore it from backups
Who You Are
- 5+ years and/or 3+ previous industry positions with dedicated experience as an operations engineer, DevOps engineer, or SRE supporting SaaS applications in large-scale cloud environments
- Direct involvement in shipping multiple production SaaS applications/products in varied disciplines or verticals
- Proficiency with the following technologies and practices: AWS and/or GCP, ECS and/or Kubernetes (CKA or similar certification preferred), Docker, Terraform
- On-the-job exposure to or experience with the following technologies preferred: Golang, Python, Datadog, Prometheus, MongoDB, Redis, Firebase Realtime Database, BigQuery
- Strong experience with network management (firewalls, proxies, IP management, routing, DNS)
- Experience with authentication protocols such as SAML, OAuth, or OpenID Connect is a plus
- Has software development experience and/or an understanding of programming languages, data structures, and algorithms; writes production-grade code for well-scoped features; integrates feedback from code reviewers
- Learns quickly, applies existing knowledge to new challenges, and is building mastery in relevant technical skills
- Confident in making technical decisions and explaining the reasoning behind them
- Comfortable developing solid technical solutions to ambiguous or open-ended problems
- Driven to teach, lead and help others in areas of strongest skill and experience
What We Offer
- A varied and challenging role in a multinational and highly innovative company
- A robust benefits package including health insurance, 401(k) retirement savings plan with company match, employee stock option plan, paid parental leave, 13 paid company holidays, and much more
- Development and further training opportunities for shaping and realizing your career goals
- Exceptional colleagues with a passion for EdTech
P.S. - Share this with your friends or co-workers who may be interested in working at GoGuardian! We have multiple openings and are always looking for amazing people.
GoGuardian is an equal opportunity employer and makes employment decisions on the basis of merit and business needs. GoGuardian does not discriminate against employees, applicants, interns or volunteers on the basis of race, religion, color, national origin, ancestry, physical disability, mental disability, medical condition, pregnancy, marital status, sex, age, sexual orientation, military and veteran status, registered domestic partner status, genetic information, gender, gender identity, gender expression, or any other characteristic protected by applicable law.