Site Reliability Engineer

Posted 15 Days Ago
Hiring Remotely in Tel Aviv, ISR
Remote or Hybrid
Entry level
Cloud • Security
We develop tools and products to combat modern web and cloud-based threats.
The Role
Seeking a Site Reliability Engineer to enhance production reliability practices, focusing on observability, incident response, and automation for improved user experience and system resilience.
Description

Guardio is on a mission to redefine consumer cybersecurity for the modern internet.

We operate at consumer scale, protecting millions of people every day across devices, accounts, and digital touchpoints. In a world where phishing, fraud, and AI-powered scams evolve overnight, Guardio stays ahead of the curve.

We move fast, think deeply, and build with purpose. Our culture is rooted in transparency, feedback, and collaboration, along with shared wins, team dinners, company trips, and good times.

We’re a team of 100+ makers, doers, and boundary-breakers. If you’re ready to tackle meaningful challenges, grow at lightning speed, and help shape the next frontier of online safety, you belong here.

Let's cut to the chase. What's the job?

We're looking for a Site Reliability Engineer to own and establish Guardio's production reliability practice - across observability, alerting, SLOs, and incident response - and build it to support our next phase of scale. Your work will define how over a million users experience Guardio's product, how our engineers sleep at night, and how we build a production environment that's as resilient as the security product we deliver.

You will:

  • Define SLIs and SLOs with engineering leaders - translate reliability goals into measurable, actionable objectives across our key services. Help teams understand what good looks like in production.
  • Build AI-powered reliability tools - use LLMs and agents to correlate alerts, accelerate root cause analysis, and build a copilot for on-call engineers. AI is your force multiplier.
  • Improve observability across teams - build dashboards, tune alert thresholds, reduce noise, and ensure on-call means getting paged for the right reasons. Make observability actually useful.
  • Design and own on-call - establish our rotation, define escalation policies, write runbooks. Then build automated agents that monitor and begin mitigation before a human is even paged.
  • Automate toil, aggressively (create skills) - identify recurring manual operational work and replace it systematically. Not just scripts - intelligent automation that learns from incidents.
  • Own post-mortems - build a culture of learning from incidents. What broke, why, and what gets built to prevent recurrence.
  • Contribute to the full platform - CI/CD safety, deployment rollback, feature flags. Anything that helps engineers ship faster with less risk to our users.

Sounds great! Am I the right fit?

We're not checking boxes. We're looking for a specific kind of person.

You're probably a great fit if:

  • You're a builder at heart. You don't just operate systems - you build the tools that make systems better. You have something to show: a repo, a demo, a post-mortem you wrote, a system you built because it needed to exist.
  • You have strong software engineering roots. You've written production code. You understand distributed systems, APIs, and failure modes from the inside out.
  • You think in outcomes, not tasks. "I resolved the incident" is not a win. "I reduced MTTR by 50% and prevented the same incident from ever happening again" - that's a win.
  • You're AI-native. You already use AI tools to move faster. You've probably built something with LLM APIs, LangChain, or custom agents. And critically: you know when to verify the output before trusting it.
  • You make good calls under uncertainty. You've been the person in the room when things were broken and the data was unclear. You didn't freeze.

Talk nerdy to me.

Don't mind if we do. Our tech stack:

  • Cloud: GCP - GKE, Pub/Sub, BigQuery, Cloud Functions. We are all-in.
  • CI/CD: GitHub Actions, Terraform + Argo CD
  • Observability: Datadog (our primary)
  • Languages: Python, TypeScript, Go
  • Data: MySQL, Redis, BigQuery, ClickHouse

Thinking of a great addition? Let's do it.

Curious about our stack and how we build things? Check this out

Skills Required

  • Strong software engineering skills
  • Experience with distributed systems and APIs
  • Familiarity with GCP and CI/CD practices
  • Proficiency in Python, TypeScript, or Go
  • Experience with Datadog for observability

The Company
144 Employees
Year Founded: 2018

What We Do

We develop tools and products to combat modern web and browser threats. The Guardio extension now protects over 1M users from phishing, scams, and malicious extensions. Our team blends deep cybersecurity expertise, product, and marketing to bring Guardio protection to as many individuals and SMBs as possible, all while providing a slick and easy user experience.
