Senior Site Reliability Engineer (AI-Native)

Paradise Media

Saint Julian's, MLT

In-Office

Senior level

Digital Media • Marketing Tech • Software

The Role

Lead reliability, performance, and security of web platforms while enhancing automation and AI-assisted operations. Own critical production environments and mentor teams.

Paradise Media is a fast-growing performance marketing company behind some of the most successful affiliate and iGaming brands in the world. We run a global network of high-authority sites across casino, sports, and entertainment built on data, experimentation, and top-tier SEO.

We’re a private company with strong capital reserves and no outside investors, making us a stable, independent, and fast-moving place to grow your career. You’ll work directly with the CEO and leadership team, have a real voice in strategy, and see your ideas go live fast.

We’re scaling quickly to become one of the largest privately owned companies in iGaming, with a team where smart, driven people can have a massive impact and build something enduring.

About the role

We are seeking a Senior AI-Native Site Reliability Engineer to lead the reliability, performance, security, automation, and operational maturity of a growing portfolio of high-performing web platforms and digital products.

This role is ideal for a pragmatic senior reliability engineer who can operate and improve production systems, automate repetitive work, use AI safely to accelerate operations, understand performance and security deeply, and communicate clearly during incidents.

You will combine senior-level SRE, DevOps, infrastructure, security, and platform engineering expertise with a modern AI-first approach to operations. You will be expected not only to maintain systems, but to improve how they are designed, monitored, deployed, secured, and operated.

The successful candidate will be comfortable owning critical production environments across varied technology stacks, leading incident response, improving platform resilience, mentoring others, reducing operational toil through automation, and using AI tools responsibly to accelerate analysis, documentation, monitoring, debugging, remediation, and continuous improvement.

Roles & Responsibilities:

Reliability & Operational Ownership

Own uptime, performance, scalability, and resilience of production web platforms and supporting infrastructure.
Define and improve SLIs, SLOs, error budgets, HA, fault tolerance, DR, and graceful degradation.
Lead capacity planning, identify single points of failure, and act as senior technical owner during high-severity incidents.

Performance Engineering & Scalability

Lead optimization across application, infrastructure, database, caching, CDN, and edge layers (Redis, Varnish, Cloudflare or similar).
Establish benchmarks, regression checks, and dashboards; reduce technical bloat across code, dependencies, assets, and infrastructure.
Align performance work with SEO, product, and commercial impact.

AI-Native Operations & Automation

Lead safe, practical AI-assisted workflows for log analysis, incident investigation, runbook creation, monitoring, security triage, and postmortems.
Automate repetitive ops via scripts, IaC, and AI-assisted tooling; build anomaly detection, alert triage, and operational reporting workflows.
Create reusable prompts, playbooks, and templates; define guardrails for data sensitivity, access control, human approval, and auditability.

Monitoring, Observability & Incident Management

Own monitoring/alerting across apps, infra, databases, caches, queues, CDNs, cloud services, and critical user journeys.
Design actionable dashboards and alerts that reduce noise and improve MTTD/MTTR.
Lead incident response, RCA, postmortems, and preventive actions; mentor on troubleshooting and calm communication under pressure.

Security, Resilience & Platform Hardening

Own production security posture: WAF, SSL, vulnerability management, malware/bot mitigation, threat detection, and remediation.
Harden servers, databases, cloud, containers, CI/CD, secrets, and production access; manage secure dependency and patching processes.
Maintain backup, recovery, and DR practices; contribute to security incident response, containment, and prevention.

Infrastructure, Cloud & Platform Engineering

Design and operate hosting/runtime environments across varied stacks (web/app servers, databases, caches, queues, containers, cloud).
Automate backups, updates, deployments, provisioning, and health checks using Ansible, Terraform, Docker, Kubernetes, Jenkins, GitHub Actions, or similar.
Support AWS, GCP, Azure, or modern managed hosting; set infrastructure standards balancing reliability, security, performance, and cost.

DevOps, Release Engineering & Developer Enablement

Lead CI/CD design, staging environments, rollback strategies, progressive delivery, and deployment observability.
Partner with developers to embed reliability, performance, and security into the SDLC; build tooling and runbooks for safer shipping.

Documentation, Collaboration & Technical Leadership

Maintain runbooks, troubleshooting guides, architecture notes, and operational playbooks (AI-assisted where useful, technically validated).
Act as senior technical partner to engineering, product, SEO, and business stakeholders; mentor engineers and shape ops standards.

Requirements:

Preferred Experience

6+ years in SRE, DevOps, Infrastructure, Platform, or Security Engineering.
Operating high-traffic web platforms, SaaS, SEO/content-heavy, affiliate, publishing, media, or e-commerce environments.
Cloudflare, edge caching, WAF, CDN optimization, bot mitigation; AI-assisted ops or agentic engineering workflows.
Leading high-severity incident response; defining SLOs, postmortems, runbooks; FinOps / cloud cost optimization.
Certifications in AWS, GCP, Azure, Linux, Kubernetes, or security are a plus.

Required

Senior-level experience in SRE, DevOps, infrastructure, platform engineering, or production operations.
Proficiency in Python, Bash, PHP, JavaScript/TypeScript, Go, or similar; strong Linux server administration.
Experience with web/app servers, databases, caches, queues, CDNs, cloud (AWS/GCP/Azure), and production traffic flows.
Strong Git, CI/CD, deployment automation, rollback, and release management; solid DNS, SSL, networking, and load balancing fundamentals.
Proven ability to troubleshoot complex production issues using logs, metrics, traces, and profiling, and to own systems without close supervision.

AI-Native Skills

Practical use of AI for debugging, documentation, scripting, analysis, and workflow automation, with strong judgment on validation.
Ability to design safe, human-in-the-loop AI workflows and reusable prompts/playbooks; sound judgment on privacy, access, and data sensitivity.

Performance & Observability

Hands-on with Datadog, New Relic, Grafana, Prometheus, Cloudflare Analytics, OpenTelemetry, Lighthouse, WebPageTest, or similar.
Strong grasp of caching, DB tuning, asset optimization, front-end and back-end performance, edge delivery, and SLOs/SLIs.

Security & Resilience

Production security practices: access control, WAF, vulnerability management, secrets, patching, incident response.
Backup strategy, recovery testing, DR planning; bot mitigation, dependency risk, malware detection, threat monitoring.

Automation & DevOps

Ansible, Terraform, Jenkins, Docker, Kubernetes, GitHub Actions; IaC, containerization, orchestration, configuration management.

Communication & Leadership

Calm incident leadership; clear technical communication to technical and non-technical stakeholders; mentoring and knowledge sharing.

Success in This Role Looks Like

Platforms are faster, more reliable, and more secure; monitoring is actionable and incidents are managed calmly with meaningful follow-up.
Manual work shrinks through automation and AI-assisted workflows; developers ship more safely; risks are caught before they impact the business.

Our Benefits:

We offer a competitive salary and the opportunity to work with a talented and passionate team in a fast-paced, dynamic environment.

The Company

HQ: San Juan, PR

146 Employees

Year Founded: 2019

What We Do

Some companies merely exist in the digital space. We create impactful digital marketing campaigns that empower consumers to make informed choices on the products that shape their lives. At the same time, we help developing companies reach their target audiences in a cost-effective manner, allowing them to grow as quickly as possible. By staying on top of emerging trends in the industry, we ensure that your content always has the maximum impact, for both your business and your customers.