Senior Site Reliability Engineer for Datacraft team

Reposted 4 Days Ago
Hiring Remotely in Slovakia
Remote
42K-52K Annually
Senior level
Software
The Role
As a Senior Site Reliability Engineer at Bloomreach, you will oversee and enhance the reliability of the data platform, ensuring observability and efficient operations of data services. Responsibilities include building ecosystems for data engineers, managing infrastructure, and supporting incident resolution while developing a culture of reliability and responsiveness across the infrastructure.
Summary Generated by Built In
Bloomreach is building the world’s premier agentic platform for personalization. We’re revolutionizing how businesses connect with their customers, building and deploying AI agents to personalize the entire customer journey.
  • We're taking autonomous search mainstream, making product discovery more intuitive and conversational for customers, and more profitable for businesses.
  • We’re making conversational shopping a reality, connecting every shopper with tailored guidance and product expertise — available on demand, at every touchpoint in their journey.
  • We're designing the future of autonomous marketing, taking the work out of workflows, and reclaiming the creative, strategic, and customer-first work marketers were always meant to do.
And we're building all of that on the intelligence of a single AI engine — Loomi AI — so that personalization isn't only autonomous…it's also consistent. From retail to financial services, hospitality to gaming, businesses use Bloomreach to drive higher growth and lasting loyalty. We power personalization for more than 1,400 global brands, including American Eagle, Sonepar, and Pandora.
Become a Senior SRE for Bloomreach!

Join the newly formed Datacraft team — the team building the next-generation data platform for Bloomreach Engagement. Datacraft owns three interconnected domains:

  • Data Warehouses (~60%) — making Bloomreach data first-class in customer DWHs (Snowflake, BigQuery, Databricks). The strategic goal for 2026–27 is to use DWHs to exponentially accelerate data adoption.
  • Loomi Analytics Agent (~20%) — evolving Loomi Analytics into an agentic analytics assistant that can explore data across systems, explain insights, and act on them.
  • Dashboards & Analytics Stack (~20%) — moving Engagement reporting onto DWH-backed, modern analytics stacks (semantic layers, headless BI tools).

As a Senior SRE, you will be the reliability backbone of this AI-first data team. Your work will directly impact the deployment, reliability, and observability of the pipelines and services that hundreds of enterprise customers depend on — from data exports into Databricks and BigQuery to the AI agent Loomi uses to surface insights.

Datacraft is an AI-first team. We believe code is a commodity and expect every engineer to fluently use coding agents (e.g., Cursor, Claude Code, Copilot, Gemini CLI) as a core part of their daily workflow. The ability to leverage AI tooling to accelerate development, prototyping, and problem-solving is not optional — it's foundational. Working in one of our Central European offices (Bratislava, Praha, Brno) or from home on a full-time basis, you'll become a core part of the Engineering team.

What challenge awaits you?

As a P3 (Senior) SRE at Bloomreach, you are an independent professional — expert in reliability engineering, able to decompose objectives into actionable infrastructure improvements, and lead initiatives end-to-end with minimal day-to-day guidance.

We need you to build and operate an ecosystem where data engineers can safely and efficiently develop, debug, and operate data-intensive jobs and services — spanning Kafka ingest pipelines, Iceberg data lakes, multi-DWH exports, Databricks deployment and orchestration (Airflow / Cloud Composer), and agentic AI workloads.

Your responsibilities

a. Platform reliability & observability
  • Build and maintain the reliability ecosystem where engineers can safely develop, debug, and operate DataCraft services running on GCP and Kubernetes (DataProc, Cloud Composer, BigQuery, Snowflake/Databricks connectors).
  • Ensure end-to-end observability across the full data platform — from Kafka ingest through GCS/Iceberg staging and Airflow orchestration to Databricks and BigQuery destinations — enabling the team to catch missing loads, SLA breaches, and data drift before customers notice or costs drift.
  • Drive scalability so services can scale vertically and horizontally based on operational and telemetry data (OpenTelemetry, Prometheus, Victoria Metrics).
  • Maintain team health dashboards and alerting (Grafana, PagerDuty, Sentry).
b. Infrastructure as Code & deployments
  • Own and evolve Terraform-based infrastructure for DataCraft services.
  • Automate deployments, instance setup, and operational runbooks to eliminate manual/semi-manual steps.
  • Maintain CI/CD pipelines (GitLab) with linters, security scans, code quality checks, and AI code reviews, enabling engineers to produce high-quality MRs.
c. Security & compliance
  • Help the team fulfill security requirements for ISO and SOC2 audits by enforcing security principles: key distribution, key rotation, authorization & authentication at the service level, data encryption in transit, data isolation, resource limitations, and audit logs.
  • Ensure data access controls are properly enforced across multi-DWH environments (BigQuery, Snowflake, Databricks).
d. Incident management & L3 support
  • Participate in and drive L3 on-call rotation and incident resolution for DataCraft services.
  • Contribute tooling for debugging, troubleshooting, and performance testing of data pipelines and orchestration layers.
  • Use telemetry data and distributed tracing to navigate complex, distributed service architectures.
e. Agentic platform reliability
  • Ensure reliability and observability of the Loomi Analytics Agent data infrastructure — LLM API gateway performance, MCP server health, and evaluation pipeline availability.
  • Monitor and alert on data quality issues that could introduce inconsistencies or hallucinations in Loomi's responses — making the agent's data access patterns reliable and debuggable.
Our tech stack

  • Languages: Python (primary), Go, SQL
  • Messaging & streaming: Apache Kafka
  • Storage & databases: Databricks, BigQuery, Apache Iceberg, GCS, Mongo, Redis
  • Data processing & orchestration: Apache Spark, DataFlow, Airflow / Cloud Composer
  • Infrastructure: GCP, Kubernetes, Terraform
  • AI / Agentic: LLM APIs, MCP, agent orchestration frameworks
  • Observability: Grafana, Prometheus, Victoria Metrics, PagerDuty, Sentry, OpenTelemetry
  • CI/CD & tooling: GitLab, Jira, Confluence
  • AI coding agents: Cursor, Claude Code

Your qualifications

Professional experience

Impact

  • You can articulate how your contributions transformed the way engineers work and fostered a strong SRE/DevOps culture.
  • You can demonstrate how impactful reliability work connects to business success and customer outcomes.

Ownership

  • You embrace the "you build it, you run it" principle — you love owning what you ship.
  • You are cost-aware: effective vertical and horizontal autoscaling and detailed telemetry insights are how you demonstrate mindfulness of cloud spend.

Systematic approach

  • You believe Infrastructure as Code is the only thing that brings stability into chaos.
  • You design for failure: SLOs, error budgets, and runbooks are first-class artifacts, not afterthoughts.

Data-driven

  • You use telemetry and metrics to give engineers actionable feedback on how applications and services behave.
  • You can navigate complex data platform architectures using distributed tracing and debugging.

Technical skills

  • Solid hands-on experience with GCP (BigQuery, DataProc, Cloud Composer, GCS) and Kubernetes.
  • Experience with Python; Go is a strong advantage.
  • Familiarity with data pipeline technologies (Kafka, Airflow/Cloud Composer, Spark, Iceberg) — you don't need to write ETL code, but you need to operate it reliably and know when something is wrong.
  • Fluent use of AI coding agents (Cursor, Claude Code, Copilot, Gemini CLI, or similar) — you already use these tools daily to accelerate work.
  • Comfortable with on-call rotation and 24/7 incident response.
  • Remote-first mindset — you know how to be effective in distributed teams.
  • You are able to learn and adapt — essential when exploring new tech or navigating our growing codebase.
Strongly preferred
  • Experience operating multi-DWH environments (Snowflake, Databricks, or BigQuery).
  • Familiarity with agentic/LLM workloads — API reliability, latency SLOs, trace observability for AI systems.
  • Experience with open table formats (Iceberg, Delta Lake) in production environments.
  • Exposure to data security and compliance in the context of customer-facing DWH integrations (consent, data retention, PII handling).
Personal qualities
  • Ownership & accountability — you take issues from detection through to resolution and follow-up prevention.
  • Systematic thinking — you identify root causes, not symptoms, and document your findings so the team learns.
  • Collaboration & communication — you explain trade-offs and constraints clearly to both engineers and non-engineers.
  • Bias for reliability — operational excellence (SLOs, on-call friendliness, proactive alerting) is not a chore; it's your craft.
  • Continuous improvement mindset — you are comfortable iterating, revisiting assumptions, and improving incrementally.
  • Comfortable operating remote-first in a distributed team across Central Europe.
Your success story

In 30 days:

  • Get to know the DataCraft team, the company, and the most important processes.
  • Set up your local and GCP development environment and complete the Engagement engineering onboarding.
  • Understand the current state of DataCraft services: pipelines, orchestration, observability gaps, and on-call runbooks.

In 90 days:

  • Start contributing to the L3 on-call rotation, handling incidents, troubleshooting, and debugging — which will sharpen your understanding of the platform and surface fresh improvement ideas.
  • Deliver your first meaningful reliability improvement: an observability enhancement, a deployment automation, or an SLO definition for a key DataCraft service.

In 180 days:

  • Own the reliability posture of at least one DataCraft domain end-to-end — able to independently design, operate, and continuously improve it.
  • Drive measurable improvements in MTTR, alert signal-to-noise ratio, or deployment confidence across the team.
  • Be a trusted reliability partner in architecture discussions — your input shapes how new DataCraft services are designed for operability from day one.

The pay range actually offered will take into account a variety of potential factors considered in compensation, including but not limited to skills, qualifications, geographic location, accomplishments, experience, credentials, internal equity and business needs, and may vary from the range listed above.

Base Salary Range
€41.600 - €52.000 EUR
More things you'll like about Bloomreach:

Culture:
  • A great deal of freedom and trust. At Bloomreach we don’t clock in and out, and we have neither corporate rules nor long approval processes. This freedom goes hand in hand with responsibility. We are interested in results from day one. 
  • We have defined our 5 values and the 10 underlying key behaviors that we strongly believe in. We can only succeed if everyone lives these behaviors day to day. We've embedded them in our processes like recruitment, onboarding, feedback, personal development, performance review and internal communication. 
  • We believe in flexible working hours to accommodate your working style.
  • We work virtual-first with several Bloomreach Hubs available across three continents.
  • We organize company events to experience the global spirit of the company and get excited about what's ahead.
  • We encourage and support our employees to engage in volunteering activities - every Bloomreacher can take 5 paid days off to volunteer*.
  • The Bloomreach Glassdoor page elaborates on our stellar 4.6/5 rating. The Culture score on the Bloomreach Comparably page is even higher at 4.9/5.
Personal Development:
  • We have a People Development Program, with personal development workshops on various topics run by experts from inside the company. We are continuously developing & updating competency maps for select functions.
  • Our resident communication coach Ivo Večeřa is available to help navigate work-related communications & decision-making challenges.*
  • Our managers are strongly encouraged to participate in the Leader Development Program to develop in the areas we consider essential for any leader. The program includes regular comprehensive feedback, consultations with a coach and follow-up check-ins.
  • Bloomreachers utilize the $1,500 professional education budget on an annual basis to purchase education products (books, courses, certifications, etc.)*
Well-being:
  • The Employee Assistance Program -- with counselors -- is available for non-work-related challenges.*
  • Subscription to Calm - sleep and meditation app.*
  • We organize ‘DisConnect’ days where Bloomreachers globally enjoy one additional day off each quarter, allowing us to unwind together and focus on activities away from the screen with our loved ones.
  • We facilitate sports, yoga, and meditation opportunities for each other.
  • Extended parental leave up to 26 calendar weeks for Primary Caregivers.*
Compensation:
  • Restricted Stock Units or Stock Options are granted depending on a team member’s role, seniority, and location.*
  • Everyone gets to participate in the company's success through the company performance bonus.*
  • We offer an employee referral bonus of up to $3,000 paid out immediately after the new hire starts.
  • We reward & celebrate work anniversaries -- Bloomversaries!*

(*Subject to employment type. Interns are exempt from marked benefits, usually for the first 6 months.)

Excited? Join us and transform the future of commerce experiences!

If this position doesn't suit you, but you know someone who might be a great fit, share it - we will be very grateful!

Any unsolicited resumes/candidate profiles submitted through our website or to personal email accounts of employees of Bloomreach are considered property of Bloomreach and are not subject to payment of agency fees.

Bloomreach Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Bloomreach and has not been reviewed or approved by Bloomreach.

  • Fair & Transparent Compensation: Pay is considered competitive versus peers and aligned with tech-market norms for comparable roles and levels. Many roles cite compensation that feels fair and market-appropriate.
  • Strong & Reliable Incentives: Company-wide performance bonuses follow a semi-annual cadence, and go-to-market roles feature structured base/OTE plans. This predictable incentive design meaningfully augments base pay.
  • Leave & Time Off Breadth: Quarterly company-wide DisConnect Days, generous PTO practices, and paid volunteer time expand time off beyond standard holidays. These scheduled shutdowns are designed to enable genuine unplugging in a remote-first setup.

The Company
HQ: Mountain View, CA
600 Employees
Year Founded: 2009

What We Do

Bloomreach is the leader in Commerce Experience™. Our Bloomreach Experience Platform (brX) competes in three core categories: Engagement (CDP and marketing automation), Content (headless content and experience management), and Discovery (e-commerce search, merchandising, recommendations, and SEO). We connect both customer data and product data to personalize all customer touch-points, leveraging our patented AI to recommend, predict, and segment. This empowers the marketer to create individual experiences, increase revenue, strengthen customer loyalty, and improve efficiency. With a global footprint, Bloomreach powers over 25% of all e-commerce experiences across the US and UK, and supports 300+ global enterprises including Neiman Marcus, CapitalOne, Staples, NHS Digital, Bosch, Puma, and Marks & Spencer.
