Principal Site Reliability Engineer

Posted 15 Days Ago
Trondheim, Trøndelag, NOR
In-Office
Senior level
Artificial Intelligence • Software
The Role
The Principal Site Reliability Engineer will maintain reliability of Vespa Cloud systems, lead incident responses, define SLOs, and improve SRE practices. They will work in an environment that emphasizes automation and collaboration, participating in on-call rotations.

Does it sound interesting to work on an open source platform managing the data and real-time search and inference for some of the largest companies in the world? Would you thrive on keeping large, globally distributed systems reliable, fast, and observable — and on building the practices and tooling that let a small team operate at massive scale? If so, we want you to join our team at Vespa.ai as a Principal Site Reliability Engineer!

About Vespa.ai:
Vespa.ai is a team of passionate builders. We maintain and develop the Apache 2.0 licensed open-source AI search platform Vespa. 

Vespa is a fully featured search engine and vector database. It supports vector search (ANN), lexical search, and structured data search, all in a single query. Integrated machine-learning model inference enables the application of AI to make sense of data in real time. Together with Vespa’s proven scalability and high availability, this empowers teams to create production-ready search applications at any scale and with any combination of features. Our users and customers are #1 in e-commerce, content, and financial services globally, and Vespa is used by companies such as Perplexity, Spotify, Yahoo, Wix, and many more.

In addition to our open-source platform, Vespa.ai develops and runs Vespa Cloud, a robust SaaS offering that allows businesses to harness the power of our technology with ease.

At Vespa.ai, we are extremely focused on automating everything we do to grow fast and maintain high quality. In all roles, we scale through technology, not simply by adding larger teams. We take pride in being small, nimble, and the most productive.

Position overview
At Vespa.ai, we embrace DevOps as a company culture, seeking to solve technical problems with automation and code rather than repetitive manual effort. For our Vespa Cloud production systems, we have had this mindset from day one.

We are seeking a Principal Site Reliability Engineer to join our team and help keep Vespa Cloud reliable, fast, and observable at global scale. This is a senior individual contributor role on the team that operates and improves our production systems. You will also help shape and develop our approach to SRE and DevOps as we grow. We are looking for a strong engineer who earns influence through contributions and has the ambition to take on greater responsibility over time. You will also participate in our 24x7 on-call rotation, approximately every third to fourth week.

At our Trondheim office, we work office-first: you will be based on-site most of the time, with the flexibility to work from home/remotely when needed, as agreed with your manager.

Responsibilities

  • Help ensure the reliability, availability, and performance of Vespa Cloud production systems running globally at scale.
  • Participate in a 24x7 on-call rotation (approximately every 3rd–4th week), lead incident response, and drive blameless postmortems through to durable fixes.
  • Help define and track SLOs/SLIs, and build proactive alerting, capacity planning, and remediation strategies.
  • Design and improve observability — metrics, logging, and tracing — across a large fleet.
  • Eliminate operational toil by solving problems with automation and code rather than manual effort.
  • Contribute to, and help shape, our SRE and DevOps practices and culture as the organization grows, sharing knowledge and mentoring across the team.
  • Work with the rest of the Vespa.ai development team on reliability, scalability, and architecture.

Qualifications

  • 5–10 years building and operating large-scale production systems, with deep SRE/DevOps experience.
  • Solid programming skills in Java, Python, Go, or similar languages.
  • Good understanding of sound software engineering principles and practices.
  • Experience with cloud platforms (AWS, Azure, or GCP).
  • Solid understanding of networking, operating systems, distributed systems, and security principles.
  • Proven incident management and on-call experience.
  • A track record of influencing technical direction and improving how teams work — not just executing tickets.
  • Excellent problem-solving and analytical skills, and the ability to lead through influence as well as work independently.

Desired Skills

  • Experience with Infrastructure as Code tools such as Terraform, OpenTofu, Spacelift, etc.
  • Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry, ELK).
  • Experience with CI/CD tooling such as GitHub Actions, Buildkite, etc.
  • Experience operating data-intensive or stateful systems at scale.
  • Experience defining SLOs and establishing reliability programs.
  • Ambitions beyond pure SRE — an interest in growing, over time, into a technical leadership role.

Some of Our Tools and Services

  • JumpCloud, Google Workspace, and Slack
  • GitHub Enterprise Cloud (including GitHub Actions)
  • Jira Cloud and Jira Service Desk
  • StrongDM, Grafana, Spacelift, and Buildkite
  • AWS, GCP, and Azure

Why Join Us:

  • Opportunities for professional growth and development as part of one of Europe’s most exciting start-ups!
  • Be part of a cutting-edge team working on innovative search and recommendation technology.
  • Work on a team where we don’t believe in silos between engineers; there aren’t “developers”, “ops people”, and “sysadmins”. We’re all engineers solving problems the smart way together!
  • Competitive salary and benefits.

Note: Vespa.ai is an equal-opportunity employer. We are committed to creating an inclusive environment for all employees. We believe in fostering a collaborative and inclusive environment where every team member has the opportunity to make a significant impact.

Skills Required

  • 5-10 years building and operating large-scale production systems
  • Solid programming skills in Java, Python, Go, or similar languages
  • Experience with cloud platforms (AWS, Azure, GCP)
  • Proven incident management and on-call experience
  • Excellent problem-solving and analytical skills

The Company
HQ: Trondheim
0 Employees
Year Founded: 2023

What We Do

Vespa.ai operates Vespa Cloud - used by companies to run Big Data serving with AI, online. We maintain the Vespa open-source project, continuously released and used by organizations with high performance, availability, and functional requirements. We are hiring! See the Jobs page, or visit our website.
