Rootly

AI Quality Engineer

Office, Lilongwe, Central Region, MWI

Hybrid

Senior level

Software • Automation

The Role

Design and run prompt-based and adversarial test scenarios for agentic LLM features, build evaluation frameworks and rubrics, identify failure modes and hallucinations, define quality metrics, and partner with Product and Engineering to improve AI behavior before release.

Summary Generated by Built In

About Rootly

At Rootly, we are on a mission to be the go-to way companies respond when things go wrong, helping every organization be more reliable. We do this by building an industry-leading incident management platform that allows companies around the world to resolve incidents quickly and consistently. We are not simply transforming an industry; we are carving out an entirely new $B+ segment ourselves, and we need incredible talent to achieve this ambitious goal together.

Customers love Rootly. Some of the fastest-growing companies around the world, such as NVIDIA, Figma, Canva, Tripadvisor, Squarespace, and more, rely on Rootly to power their critical incident management process. They obsess over our delightful enterprise-ready platform and unique partnership model. See why our customers have rated us 5 stars on G2.

Investors love Rootly. We are backed by some of the most respected funds in the world from Y Combinator to operators like the CTO of Dropbox and GitHub. We'd be happy to disclose our entire funding and profitability picture live during the interview. As a culture, we relentlessly put transparency first. We conduct monthly financial reviews as a team so everyone has a pulse on the health of the business, and we publish what we are building in our weekly changelog.

Rootly is building the AI-native future of incident management, and we need someone who can push our AI to its limits before our customers do. As our AI Quality Engineer, you'll own the evaluation and optimization of Rootly's agentic AI features -- designing test scenarios, running adversarial prompts, interpreting outputs, and working directly with engineering and product to close the loop on performance.

This isn't traditional QA. You'll spend your days thinking like an attacker, a confused user, and a power user all at once -- probing how our AI agents reason, make decisions, and handle edge cases across complex incident workflows.

What you'll do

Design and execute prompt-based test scenarios that cover happy paths, edge cases, and adversarial inputs across Rootly's agentic AI features
Evaluate AI outputs for accuracy, relevance, consistency, and alignment with expected workflow behaviour
Build and maintain an evaluation framework: structured test libraries, scoring rubrics, and regression suites to track AI performance over time
Identify failure modes, hallucinations, reasoning gaps, and unexpected agent behaviours; document findings and work with engineers to resolve them
Partner with Product and Engineering on new AI feature releases, contributing to acceptance criteria and quality gates before launch
Define and track quality metrics (accuracy rates, failure frequency, regression trends) and report findings to stakeholders
Stay current on LLM evaluation techniques, prompt engineering best practices, and agentic testing methodologies

What we're looking for

5+ years in QA, product operations, AI/ML evaluation, or a closely related role
Hands-on experience testing or evaluating LLM-powered or agentic AI products
Strong prompt engineering instincts -- you understand how wording, context, and structure affect model behaviour
Comfortable writing scripts or working with evaluation tools (Python a plus; not required to be a full-stack engineer)
Sharp analytical thinking; you can spot a subtle reasoning failure and articulate exactly why it's a problem
Clear written communicator; able to translate AI behaviour findings for both technical and non-technical audiences

Nice to have

Familiarity with incident management, DevOps, or IT operations workflows is a strong asset
Experience with evaluation frameworks (e.g. LangSmith, PromptFlow, Braintrust, or similar)
Exposure to red-teaming or adversarial testing of AI systems
Comfortable writing E2E tests with Playwright
Background working at a B2B SaaS or developer-tools company
Familiarity with mobile app testing (iOS/Android)

Why Rootly?

We’re not just another startup. We’re building something category-defining and want teammates who crave ownership, love solving hard problems, and thrive in a high-bar, high-impact environment.

Here’s what you can expect when you join Rootly:

Competitive compensation and early equity in a fast-growing, venture-backed company.
Comprehensive medical, dental, and vision coverage.
3 weeks of vacation, plus unlimited sick and mental health days, and a company-wide end-of-year shutdown to recharge.
$500 stipend for home office setup.
Unlimited token usage and access to AI tools.
A fast-moving, high-impact environment where your leadership and ideas directly shape the future of the company.

If this sounds like the kind of challenge and opportunity you’re looking for, apply now and let’s build something great together.

Rootly is an equal opportunity employer. We aim to create an environment where every team member at Rootly feels like they belong so they can have a greater impact on our business and customers. We do not discriminate on the basis of race, religion, colour, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Skills Required

5+ years in QA, product operations, AI/ML evaluation, or a closely related role
Hands-on experience testing or evaluating LLM-powered or agentic AI products
Strong prompt engineering instincts
Comfortable writing scripts or working with evaluation tools (Python a plus)
Ability to spot subtle reasoning failures and clearly articulate issues
Clear written communication for technical and non-technical audiences
Familiarity with incident management, DevOps, or IT operations workflows
Experience with evaluation frameworks (e.g., LangSmith, PromptFlow, Braintrust, or similar)
Exposure to red-teaming or adversarial testing of AI systems
Comfortable writing end-to-end tests with Playwright
Background working at a B2B SaaS or developer-tools company
Familiarity with mobile app testing (iOS/Android)

The Company

HQ: San Francisco, California

35 Employees

What We Do

Rootly is an incident management platform on Slack that helps automate manual admin work during incidents. Leading companies such as NVIDIA, Squarespace, Canva, Grammarly, OpenSea, Figma, and countless others trust Rootly to build a consistent incident response process. See why they rate us 5 stars on G2: https://www.g2.com/products/rootly-manage-incidents-on-slack/reviews