AI Quality Engineer

Office, Lilongwe, Central Region, MWI
Hybrid
Senior level
Software • Automation
The Role
Design and run prompt-based and adversarial test scenarios for agentic LLM features, build evaluation frameworks and rubrics, identify failure modes and hallucinations, define quality metrics, and partner with Product and Engineering to improve AI behavior before release.
About Rootly

At Rootly, we are on a mission to be the go-to way companies respond when things go wrong, helping every organization be more reliable. We do this by building an industry-leading incident management platform that allows companies around the world to consistently and quickly resolve incidents. We are not simply transforming an industry; we are carving out an entirely new $B+ segment ourselves, and we need incredible talent to achieve this ambitious goal together.

Customers love Rootly. Some of the fastest-growing companies around the world, such as NVIDIA, Figma, Canva, Tripadvisor, Squarespace, and more, rely on Rootly to power their critical incident management process. They obsess over our delightful enterprise-ready platform and unique partnership model. See why our customers have reviewed us 5 stars on G2.

Investors love Rootly. We are backed by some of the most respected funds in the world from Y Combinator to operators like the CTO of Dropbox and GitHub. We'd be happy to disclose our entire funding and profitability picture live during the interview. As a culture, we relentlessly put transparency first. We conduct monthly financial reviews as a team so everyone has a pulse on the health of the business, and we publish what we are building in our weekly changelog.

Rootly is building the AI-native future of incident management, and we need someone who can push our AI to its limits before our customers do. As our AI Quality Engineer, you'll own the evaluation and optimization of Rootly's agentic AI features -- designing test scenarios, running adversarial prompts, interpreting outputs, and working directly with engineering and product to close the loop on performance.

This isn't traditional QA. You'll spend your days thinking like an attacker, a confused user, and a power user all at once -- probing how our AI agents reason, make decisions, and handle edge cases across complex incident workflows.

What you'll do

  • Design and execute prompt-based test scenarios that cover happy paths, edge cases, and adversarial inputs across Rootly's agentic AI features
  • Evaluate AI outputs for accuracy, relevance, consistency, and alignment with expected workflow behaviour
  • Build and maintain an evaluation framework: structured test libraries, scoring rubrics, and regression suites to track AI performance over time
  • Identify failure modes, hallucinations, reasoning gaps, and unexpected agent behaviours; document findings and work with engineers to resolve them
  • Partner with Product and Engineering on new AI feature releases, contributing to acceptance criteria and quality gates before launch
  • Define and track quality metrics (accuracy rates, failure frequency, regression trends) and report findings to stakeholders
  • Stay current on LLM evaluation techniques, prompt engineering best practices, and agentic testing methodologies

What we're looking for

  • 5+ years in QA, product operations, AI/ML evaluation, or a closely related role
  • Hands-on experience testing or evaluating LLM-powered or agentic AI products
  • Strong prompt engineering instincts -- you understand how wording, context, and structure affect model behaviour
  • Comfortable writing scripts or working with evaluation tools (Python a plus; not required to be a full-stack engineer)
  • Sharp analytical thinking; you can spot a subtle reasoning failure and articulate exactly why it's a problem
  • Clear written communicator; able to translate AI behaviour findings for both technical and non-technical audiences

Nice to have

  • Familiarity with incident management, DevOps, or IT operations workflows is a strong asset
  • Experience with evaluation frameworks (e.g. LangSmith, PromptFlow, Braintrust, or similar)
  • Exposure to red-teaming or adversarial testing of AI systems
  • Comfortable writing E2E tests with Playwright
  • Background working at a B2B SaaS or developer-tools company
  • Familiarity with mobile app testing (iOS/Android)

Why Rootly?

We’re not just another startup. We’re building something category-defining and want teammates who crave ownership, love solving hard problems, and thrive in a high-bar, high-impact environment.

Here’s what you can expect when you join Rootly:

  • Competitive compensation and early equity in a fast-growing, venture-backed company.
  • Comprehensive medical, dental, and vision coverage.
  • 3 weeks of vacation, plus unlimited sick and mental health days, and a company-wide end-of-year shutdown to recharge.
  • $500 stipend for home office setup.
  • Unlimited token usage and access to AI tools.
  • A fast-moving, high-impact environment where your leadership and ideas directly shape the future of the company.

If this sounds like the kind of challenge and opportunity you’re looking for, apply now and let’s build something great together.

Rootly is an equal opportunity employer. We aim to create an environment where every team member at Rootly feels like they belong so they can have a greater impact on our business and customers. We do not discriminate on the basis of race, religion, colour, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.


The Company
HQ: San Francisco, California
35 Employees

What We Do

Rootly is an incident management platform on Slack that helps automate manual admin work during incidents. Leading companies such as NVIDIA, Squarespace, Canva, Grammarly, OpenSea, Figma, and countless others trust Rootly to build a consistent incident response process. See why they rate us 5 stars on G2: https://www.g2.com/products/rootly-manage-incidents-on-slack/reviews
