A Practical Guide to User Acceptance Testing (UAT)
When the Portland-based footwear company Hilos launched its 3D-printed shoe line, it was doing something co-founder and CEO Elias Stahl said had never been done before. Sure, people were used to buying shoes online, but having those shoes tailored and customized to spec was uncharted territory.
So, before releasing the product into the wild, Hilos wanted to answer a fundamental question: Would users accept it?
In simplest terms, this is the question user acceptance testing (UAT) seeks to answer. Often the final phase of the software development cycle before a feature is released, user acceptance testing is a way to determine whether people will use a feature the way the product team intended. Are users engaged by it? Does it perform as designers expected? Has anything in the design landed outside the scope of users’ concerns?
“If you’re drilling down on something that is novel and innovative, it’s a cut-and-dry way to determine if this new or novel feature will be accepted by the majority of clients. If it is, how? If not, who is not accepting it and why?” Stahl said.
While it is often conflated with other forms of user testing, most experts agree that user acceptance testing is a desirability check on a narrowly defined feature or piece of functionality.
“The easiest way to define user acceptance testing versus other types of testing is that, in the end, UAT is the process of finding out if a user actually wants or needs a feature,” explained Andrew Wachholz, a user experience design consultant at Designing4UX. “So, while usability tests may go well (people can use [the product] according to how we designed it) and functional tests may go well (we tried to break it, and it didn’t break), if the user rejects the feature when it is available to them, UAT has failed.”
Devising an effective approach to user acceptance testing depends on the maturity and resources of your company, the scope and type of release, your intended audience and your risk tolerance.
We spoke with founders, product managers and UX consultants across the tech community to lay out a strategic framework for planning and conducting user acceptance testing.
7 Steps to User Acceptance Testing
- Determine if it’s worth it.
- Scope and plan.
- Write acceptance criteria.
- Identify test methods and use cases.
- Select users.
- Conduct testing.
- Evaluate results.
Determine If It’s Worth It
UAT can be a useful way to gauge user affinity for a feature, but it is not for everyone, said Drew Falkman, director of product strategy at Los Angeles-based digital transformation consultancy Modus Create. One drawback, he told me, is that it can be time consuming and costly to recruit sample users or marshal the resources to conduct testing internally.
Third-party testing platforms, such as Respondent or UserTesting, can run anywhere from $25 to $150 per participant for a session. And because user acceptance testing is often conducted with small groups of five to 10 users, it yields qualitative results that rarely reach statistical significance.
As an alternative to user acceptance testing — or as an added validation measure — many larger firms are starting to use “feature flagging” to gauge user behavior. With analytics tools like LaunchDarkly and Optimizely, a product owner can launch a feature to a small percentage of users and measure engagement.
“So, for example, if you’re Amazon, you can flip the switch on something, I believe, for two to three minutes,” Wachholz said. “And you will have anywhere between 10,000 and 15,000 individuals that have interacted with it. You’re going to reach statistical significance almost immediately.”
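Under the hood, percentage rollouts like the one Wachholz describes typically bucket each user deterministically, so the same user always sees the same variant. Here is a minimal, generic sketch of that idea in Python — the function name and flag names are illustrative, not LaunchDarkly’s or Optimizely’s actual API:

```python
import hashlib

def in_rollout(user_id: str, flag_name: str, percentage: float) -> bool:
    """Deterministically bucket a user into a feature rollout.

    Hashing the user ID together with the flag name gives every user a
    stable bucket from 0-99, so a given user always sees the same variant.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage

# Expose a hypothetical new checkout flow to 5 percent of users.
exposed = [u for u in ("user-1", "user-2", "user-3")
           if in_rollout(u, "new-checkout", 5)]
```

Because bucketing is a pure function of the user ID, engagement among the exposed group can be measured over repeat visits without users flickering between variants.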
Ana Grouverman, a product lead at Spotify, said product teams occasionally conduct user acceptance testing as a preemptive measure to save face in the event of a slight dip in metrics after a release.
“That’s probably not a good use of resources, because change aversion is a real thing, and you can accept that, maybe, you’ll take a one-week blip,” she told me.
For young companies such as Hilos, however, user acceptance testing can prove invaluable. The company offers shoppers an on-demand, customized experience much like that of visiting a tailor, where choices about material selection, styling and sizing are highly refined.
At first, the company wasn’t sure whether consumers accustomed to a traditional “add to cart” model would embrace the idea. UAT led to insights about choice optimization.
“Out of that experience, we decided we’re not going to offer completely bespoke shoes; we’re going to offer ... a wider size range. So, most shoes have about 10 different sizes, we offer 63,” Stahl said.
UAT also clarified Hilos’ true value to customers, which was not bespoke customization, but the comfort and style of their shoes.
“If we hadn’t done this in a rigorous way, we would have been too much to too many people, and nothing to one,” Stahl said.
Scope and Plan
The scope of UAT should be defined by the user story and feature specifications of what you’ve built, Wachholz told me. Say you’re building a feature for a mobile app to allow users to order pizza and have it delivered at a specified time. That’s the user story, and that’s the end result you’re testing against.
Developing meaningful acceptance criteria will help you create and test all aspects of the build contained within that story. Results of acceptance testing are often binary: A process either passed or it failed.
Typically, a product manager or user experience designer will develop the testing plan, beginning with a set of criteria aligned with feature specifications laid out at the start of the development cycle. In the case of a timed mobile order for pizza, acceptance criteria might look like this:
- Can a user order on iOS? On Android?
- Can they order it for specified times?
- Can they order a small pizza?
- Can they order a pizza with a thin crust, thick crust or cheesy crust?
- Can they order it from 10 miles away?
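Criteria like these translate naturally into pass/fail checks. A minimal sketch, assuming a hypothetical `order_pizza()` stand-in for the real ordering flow (all names and limits here are illustrative):

```python
VALID_SIZES = {"small", "medium", "large"}
VALID_CRUSTS = {"thin", "thick", "cheesy"}
MAX_DELIVERY_MILES = 10  # assumed delivery radius

def order_pizza(platform, size, crust, distance_miles, delivery_time=None):
    """Stand-in for the real ordering flow; returns True if the order is accepted."""
    if platform not in {"ios", "android"}:
        return False
    if size not in VALID_SIZES or crust not in VALID_CRUSTS:
        return False
    if distance_miles > MAX_DELIVERY_MILES:
        return False
    return True

# One check per acceptance criterion; each either passes or fails.
assert order_pizza("ios", "small", "thin", 2)                            # orders on iOS
assert order_pizza("android", "small", "thick", 2)                       # orders on Android
assert order_pizza("ios", "small", "cheesy", 2, delivery_time="18:30")   # timed order
assert order_pizza("ios", "small", "thin", 10)                           # from 10 miles away
assert not order_pizza("ios", "small", "stuffed", 2)                     # unsupported crust fails
```

The binary nature of acceptance testing shows up directly: every criterion either holds or it doesn’t.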
These determinations are intended to flush out overt deficiencies or limitations in the design.
According to Wachholz, planning for user acceptance testing involves vetting users, setting up testing environments, defining if tests are moderated or unmoderated and defining how testers will record the results.
Many smaller companies that don’t have access to a robust set of users conduct UAT internally with their own staffs or teams, he explained. Participants are assigned a list of use case scenarios that, with minimal guidance, they can complete. To ensure the user interacts with the feature in question — and that the interaction generates usable data — some tasks will be more scripted than others.
Design consultancies also provide guidance and technical assistance for user acceptance testing, Falkman said. Modus Create recently put together a four-week validation plan for AARP in preparation for the launch of a new feature in the company’s Money Map app for financial planning and debt management; it was part of an ongoing user research agreement between the companies.
During the first week, Modus Create worked closely with AARP’s product team to plot different paths users might take on the app.
- How would someone who doesn’t have enough money to pay their debts experience the app?
- How about someone who has just enough money?
- How about someone who has plenty of money and just needs to decide which debt to pay down first?
That same week, they outlined recruitment strategies to solicit input from five to 10 users for each path. They also drafted test scripts to provide instructions for participants.
Falkman pointed out that UAT can take weeks to months, depending on the size of the project, but ultimately the scope of the user story is what drives decision making.
“The bottom line is that everything starts at the mapping of user stories. Stories should be well-formed and not technical. Size matters,” he said.
Write Acceptance Criteria
In an email shared with Built In, Falkman explained acceptance criteria like this:
“Acceptance criteria should be born out of thinking through the product, and, if done well, the acceptance criteria should be the base for the QA team to conceive of and write tests. ... We recommend having a ‘definition of ready’ worked out with the team so that it can specify what it needs in order to just grab a story and go. This also ensures that whoever is writing the user stories has time for acceptance testing.”
If the user story for a Money Map customer hinges on their ability to check whether they’ve paid a debt for the current month, acceptance criteria would ensure the functional requirements are met and assess basic optimization and design considerations. If these elements pass muster, the feature is ready for prime time. For example:
- Does the page output data from the current month?
- Does it report if users have or have not paid?
- Is there a way to go back?
- Does it only accept numerical entries?
- Is there a maximum numerical entry?
- Are the font and spacing consistent with the rest of the site?
Most importantly, the acceptance criteria should reflect the user’s point of view.
“Presumably, there’s going to be QA testing as well. So, as a product owner, this is really to make sure that the [feature] works and it’s everything it needs to be, so that I can say, ‘Yes, ship it and move on to the next piece,’” Falkman wrote.
Identify Test Methods and Use Cases
Lauren Chan Lee, a product professional who has led teams at Care.com and StubHub, groups users into three main buckets: consumers, B2B clients and users internal to a company. Each requires a somewhat different approach.
When Care.com was developing a new user flow allowing internal operations teams to create care-center records, the user acceptance test was relatively straightforward: a day-long checklist Lee put together that a member of the operations team tested independently. Did the website update childcare, senior care and other service records as intended?
B2B cases are trickier. Clients can be very invested in feature changes and vocal in expressing their viewpoints. Care.com has a customer advisory council comprising key clients that Lee turns to for feedback on new releases and reports. She allocates a week ahead of a release for members of the council to conduct user acceptance testing.
For a large-scale consumer release, Lee might convene product, engineering, development and design teams for a “bug bash,” in which team members assigned to various features of a website overhaul — the buyer flow or seller flow, for instance — work through user acceptance testing together.
There is an important difference between testing a consumer product versus a piece of third-party software for a large organizational rollout, Grouverman told me. With the latter, you often have a captive audience, so the bar for acceptance is lower. To introduce a human resources system to record employee vacation days, for example, you might roll it out to a portion of employees and ask them if they’re willing to use it and if they encountered any glaring problems.
In other words, users are likely to accept a less-than-perfect payroll feature because, at some level, the decision has already been made. However, for a consumer product like Spotify’s, users must have an affinity for the change and be willing to embrace it quickly.
“Most critically, some subset of [users] has to demonstrate a likelihood to use it in the way that you intended it to be used,” she said.
Select Users
The audience you select for testing depends on what you’ve built and what you’re seeking to learn, Wachholz explained in an email.
“If it’s a new minimum viable product feature, and you want to record engagement rates to indicate whether your team should continue to build the feature further, you might open testing to all users,” he wrote. “If it is an enhancement to an existing feature, it’s best to narrow your test subjects by analyzing usage and frequency. Your goal is to get the most qualified people’s eyes on the new enhancement.”
Ideally, you’re hoping to capture unbiased participants who accurately reflect the ages, demographics, use locations and behaviors of your target users — and who understand the software.
“For example, if it’s some sort of workflow that has to do with managing a bunch of expenses within QuickBooks, we would make sure we recruited people that knew QuickBooks so that we didn’t have to give them an overview on how the software works, because that would totally throw everything off,” Falkman said. “If there were certain steps involved to get the app to the state where they’re using it, we would make sure users understood that, and we would give them an opportunity to ask questions.”
If UAT is conducted in-house in a publicly accessible environment, the cost is fairly negligible — just the time of the team members involved. In this case, the UX team and product manager will distribute a script to internal volunteers, who go through a task list while being observed over video or in person. However, this approach is “the least desirable solution and can lead to biased results, as the team wants the build to succeed, not fail,” Wachholz wrote.
Alternatively, there are several third-party services that allow you to invite and schedule phone interviews, in-person interviews or online research sessions with paid participants. Typically, participants can be pre-vetted to ensure they resemble your target audience. However, there are risks to this approach as well, because paid participants can become invested in giving the type of measured feedback that keeps them desirable as testers. Still, it’s an approach Wachholz recommends for firms that do not have ready access to a large sampling of users.
Conduct Testing
The best time to conduct user acceptance testing tends to be late in the software development cycle, when you have a prototype but haven’t yet sunk resources into making it functional and scalable, Grouverman told me.
While, in practice, testing tends to be a binary process, Grouverman said a more effective approach is to ensure all acceptance criteria lead to feasible possibilities. Unlike Wachholz, she doesn’t consider user acceptance testing a purely binary exercise.
“I would say the first principle is you have to have results that are going to be usable and, in an ideal scenario, you should not be doing acceptance testing to answer a binary ‘yes’ or ‘no’ question,” she said. “Instead, you should be doing it in order to give direction one way or another.”
Whether or not UAT is “binary,” however, may be a matter of semantics. Wachholz points out that user acceptance testing often is applied to evaluate whether a new feature “moves the needle.” For instance, is the experience improved because it takes a user less time to complete a task?
This is what happened when Designing4UX was hired by Contactually (which has since been acquired by Compass) to conduct user acceptance testing on an app for real estate agents to record details of their client interactions.
“It sounds counterintuitive, but we wanted to lower time in app and see an increase in usage,” Wachholz wrote. “Up-front investment in the app on the part of a new user was high (i.e. a barrier to entry). Our goal was to make it super efficient on first sign-in, so [users] spent less time in the app, but the value was significantly higher.”
Results confirmed what Designing4UX hypothesized: Users spent more time on the app when log-in time was reduced. It wasn’t a binary “yes” or “no” so much as an assessment of “good” versus “better.”
This kind of outcome can be generated from an open-ended, Socratic questioning approach. During testing, users are typically given a script and asked to perform certain tasks. Observers might assess where they go in a flow, how much time they spend there and whether they select certain buttons or tabs.
“The idea is to not create bias or direct the subject in a particular way,” Wachholz wrote. “I’ll ask participants to ‘think aloud’ or ‘vocalize their trains of thought’ when they are performing a task. This way, I get more information about what they are thinking, versus silence and movement on a screen.”
Or, as Falkman put it: “Just saying, ‘Here’s where you are, and here’s where you want to finish,’ but not giving any instruction in terms of how to get there is ideal, because you want users to have their own struggles. Sometimes they’ll be like, ‘Well, should I click this button?’ And my answer is, ‘What do you think will happen when you click that button?’”
Evaluate the Results
At the most basic level, UAT should measure the success or failure of key criteria. Hilos’ six-month testing process involved more than 100 test subjects and included surveys, interviews and social and site engagement metrics. It tallied results in a master spreadsheet, which Stahl said continues to evolve. A clear timetable kept the project on track.
“We had very clear deadlines for when we wanted to release the product and why. What will it take to have 95 percent confidence in a positive reaction from these core demographics?” he said. “Setting those goals initially is important, both from a timeline perspective, as well as ‘What is the acceptance threshold you want among key groups?’”
Features are typically built around key performance indicators and business goals. If a feature fails to meet user acceptance criteria, the team needs to review the results and decide what needs to change. Sometimes, it’s a simple UI design tweak, while, in other cases, users flat out reject an entire feature. In the latter case, it’s important to set up a subsequent test to root out deeper issues.
If a feature passes a user acceptance test, it can be put on a release schedule.
“You typically don’t release a feature to 100 percent of the user base immediately. There are too many things that can go wrong — even with all your testing,” Wachholz wrote. “You might begin with 20 percent one week, 40 percent the next week, and so on.”
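A staged rollout like the one Wachholz describes is usually paired with stable user bucketing, so a user who receives the feature in week one keeps it as the percentage grows. A minimal sketch, with an illustrative ramp schedule:

```python
import hashlib

ROLLOUT_SCHEDULE = [20, 40, 60, 80, 100]  # percent of users per week (illustrative)

def bucket(user_id: str) -> int:
    """Stable 0-99 bucket derived from the user ID."""
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

def enabled_in_week(user_id: str, week: int) -> bool:
    """Because buckets are stable and the schedule only grows,
    a user who gets the feature in one week keeps it in later weeks."""
    pct = ROLLOUT_SCHEDULE[min(week, len(ROLLOUT_SCHEDULE) - 1)]
    return bucket(user_id) < pct
```

The monotonically increasing schedule is the key design choice: rolling users backward out of a feature they have already seen is itself a form of the change aversion Grouverman warns about.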
It’s important to keep in mind that results of UAT tend to be qualitative and often need to be substantiated with quantitative testing during a beta release, Grouverman said.
Tools like Hotjar or FullStory record user interactions in a beta environment and validate the findings of user acceptance testing. As usability consultant Jeff Sauro, founding principal of Measuring U, notes on a company blog: “You do not need a sample size in the hundreds or thousands, or even above 30, to use statistics. We regularly compute statistics on small sample sizes (less than 15) and find statistical differences.”
Sauro points out that, while peer-reviewed journals usually deem a result statistically significant based on a p-value less than .05 (meaning there is less than a 5 percent probability an observed result was due to chance), that level of confidence is rarely needed to launch a release. He refers to the “magic number five” as one baseline to guide testing: For easily detectable problems in an interface, a test of just five users will yield “an 85 percent chance of seeing problems if they affect at least 31 percent of the population.”
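The “magic number five” follows from simple probability: the chance that at least one of n testers hits a problem affecting a fraction p of users is 1 − (1 − p)ⁿ. A quick check of Sauro’s figures:

```python
def detection_probability(problem_rate: float, n_users: int) -> float:
    """Chance that at least one of n test users encounters a problem
    that affects a given fraction of the population."""
    return 1 - (1 - problem_rate) ** n_users

# A problem affecting 31 percent of users, tested with five people:
p = detection_probability(0.31, 5)  # roughly 0.84, i.e. about an 85 percent chance
```

Doubling the test group to 10 users pushes detection of the same problem well above 97 percent, which is why diminishing returns set in quickly for obvious defects.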
The bottom line is: The greater the sample size, the higher the probability the results will be reliable. But even if you do not have the time or budget to pull together an ideal sample of users, user acceptance testing can be valuable as a barometer of a feature’s appeal. Think of it as part of an ongoing product development process that will continue to evolve after launch.
“Software companies are always launching new features. It requires ongoing acceptance testing — which, I believe, should come earlier on,” Stahl said. “A lot of companies get calls from customer support. The more they get asked for something, the higher it goes on the list. And then they pump them out. But they don’t actually do a lot of testing; it’s not baked into the R&D process. It’s a very rudimentary kind of upvote or downvote.”
That, according to Stahl, is a mistake.