Emotion AI Technology Has Great Promise (When Used Responsibly)
You’re at the airport, running late for a flight. You need to speak with an agent quickly, but everyone is occupied, with lines stretching endlessly.
So you go to the robot for help.
The robot assistant answers your questions over the course of a genuine, back-and-forth conversation. And despite the noisy environment, it’s able to register the stress in your voice — along with a multitude of other verbal emotional cues — and modulate its own tone in response.
That scenario, laid out by Rana Gujral, CEO of Behavioral Signals, is still a hypothetical — but it might be reality sooner than you think.
“Within the next five years, you’ll see some really amazing experiences come out.”
“Within the next five years, you’ll see some really amazing experiences come out,” he said.
Gujral isn’t in the robotics or chatbot game, but he is in the business of emotion AI: artificial intelligence that detects and analyzes human emotional signals. His company’s technology maps verbal information — such as tonality, vocal emphasis and speech rhythm — in call-center conversations in order to better match up representatives.
What Is Emotion AI?
Emotion AI isn’t limited to voice. Sentiment analysis — a natural language processing technique — detects and quantifies the emotional tenor of text, be it an individual snippet or a large corpus. The technique has matured to the point that it’s now a common tool in industries ranging from marketing, for product review analysis and recommendation tailoring, to finance, where it can help forecast stock movements.
There are also video signals. Those include facial expression analysis, but also things like gait analysis and gleaning certain physiological signals through video. (A person’s respiration and heart rate can be detected contactlessly, using cameras, under the right conditions.)
At the same time, emotion is fuzzy stuff. And applying some of these technologies to high-consequence situations can be deeply problematic. In fact, researchers at New York University’s AI Now Institute last year called for legislators to “prohibit use of affect recognition in high-stakes decision-making processes.” The most notorious example is a hiring system that uses job candidates’ facial expressions and voice patterns to determine an “employability score.”
“The idea of using facial expressions to assess people in job interviews — it just doesn’t have the backing of science,” said Daniel McDuff, a researcher at Microsoft who studies multimodal affective computing that analyzes facial movements alongside other signals, like physiological levels and body motion, for potential health applications.
That multimodality underscores a key point: our faces alone rarely — if ever — tell the whole story. To what extent facial expressions reliably communicate emotion remains hotly debated.
“I don’t think many would question that facial expressions have information, but to assume a straight mapping between what I express on my face and how I’m feeling inside is often much too simplistic,” added McDuff, who’s also a veteran of the MIT Media Lab’s Affective Computing Group — a pioneer in the field.
With such simultaneous promise and pitfall potential, we asked four experts to walk us through the state of emotion AI — how it works, how it’s being applied today, how it accounts for population differences, and where hard lines should be drawn on its application.
- Seth Grimes - Natural language processing consultant, founder of Alta Plana and creator of the Emotion AI Conference
- Rana Gujral - Chief executive officer of Behavioral Signals
- Skyler Place - Chief behavioral science officer of Cogito
- Daniel McDuff - Principal researcher at Microsoft AI
Here are their thoughts, broken down by category: text-focused, voice-focused and video/multimodal emotion AI.
Text Emotion AI: NLP and Sentiment Analysis
Sentiment analysis refers to the application of natural language processing to text samples in order to determine whether the sentiments expressed are positive or negative, and to what degree. A common application is when companies use the technique to analyze posted reactions to their products or services.
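At its simplest, that polarity scoring can be sketched with a hand-built word lexicon. This is a toy illustration only — production systems use pretrained language models, and the words and weights below are invented:

```python
# Minimal lexicon-based sentiment scorer -- an illustrative sketch only.
# Real systems use pretrained models; these word weights are invented.

LEXICON = {
    "great": 2.0, "love": 2.0, "good": 1.0, "clean": 1.0,
    "bad": -1.0, "dirty": -1.5, "terrible": -2.0, "hate": -2.0,
}
NEGATORS = {"not", "never", "no"}

def sentiment_score(text: str) -> float:
    """Sum word polarities, flipping the sign of the word after a negator."""
    score, flip = 0.0, 1.0
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATORS:
            flip = -1.0          # negate the next sentiment-bearing word
        elif word in LEXICON:
            score += flip * LEXICON[word]
            flip = 1.0
    return score

print(sentiment_score("The room was clean and the staff were great"))  # → 3.0
print(sentiment_score("Not good. The bathroom was dirty"))             # → -2.5
```

A positive score means positive sentiment overall, and the magnitude gives a crude intensity — the same positive/negative-plus-degree framing Grimes describes.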
How Does It Work?
Seth Grimes: A prevailing approach nowadays is using transfer learning for pre-trained models. There’s a huge pre-trained model, but then you do the so-called last-mile training, using your own data. By customizing it for your own uses, by doing that last-mile training, you get the accuracy that you want.
A company might have a solution for the hotel industry that contains a certain set of taxonomies. It understands that a hotel has rooms, service, a restaurant — it understands the structure of the thing being analyzed. But it doesn’t necessarily understand that, say, Hilton Hotels has certain branding, like the reward system is Hilton Honors, and so on. That kind of last-mile training used to be via customizing taxonomies and role sets. Nowadays, it’s transfer learning.
The Complexity of Emotion
Grimes: [Sentiment analysis] is a little bit controversial still, because there are questions about accuracy and usability — whether numbers actually correspond to real-world sentiment. To give an indicator of the complexity, in my presentations I use an image of Kobe Bryant smiling on the basketball court. And I ask, “How does this make you feel?” Well, Kobe Bryant died a year ago in a tragic crash. So [just] because Kobe Bryant is smiling, that doesn’t make you happy. Maybe it’s sad, but sadness about someone who has died is actually a positive sentiment. There’s a lot of complexity and subjectivity.
Clarabridge is a good example [of the technology’s advancement]. They do sentiment analysis, and over the last few years, have moved into emotion categorizations. It can be simple, like happy, sad, angry, or it can be hierarchical with lots of categories — as opposed to just positive or negative. The initial refinement was [just] positive or negative, on a scale of -10 to 10, to capture intensity of feeling — where, for example, “I’m furious” is more intense than “I’m angry.”
How, Exactly, Is Accuracy Defined?
Grimes: Accuracy in this world has two components: precision and recall. Precision — what’s your target? Are you trying to get the [overall] sentiment of a product review, either positive or negative? That’s imprecise. For instance, consider a positive review of a vacation stay. Well, what aspect? Did they like the room, the staff, dining options, location? So you get into what’s called aspect-based sentiment analysis. That’s much more precise [because] it’s more narrowly focused. You might have 99 percent accuracy deciding whether a review is positive or negative. But what’s really actionable is what the review is positive or negative about.
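In classification terms, the two components Grimes names have standard definitions: precision is the fraction of reviews the system labeled positive that really are positive, and recall is the fraction of truly positive reviews it found. A minimal sketch, with invented labels:

```python
# Precision and recall for a binary "positive review" classifier.
# The true/predicted labels below are invented for illustration.

def precision_recall(y_true, y_pred, positive="pos"):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged, how many right?
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real positives, how many found?
    return precision, recall

y_true = ["pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos"]
p, r = precision_recall(y_true, y_pred)
print(p, r)   # both 2/3 here: one false positive, one miss
```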
A few years ago, companies really started creating much more narrowly focused models — a different model for restaurants versus hotels versus consumer electronics. For example, maybe it’s good that my phone is thin. But if someone says the sheets at a hotel were thin, that’s not good. So you get into the notion that a model should be industry-trained or -focused.
“Maybe it’s good that my phone is thin. But if someone says the sheets at a hotel were thin, that’s not good.”
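The aspect-based, domain-trained idea can be sketched as attaching each polarity word to the most recently mentioned aspect. Both vocabularies below are invented; note that in this hotel-domain lexicon “thin” is negative, exactly the word a phone-domain model would score positively:

```python
# Toy aspect-based sentiment: credit each polarity word to the nearest
# preceding aspect term. Both vocabularies are invented, hotel-domain only.

ASPECTS = {"room": "room", "sheets": "room", "staff": "service",
           "restaurant": "dining", "breakfast": "dining"}
POLARITY = {"great": 1, "comfortable": 1, "thin": -1, "rude": -1}

def aspect_sentiment(text: str) -> dict:
    """Map each aspect to a summed polarity score."""
    scores, current = {}, None
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in ASPECTS:
            current = ASPECTS[word]          # remember the active aspect
        elif word in POLARITY and current is not None:
            scores[current] = scores.get(current, 0) + POLARITY[word]
    return scores

print(aspect_sentiment("The staff were great but the sheets were thin"))
# → {'service': 1, 'room': -1}
```

The output is the “actionable” breakdown Grimes describes: not just that a review is mixed, but which aspect was liked and which wasn’t.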
What Avenues Exist for Misuse or Abuse?
Grimes: Here’s a nightmare scenario for text: Maybe we can detect from what a person has written online or to [a text-based crisis line] that they might be suicidal. Well, what if their health insurer got ahold of that information? Or a car insurer takes facial codings from within a car and says: “That person is driving angry; I’m going to raise his rates, because he’s more likely to get into an accident.” Whether these are misuses or not depends on your perspective, but they are potential uses that people are not expecting of emotion data, and they’re potential abuses.
Audio & Voice Emotion AI
Rana Gujral runs Behavioral Signals; Skyler Place leads behavioral science at Cogito. Both companies develop voice emotion AI for call-center environments, but each operates a bit differently. Cogito’s technology is focused on providing real-time feedback to representatives; Behavioral Signals’ tech is geared toward finding the best match between agents and the people they call. Also, Behavioral Signals analyzes vocal information only, not the content of conversations, while Cogito analyzes both.
How Does It Work?
Skyler Place: The market historically has focused on natural language processing and sentiment analysis — and the technology to support NLP accuracy continues to improve. In parallel, there have been organizations, Cogito included, focused on what we call the honest signals, which are everything in the conversation besides the words — energy in the voice, pauses, intonation, the whole variety of signals that help us understand the intentions, goals and emotions that people have in conversations — a very, very rich signal of information.
What we’re starting to see for the first time is the melding of these two data streams. I think that’s going to be the technological leap forward — the ability to really combine the understanding of NLP with the honest signals — that’s going to give us a novel way to understand and improve the emotion in conversations as we go forward.
We have about 200 different signals that we utilize to recognize these behaviors. And then we link the behaviors to the outcomes that are valuable for [call-center] calls. That’s how we think about emotion — less about pure recognition, more about understanding the behaviors that allow you to not just measure, but influence, the emotions in an interaction.
How Do Cultural Differences Affect Readings?
Rana Gujral: For the most part, it’s a challenge to recalibrate the baseline. Take how we express excitement versus anger. They both potentially could be high-pitched, but there are subtle differences. And our brains are obviously very adept at identifying the difference between the two. Even if you’re watching a foreign-language movie without subtitles, you can tell if a character is angry or excited, just by tone. The question is, how do you codify that?
We found that, once you recalibrate that baseline for a new language data set, the variance from the baseline that identifies those specific signals is almost identical across all cultures. We just need 10 to 20 hours of data — a maximum of 50 hours of data — to recalibrate that baseline.
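One common way to implement baseline recalibration of this kind — a sketch, not necessarily Behavioral Signals’ method — is per-dataset normalization: express each acoustic feature as a deviation from that language’s own baseline, so downstream models see relative variance rather than raw values. The pitch numbers below are invented:

```python
# Sketch of per-language baseline recalibration: z-score each feature
# against its own dataset's baseline. Feature values are invented.

from statistics import mean, stdev

def recalibrate(samples):
    """Convert raw feature values to deviations from the dataset baseline."""
    mu, sigma = mean(samples), stdev(samples)
    return [(x - mu) / sigma for x in samples]

# Two hypothetical language datasets with different raw pitch baselines (Hz);
# in each, the last speaker is unusually high-pitched for that language.
lang_a = [180, 190, 185, 240]
lang_b = [120, 130, 125, 180]
print(recalibrate(lang_a))
print(recalibrate(lang_b))
```

After recalibration, the outlier speaker lands at the same deviation score in both datasets — illustrating Gujral’s observation that the variance from the baseline looks nearly identical across cultures even when the raw baselines differ.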
Place: It absolutely is an important issue, and it ties into the potential bias that you see in algorithms. With our approach, the system will work out of the box, because you have this ability to understand how people are speaking — we can measure whether you’re speaking English or Spanish, whether you’re in New York or the Deep South — but the contextual meaning of that speaking rate may be different. So what “good” is may be different, based on the goals of the conversation and the region that [speakers] are from. So we tackle this at a variety of different stages across the product life cycle.
First, when we build our data sets, we measure and account for some of these different variables, making sure we’re representing them in the data. Second, we have human annotators listen to calls and mark up, for example, if a person is speaking too quickly for a particular part of the call. We make sure we pick those individuals from a variety of backgrounds — different genders, ages, cultures, so that we aren’t just getting one perspective of what “good” is for a call.
When we deploy to a client, we go through a calibration period where we’ve listened to hundreds and thousands of calls for that particular culture and particular use case and confirm that the settings are appropriate. That gives an opportunity to turn the dials to make sure we haven’t deployed anything that’s going to be adverse.
The Complexity of Emotion
Gujral: Some attributes are super complex. There’s no system that can completely pinpoint sarcasm. But for the most part, you’re keying off on some essential interaction attributes.
“Some attributes are super complex. There’s no system that can completely pinpoint sarcasm.”
The essential emotions are very measurable — anger, happiness, sadness, frustration and neutrality. Then you’re looking at measuring positivity and arousal — which is a tone change — and behaviors like politeness, engagement, agitation. You can also build some KPIs using the specific domain data and metadata — like measuring quality of interaction or agent performance. They’re all built on these basic sorts of signals. We have quite a few signals. Some we do much more accurately than others. For anger, we can produce a high-90s percentage accuracy, then there are others that are a little bit harder, with more false reads.
What Avenues Exist for Misuse or Abuse?
Gujral: Yale University professor Michael Kraus [noted] in a 2017 research paper that humans are actually really adept at masking emotions in our facial expressions. So I personally feel that implementations based on facial expressions are more problematic. The worst thing there can be for an AI implementation is an inaccurate or ineffective implementation.
I think there are lines that we need to draw. First is a line of privacy and choice. The consumer needs to be aware and willing to partake. That’s very important. And there are other moral boundaries. Last year, when we were fundraising, we had a request from a private company that works on behalf of a government agency in Europe — a friendly European country, but they work for defense systems — and they wanted to apply our intent models to immigration. And we said no. If a choice is going to be made on, say, visa overstay or immigration policies, which could be life-altering, we don’t want to go there. I’m not going to say our technology is not that accurate, but that’s a much higher bar.
What Are the Challenges Ahead for Voice/Audio Emotion AI?
Gujral: Data is always a key challenge — the more, the better, and quality matters tremendously. In a contact-center set-up, you’re sort of guaranteed high-quality data; it’s all recorded over professional recording equipment with channel separation and very little noise. But this year was interesting. We’ve seen an explosion in calls, but many have been poor quality, because oftentimes agents are working from home.
Place: A really interesting technical challenge is the synchrony of signals, which has to do a bit with computational load and the processing of different systems. Because we’re focused on real-time guidance, we’ve put an enormous amount of effort into low-latency code. What’s been interesting, as we start to integrate the nonverbal signal computations with natural language processing, is that the [NLP] has a much longer delay. It takes more computational processing and more time to describe the words that are happening. How you build a product that can combine those two different signals in a way that’s both accurate, but also actionable and timely, is a really interesting design question, as well as a back-end, data-pathways question.
Video and Multimodal Emotion AI
How Does It Work?
Daniel McDuff: What’s made this possible is the investment in camera technology over the past 20 years — cameras that have high-quality sensors with low noise. Using a camera even with a simple signal-processing algorithm, if you analyze skin pixels, you can typically pull out a pulse signal and also detect respiration for a person who’s stationary.
But if you think about needing it to work when people are moving, maybe the lighting is changing, [plus] everyone has different skin pigmentation, facial hair — that’s where deep learning helps to make it more robust to all these sources of noise. The camera is sensitive enough to pick up the initial signal, but that often gets overwhelmed by different variations that are not related to physiological changes. Deep learning helps because it can do a very good job at these complex mappings.
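The simple signal-processing path McDuff describes can be sketched as follows — a synthetic-signal toy, not a working rPPG system: average the green channel over skin pixels each frame, then pick the strongest frequency in the plausible heart-rate band.

```python
# Sketch of camera-based pulse detection (remote photoplethysmography).
# Input is the per-frame mean green value over skin pixels; here it is
# synthesized, with a 1.2 Hz (72 bpm) pulse riding on a constant baseline.

import math

FPS = 30.0  # assumed camera frame rate

def dominant_pulse_hz(green_trace, lo=0.7, hi=4.0, step=0.01):
    """Scan the heart-rate band (42-240 bpm) for the strongest frequency."""
    n = len(green_trace)
    mu = sum(green_trace) / n
    centered = [g - mu for g in green_trace]   # remove the DC baseline
    best_f, best_power = lo, 0.0
    f = lo
    while f <= hi:
        re = sum(c * math.cos(2 * math.pi * f * i / FPS) for i, c in enumerate(centered))
        im = sum(c * math.sin(2 * math.pi * f * i / FPS) for i, c in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
        f += step
    return best_f

# Ten seconds of synthetic mean-green trace at 30 fps:
trace = [100 + 0.5 * math.sin(2 * math.pi * 1.2 * i / FPS) for i in range(300)]
print(round(dominant_pulse_hz(trace) * 60))   # → 72
```

The deep-learning step McDuff mentions is what replaces this naive band scan in practice, making the estimate robust to motion, lighting changes and skin-tone variation that would swamp a clean sinusoid like this one.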
How Do Cultural Differences Affect Readings?
McDuff: There are definitely differences across cultures. We see that in our own analyses of large-scale image data sets, but also in a lot of psychology research, both in self-reported data and in observational measurement.
However, those differences tend to be pretty small when compared to individual differences even within the same culture. For instance, my brother might be more expressive than me by quite a large margin even though we come from the same family. So there are large individual differences even within people from very similar backgrounds.
“There are definitely differences across cultures.... However, those differences tend to be pretty small when compared to individual differences even within the same culture.”
That doesn’t mean that understanding cultural differences isn’t interesting from a scientific and psychological perspective. But when it comes to modeling the data, ultimately that’s just one, in some cases quite small, source of variation, when there are many other large sources of variation — [situational] context, gender, background. The way you’re treated as you grow up influences how you behave, right? If people around you are not that expressive, you might end up not being that expressive.
Studying this is very interesting intellectually, but when it comes to actually putting [models] into practice, often I think the social context of the situation matters way more. If you can control for the context, then you can compare across cultures and see these differences. But if I take a video of someone doing karaoke in Japan and compare it to someone in an office in the United States, the Japanese person is going to seem super expressive. But that’s not really a fair comparison, right? Because the context is so different.
How Applicable Is Video-Based AI at an Individual Level?
McDuff: Many applications at the moment look at group responses, [such as] measuring how much people smile when they watch an advertisement. They typically don’t do that for one individual; they do that for 30 or 40 individuals, then average the data. Marketing folk find that data useful, and they often combine it with self-reported and other data. That’s fine. But in that case, they’re not really interpreting each individual’s underlying experience that much. They’re looking at what’s observed on the face and using that as a quantitative measure of responses to the content.
But when you get into something like health, you’re often looking at individuals’ behaviors, trying to understand if what you’re observing are perhaps symptoms, or effects of a medication. For instance, you may have someone with Parkinson’s disease taking medication that controls tremors. Say you were using video to track how quickly the medication wears off. You need to personalize that because different individuals might have different magnitudes and manifestations. So understanding a group-level signal is not that useful, because you’re trying to target an individual.
Most of the applications I’m interested in do involve this personalization. If we’re tracking heart rate variability, we need to know your baseline. What’s abnormal for you might be totally normal for someone else. Understanding a population can be helpful in some cases, but you’re limited in what you can do.
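The personal-baseline idea can be sketched as a simple outlier test against an individual’s own history rather than a population norm. The heart-rate-variability numbers and threshold below are invented for illustration:

```python
# Sketch of per-person baselining: the same reading can be abnormal for
# one person and routine for another. Data and threshold are invented.

from statistics import mean, stdev

def is_abnormal(reading, personal_history, k=2.0):
    """Flag a reading more than k standard deviations from this person's baseline."""
    mu, sigma = mean(personal_history), stdev(personal_history)
    return abs(reading - mu) > k * sigma

runner = [80, 85, 78, 82, 84]   # hypothetical high-HRV baseline (ms)
desk   = [40, 42, 38, 41, 39]   # hypothetical lower baseline (ms)

print(is_abnormal(42, runner))  # → True: far below this person's baseline
print(is_abnormal(42, desk))    # → False: entirely typical for this person
```

The identical reading triggers a flag for one person and none for the other — McDuff’s point that what’s abnormal for you might be totally normal for someone else.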
How Far Are We from Meaningful, Positive Impact?
McDuff: Fitbit already exposes stress metrics to the wearer, so you can see your heart-rate variability over time now. Whether you do that from a wearable or a camera or another device, that’s not a big leap. That’s just changing the sensor. The big leap is going from measurement to something that’s actually useful. With step counting, it’s easier because, say, if I only walked 1,000 steps, maybe tomorrow I’m prompted to go for a walk. But with stress, the intervention is a bit less clear.
I think we’re still a little bit away from moving from tracking toward the real utility. That could be several years because, as you get into these more complex things, like stress, personalization of the intervention is important. So we need to study how we can turn this tracking data — which I think is not exactly a solved problem, but there’s lots of data around us, so if we need data, that’s not usually the roadblock — turn it into insight that helps improve the utility that people get from the system.
What Avenues Exist for Misuse or Abuse?
McDuff: There needs to be at least good solid evidence that these signals matter in the [given] context. That’s always the first thing I would look for when thinking about a solution.
Another important consideration is the population. AI is often used to automate, and often, the population it’s applied to tends to be vulnerable. If you’re applying for a job, you’re vulnerable. The company is in power because they can choose to hire you or not. The same could be said of children in a classroom — students are more vulnerable than their teachers. It’s important to think about whether the technology is being used in a way that benefits someone who already has more power and could potentially make the process less transparent and more difficult.
“AI is often used to automate, and often, the population it’s applied to tends to be vulnerable.”
Another example is exam proctoring. Now that everyone’s remote, people are using systems to check if people are cheating. Well, that’s problematic, because an algorithm just can’t detect all the nuances, right? And you could really harm someone by labeling them a cheat. That doesn’t seem like a well-thought-out solution — just applying machine learning to that problem.
Mapping Isn’t One-to-One Across People
McDuff: There’s not a one-to-one mapping between our physiological state and our facial expressions that generalizes across everyone. Some people like it when their bodies are revved up. Adrenaline junkies feel happy when that happens, but some people get really nervous at the same type of physiological change. The exact same physiological change may be labeled completely differently by two different people. That’s a huge problem, but it doesn’t mean that things we’re able to sense about people are not useful. We just need to do it in a way that’s personalized and adaptive and brings humans into the loop to teach machines.
One solution might be that every person who uses a system teaches the machine what it means for them when it observes particular facial expressions, or particular physiological changes. You coach the machine, and, over time, it learns the mapping between your behaviors and how you feel. That, I think, is the biggest question in terms of practically making this technology effective and also safer. You don’t want to use a blunt tool. Yes, there is a broad mapping that [says] if I smile, I’m probably feeling positive. But there are many exceptions. That might be the modal label, but there’s huge variance. So this personalization is the huge question — and it’s not answered perfectly. There are ways to baseline and calibrate things, but there’s a long way to go.
Responses edited for length and clarity.