Emotion AI Technology Has Great Promise (When Used Responsibly)
You’re at the airport, running late for a flight. You need to speak with an agent quickly, but everyone is occupied, with lines stretching endlessly.
So you go to the robot for help.
The robot assistant answers your questions over the course of a genuine, back-and-forth conversation. And despite the noisy environment, it’s able to register the stress in your voice — along with a multitude of other verbal emotional cues — and modulate its own tone in response.
That scenario, laid out by Rana Gujral, CEO of Behavioral Signals, is still a hypothetical — but it might be reality sooner than you think.
“Within the next five years, you’ll see some really amazing experiences come out.”
“Within the next five years, you’ll see some really amazing experiences come out,” he said.
Gujral isn’t in the robotics or chatbot game, but he is in the business of emotion AI: artificial intelligence that detects and analyzes human emotional signals. His company’s technology maps verbal information — such as tonality, vocal emphasis and speech rhythm — in call-center conversations in order to better match up representatives.
What Is Emotion AI?
Emotion AI isn’t limited to voice. Sentiment analysis — a natural language processing technique — detects and quantifies the emotional tenor of text, be it an individual snippet or a large corpus. The technique has matured to the point that it’s now a common tool in industries ranging from marketing, for product review analysis and recommendation tailoring, to finance, where it can help forecast stock movements.
There are also video signals. Those include facial expression analysis, but also things like gait analysis and gleaning certain physiological signals through video. (A person’s respiration and heart rate can be detected contactlessly, using cameras, under the right conditions.)
At the same time, emotion is fuzzy stuff. And applying some of these technologies to high-consequence situations can be deeply problematic. In fact, researchers at New York University’s AI Now Institute last year called for legislators to “prohibit use of affect recognition in high-stakes decision-making processes.” The most notorious example is a hiring system that uses job candidates’ facial expressions and voice patterns to determine an “employability score.”
“The idea of using facial expressions to assess people in job interviews — it just doesn’t have the backing of science,” said Daniel McDuff, a researcher at Microsoft who studies multimodal affective computing that analyzes facial movements alongside other signals, like physiological levels and body motion, for potential health applications.
That multimodality underscores a key point: our faces alone rarely — if ever — tell the whole story. To what extent facial expressions reliably communicate emotion remains hotly debated.
“I don’t think many would question that facial expressions have information, but to assume a straight mapping between what I express on my face and how I’m feeling inside is often much too simplistic,” added McDuff, who’s also a veteran of the MIT Media Lab’s Affective Computing Group — a pioneer in the field.
With such simultaneous promise and pitfall potential, we asked four experts to walk us through the state of emotion AI — how it works, how it’s being applied today, how it accounts for population differences, and where hard lines should be drawn on its application.
- Seth Grimes - Natural language processing consultant, founder of Alta Plana and creator of the Emotion AI Conference
- Rana Gujral - Chief executive officer of Behavioral Signals
- Skyler Place - Chief behavioral science officer of Cogito
- Daniel McDuff - Principal researcher at Microsoft AI
Here are their thoughts, broken down by category: text-focused, voice-focused and video/multimodal emotion AI.
Text Emotion AI: NLP and Sentiment Analysis
Sentiment analysis refers to the application of natural language processing to text samples in order to determine whether the sentiments expressed are positive or negative, and to what degree. A common application is when companies use the technique to analyze posted reactions to their products or services.
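At its simplest, that polarity scoring can be sketched with a hand-built word lexicon. This is a toy illustration only — production systems use pretrained language models, and the words and weights below are invented:

```python
# Minimal lexicon-based sentiment scorer -- an illustrative sketch only.
# Real systems use pretrained models; these word weights are invented.

LEXICON = {
    "great": 2.0, "love": 2.0, "good": 1.0, "clean": 1.0,
    "bad": -1.0, "dirty": -1.5, "terrible": -2.0, "hate": -2.0,
}
NEGATORS = {"not", "never", "no"}

def sentiment_score(text: str) -> float:
    """Sum word polarities, flipping the sign of the word after a negator."""
    score, flip = 0.0, 1.0
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATORS:
            flip = -1.0          # negate the next sentiment-bearing word
        elif word in LEXICON:
            score += flip * LEXICON[word]
            flip = 1.0
    return score

print(sentiment_score("The room was clean and the staff were great"))  # → 3.0
print(sentiment_score("Not good. The bathroom was dirty"))             # → -2.5
```

A positive score means positive sentiment overall, and the magnitude gives a crude intensity — the same positive/negative-plus-degree framing Grimes describes.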
How Does It Work?
Seth Grimes: A prevailing approach nowadays is using transfer learning for pre-trained models. There’s a huge pre-trained model, but then you do the so-called last-mile training, using your own data. By customizing it for your own uses, by doing that last-mile training, you get the accuracy that you want.
A company might have a solution for the hotel industry that contains a certain set of taxonomies. It understands that a hotel has rooms, service, a restaurant — it understands the structure of the thing being analyzed. But it doesn’t necessarily understand that, say, Hilton Hotels has certain branding, like the reward system is Hilton Honors, and so on. That kind of last-mile training used to be via customizing taxonomies and role sets. Nowadays, it’s transfer learning.
The Complexity of Emotion
Grimes: [Sentiment analysis] is a little bit controversial still, because there are questions about accuracy and usability — whether numbers actually correspond to real-world sentiment. To give an indicator of the complexity, in my presentations I use an image of Kobe Bryant smiling on the basketball court. And I ask, “How does this make you feel?” Well, Kobe Bryant died a year ago in a tragic crash. So [just] because Kobe Bryant is smiling, that doesn’t make you happy. Maybe it’s sad, but sadness about someone who has died is actually a positive sentiment. There’s a lot of complexity and subjectivity.
Clarabridge is a good example [of the technology’s advancement]. They do sentiment analysis, and over the last few years, have moved into emotion categorizations. It can be simple, like happy, sad, angry, or it can be hierarchical with lots of categories — as opposed to just positive or negative. The initial refinement was [just] positive or negative, on a scale of -10 to 10, to capture intensity of feeling — where, for example, “I’m furious” is more intense than “I’m angry.”
How, Exactly, Is Accuracy Defined?
Grimes: Accuracy in this world has two components: precision and recall. Precision — what’s your target? Are you trying to get the [overall] sentiment of a product review, either positive or negative? That’s imprecise. For instance, consider a positive review of a vacation stay. Well, what aspect? Did they like the room, the staff, dining options, location? So you get into what’s called aspect-based sentiment analysis. That’s much more precise [because] it’s more narrowly focused. You might have 99 percent accuracy deciding whether a review is positive or negative. But what’s really actionable is what the review is positive or negative about.
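In classification terms, the two components Grimes names have standard definitions: precision is the fraction of reviews the system labeled positive that really are positive, and recall is the fraction of truly positive reviews it found. A minimal sketch, with invented labels:

```python
# Precision and recall for a binary "positive review" classifier.
# The true/predicted labels below are invented for illustration.

def precision_recall(y_true, y_pred, positive="pos"):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged, how many right?
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real positives, how many found?
    return precision, recall

y_true = ["pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos"]
p, r = precision_recall(y_true, y_pred)
print(p, r)   # both 2/3 here: one false positive, one miss
```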
A few years ago, companies really started creating much more narrowly focused models — a different model for restaurants versus hotels versus consumer electronics. For example, maybe it’s good that my phone is thin. But if someone says the sheets at a hotel were thin, that’s not good. So you get into the notion that a model should be industry-trained or -focused.
“Maybe it’s good that my phone is thin. But if someone says the sheets at a hotel were thin, that’s not good.”
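The aspect-based, domain-trained idea can be sketched as attaching each polarity word to the most recently mentioned aspect. Both vocabularies below are invented; note that in this hotel-domain lexicon “thin” is negative, exactly the word a phone-domain model would score positively:

```python
# Toy aspect-based sentiment: credit each polarity word to the nearest
# preceding aspect term. Both vocabularies are invented, hotel-domain only.

ASPECTS = {"room": "room", "sheets": "room", "staff": "service",
           "restaurant": "dining", "breakfast": "dining"}
POLARITY = {"great": 1, "comfortable": 1, "thin": -1, "rude": -1}

def aspect_sentiment(text: str) -> dict:
    """Map each aspect to a summed polarity score."""
    scores, current = {}, None
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in ASPECTS:
            current = ASPECTS[word]          # remember the active aspect
        elif word in POLARITY and current is not None:
            scores[current] = scores.get(current, 0) + POLARITY[word]
    return scores

print(aspect_sentiment("The staff were great but the sheets were thin"))
# → {'service': 1, 'room': -1}
```

The output is the “actionable” breakdown Grimes describes: not just that a review is mixed, but which aspect was liked and which wasn’t.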
What Avenues Exist for Misuse or Abuse?
Grimes: Here’s a nightmare scenario for text: Maybe we can detect from what a person has written online or to [a text-based crisis line] that they might be suicidal. Well, what if their health insurer got ahold of that information? Or a car insurer takes facial codings from within a car and says: “That person is driving angry; I’m going to raise his rates, because he’s more likely to get into an accident.” Whether these are misuses or not depends on your perspective, but they are potential uses that people are not expecting of emotion data, and they’re potential abuses.
Audio & Voice Emotion AI
Rana Gujral runs Behavioral Signals; Skyler Place leads behavioral science at Cogito. Both companies develop voice emotion AI for call-center environments, but each operates a bit differently. Cogito’s technology is focused on providing real-time feedback to representatives; Behavioral Signals’ tech is geared toward finding the best match between agents and the people they call. Also, Behavioral Signals analyzes vocal information only, not the content of conversations, while Cogito analyzes both.
How Does It Work?
Skyler Place: The market historically has focused on natural language processing and sentiment analysis — and the technology to support NLP accuracy continues to improve. In parallel, there have been organizations, Cogito included, focused on what we call the honest signals, which are everything in the conversation besides the words — energy in the voice, pauses, intonation, the whole variety of signals that help us understand the intentions, goals and emotions that people have in conversations — a very, very rich signal of information.
What we’re starting to see for the first time is the melding of these two data streams. I think that’s going to be the technological leap forward — the ability to really combine the understanding of NLP with the honest signals — that’s going to give us a novel way to understand and improve the emotion in conversations as we go forward.
We have about 200 different signals that we utilize to recognize these behaviors. And then we link the behaviors to the outcomes that are valuable for [call-center] calls. That’s how we think about emotion — less about pure recognition, more about understanding the behaviors that allow you to not just measure, but influence, the emotions in an interaction.
How Do Cultural Differences Affect Readings?
Rana Gujral: For the most part, it’s a challenge to recalibrate the baseline. Take how we express excitement versus anger. They both potentially could be high-pitched, but there are subtle differences. And our brains are obviously very adept at identifying the difference between the two. Even if you’re watching a foreign-language movie without subtitles, you can tell if a character is angry or excited, just by tone. The question is, how do you codify that?
We found that, once you recalibrate that baseline for a new language data set, the variance from the baseline that identifies those specific signals is almost identical across all cultures. We just need 10 to 20 hours of data — a maximum of 50 hours of data — to recalibrate that baseline.
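One common way to implement baseline recalibration of this kind — a sketch, not necessarily Behavioral Signals’ method — is per-dataset normalization: express each acoustic feature as a deviation from that language’s own baseline, so downstream models see relative variance rather than raw values. The pitch numbers below are invented:

```python
# Sketch of per-language baseline recalibration: z-score each feature
# against its own dataset's baseline. Feature values are invented.

from statistics import mean, stdev

def recalibrate(samples):
    """Convert raw feature values to deviations from the dataset baseline."""
    mu, sigma = mean(samples), stdev(samples)
    return [(x - mu) / sigma for x in samples]

# Two hypothetical language datasets with different raw pitch baselines (Hz);
# in each, the last speaker is unusually high-pitched for that language.
lang_a = [180, 190, 185, 240]
lang_b = [120, 130, 125, 180]
print(recalibrate(lang_a))
print(recalibrate(lang_b))
```

After recalibration, the outlier speaker lands at the same deviation score in both datasets — illustrating Gujral’s observation that the variance from the baseline looks nearly identical across cultures even when the raw baselines differ.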
Place: It absolutely is an important issue, and it ties into the potential bias that you see in algorithms. With our approach, the system will work out of the box, because you have this ability to understand how people are speaking — we can measure whether you’re speaking English or Spanish, whether you’re in New York or the Deep South — but the contextual meaning of that speaking rate may be different. So what “good” is may be different, based on the goals of the conversation and the region that [speakers] are from. So we tackle this at a variety of different stages across the product life cycle.
First, when we build our data sets, we measure and account for some of these different variables, making sure we’re representing them in the data. Second, we have human annotators listen to calls and mark up, for example, if a person is speaking too quickly for a particular part of the call. We make sure we pick those individuals from a variety of backgrounds — different genders, ages, cultures, so that we aren’t just getting one perspective of what “good” is for a call.
When we deploy to a client, we go through a calibration period where we’ve listened to hundreds and thousands of calls for that particular culture and particular use case and confirm that the settings are appropriate. That gives an opportunity to turn the dials to make sure we haven’t deployed anything that’s going to be adverse.
The Complexity of Emotion
Gujral: Some attributes are super complex. There’s no system that can completely pinpoint sarcasm. But for the most part, you’re keying off on some essential interaction attributes.
“Some attributes are super complex. There’s no system that can completely pinpoint sarcasm.”
The essential emotions are very measurable — anger, happiness, sadness, frustration and neutrality. Then you’re looking at measuring positivity and arousal — which is a tone change — and behaviors like politeness, engagement, agitation. You can also build some KPIs using the specific domain data and metadata — like measuring quality of interaction or agent performance. They’re all built on these basic sorts of signals. We have quite a few signals. Some we do much more accurately than others. For anger, we can produce a high-90s percentage accuracy, then there are others that are a little bit harder, with more false reads.
What Avenues Exist for Misuse or Abuse?
Gujral: Yale University professor Michael Kraus [noted] in a 2017 research paper that humans are actually really adept at masking emotions in our facial expressions. So I personally feel that implementations based on facial expressions are more problematic. The worst thing there can be for an AI implementation is an inaccurate or ineffective implementation.
I think there are lines that we need to draw. First is a line of privacy and choice. The consumer needs to be aware and willing to partake. That’s very important. And there are other moral boundaries. Last year, when we were fundraising, we had a request from a private company that works on behalf of a government agency in Europe — a friendly European country, but they work for defense systems — and they wanted to apply our intent models to immigration. And we said no. If a choice is going to be made on, say, visa overstay or immigration policies, which could be life-altering, we don’t want to go there. I’m not going to say our technology is not that accurate, but that’s a much higher bar.
What Are the Challenges Ahead for Voice/Audio Emotion AI?
Gujral: Data is always a key challenge — the more, the better, and quality matters tremendously. In a contact-center set-up, you’re sort of guaranteed high-quality data; it’s all recorded over professional recording equipment with channel separation and very little noise. But this year was interesting. We’ve seen an explosion in calls, but many have been poor quality, because oftentimes agents are working from home.
Place: A really interesting technical challenge is the synchrony of signals, which has to do a bit with computational load and the processing of different systems. Because we’re focused on real-time guidance, we’ve put an enormous amount of effort into low-latency code. What’s been interesting, as we start to integrate the nonverbal signal computations with natural language processing, is that the [NLP] has a much longer delay. It takes more computational processing and more time to describe the words that are happening. How you build a product that can combine those two different signals in a way that’s both accurate, but also actionable and timely, is a really interesting design question, as well as a back-end, data-pathways question.
Video and Multimodal Emotion AI
How Does It Work?
Daniel McDuff: What’s made this possible is the investment in camera technology over the past 20 years — cameras that have high-quality sensors with low noise. Using a camera even with a simple signal-processing algorithm, if you analyze skin pixels, you can typically pull out a pulse signal and also detect respiration for a person who’s stationary.
But if you think about needing it to work when people are moving, maybe the lighting is changing, [plus] everyone has different skin pigmentation, facial hair — that’s where deep learning helps to make it more robust to all these sources of noise. The camera is sensitive enough to pick up the initial signal, but that often gets overwhelmed by different variations that are not related to physiological changes. Deep learning helps because it can do a very good job at these complex mappings.
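The simple signal-processing path McDuff describes can be sketched as follows — a synthetic-signal toy, not a working rPPG system: average the green channel over skin pixels each frame, then pick the strongest frequency in the plausible heart-rate band.

```python
# Sketch of camera-based pulse detection (remote photoplethysmography).
# Input is the per-frame mean green value over skin pixels; here it is
# synthesized, with a 1.2 Hz (72 bpm) pulse riding on a constant baseline.

import math

FPS = 30.0  # assumed camera frame rate

def dominant_pulse_hz(green_trace, lo=0.7, hi=4.0, step=0.01):
    """Scan the heart-rate band (42-240 bpm) for the strongest frequency."""
    n = len(green_trace)
    mu = sum(green_trace) / n
    centered = [g - mu for g in green_trace]   # remove the DC baseline
    best_f, best_power = lo, 0.0
    f = lo
    while f <= hi:
        re = sum(c * math.cos(2 * math.pi * f * i / FPS) for i, c in enumerate(centered))
        im = sum(c * math.sin(2 * math.pi * f * i / FPS) for i, c in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
        f += step
    return best_f

# Ten seconds of synthetic mean-green trace at 30 fps:
trace = [100 + 0.5 * math.sin(2 * math.pi * 1.2 * i / FPS) for i in range(300)]
print(round(dominant_pulse_hz(trace) * 60))   # → 72
```

The deep-learning step McDuff mentions is what replaces this naive band scan in practice, making the estimate robust to motion, lighting changes and skin-tone variation that would swamp a clean sinusoid like this one.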
How Do Cultural Differences Affect Readings?
McDuff: There are definitely differences across cultures. We see that in our own analyses of large-scale image data sets, but also in a lot of psychology research, both in self-reported data and in observational measurement.
However, those differences tend to be pretty small when compared to individual differences even within the same culture. For instance, my brother might be more expressive than me by quite a large margin even though we come from the same family. So there are large individual differences even within people from very similar backgrounds.
“There are definitely differences across cultures.... However, those differences tend to be pretty small when compared to individual differences even within the same culture.”
That doesn’t mean that understanding cultural differences isn’t interesting from a scientific and psychological perspective. But when it comes to modeling the data, ultimately that’s just one, in some cases quite small, source of variation, when there are many other large sources of variation — [situational] context, gender, background. The way you’re treated as you grow up influences how you behave, right? If people around you are not that expressive, you might end up not being that expressive.
Studying this is very interesting intellectually, but when it comes to actually putting [models] into practice, often I think the social context of the situation matters way more. If you can control for the context, then you can compare across cultures and see these differences. But if I take a video of someone doing karaoke in Japan and compare it to someone in an office in the United States, the Japanese person is going to seem super expressive. But that’s not really a fair comparison, right? Because the context is so different.
How Applicable Is Video-Based AI at an Individual Level?
McDuff: Many applications at the moment look at group responses, [such as] measuring how much people smile when they watch an advertisement. They typically don’t do that for one individual; they do that for 30 or 40 individuals, then average the data. Marketing folk find that data useful, and they often combine it with self-reported and other data. That’s fine. But in that case, they’re not really interpreting each individual’s underlying experience that much. They’re looking at what’s observed on the face and using that as a quantitative measure of responses to the content.
But when you get into something like health, you’re often looking at individuals’ behaviors, trying to understand if what you’re observing are perhaps symptoms, or effects of a medication. For instance, you may have someone with Parkinson’s disease taking medication that controls tremors. Say you were using video to track how quickly the medication wears off. You need to personalize that because different individuals might have different magnitudes and manifestations. So understanding a group-level signal is not that useful, because you’re trying to target an individual.
Most of the applications I’m interested in do involve this personalization. If we’re tracking heart rate variability, we need to know your baseline. What’s abnormal for you might be totally normal for someone else. Understanding a population can be helpful in some cases, but you’re limited in what you can do.
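The personal-baseline idea can be sketched as a simple outlier test against an individual’s own history rather than a population norm. The heart-rate-variability numbers and threshold below are invented for illustration:

```python
# Sketch of per-person baselining: the same reading can be abnormal for
# one person and routine for another. Data and threshold are invented.

from statistics import mean, stdev

def is_abnormal(reading, personal_history, k=2.0):
    """Flag a reading more than k standard deviations from this person's baseline."""
    mu, sigma = mean(personal_history), stdev(personal_history)
    return abs(reading - mu) > k * sigma

runner = [80, 85, 78, 82, 84]   # hypothetical high-HRV baseline (ms)
desk   = [40, 42, 38, 41, 39]   # hypothetical lower baseline (ms)

print(is_abnormal(42, runner))  # → True: far below this person's baseline
print(is_abnormal(42, desk))    # → False: entirely typical for this person
```

The identical reading triggers a flag for one person and none for the other — McDuff’s point that what’s abnormal for you might be totally normal for someone else.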
How Far Are We from Meaningful, Positive Impact?
McDuff: Fitbit already exposes stress metrics to the wearer, so you can see your heart-rate variability over time now. Whether you do that from a wearable or a camera or another device, that’s not a big leap. That’s just changing the sensor. The big leap is going from measurement to something that’s actually useful. With step counting, it’s easier because, say, if I only walked 1,000 steps, maybe tomorrow I’m prompted to go for a walk. But with stress, the intervention is a bit less clear.
I think we’re still a little bit away from moving from tracking toward the real utility. That could be several years because, as you get into these more complex things, like stress, personalization of the intervention is important. So we need to study how we can turn this tracking data — which I think is not exactly a solved problem, but there’s lots of data around us, so if we need data, that’s not usually the roadblock — turn it into insight that helps improve the utility that people get from the system.
What Avenues Exist for Misuse or Abuse?
McDuff: There needs to be at least good solid evidence that these signals matter in the [given] context. That’s always the first thing I would look for when thinking about a solution.
Another important consideration is the population. AI is often used to automate, and often, the population it’s applied to tends to be vulnerable. If you’re applying for a job, you’re vulnerable. The company is in power because they can choose to hire you or not. The same could be said of children in a classroom — students are more vulnerable than their teachers. It’s important to think about whether the technology is being used in a way that benefits someone who already has more power and could potentially make the process less transparent and more difficult.
“AI is often used to automate, and often, the population it’s applied to tends to be vulnerable.”
Another example is exam proctoring. Now that everyone’s remote, people are using systems to check if people are cheating. Well, that’s problematic, because an algorithm just can’t detect all the nuances, right? And you could really harm someone by labeling them a cheat. That doesn’t seem like a well-thought-out solution — just applying machine learning to that problem.
Mapping Isn’t One-to-One Across People
McDuff: There’s not a one-to-one mapping between our physiological state and our facial expressions that generalizes across everyone. Some people like it when their bodies are revved up. Adrenaline junkies feel happy when that happens, but some people get really nervous at the same type of physiological change. The exact same physiological change may be labeled completely differently by two different people. That’s a huge problem, but it doesn’t mean that things we’re able to sense about people are not useful. We just need to do it in a way that’s personalized and adaptive and brings humans into the loop to teach machines.
One solution might be that every person who uses a system teaches the machine what it means for them when it observes particular facial expressions, or particular physiological changes. You coach the machine, and, over time, it learns the mapping between your behaviors and how you feel. That, I think, is the biggest question in terms of practically making this technology effective and also safer. You don’t want to use a blunt tool. Yes, there is a broad mapping that [says] if I smile, I’m probably feeling positive. But there are many exceptions. That might be the modal label, but there’s huge variance. So this personalization is the huge question — and it’s not answered perfectly. There are ways to baseline and calibrate things, but there’s a long way to go.
Responses edited for length and clarity.