Why Racial Bias Still Haunts Speech-Recognition AI
When you ask Siri a question or request a song through Alexa, you’re using automated speech recognition software. But uses of that technology go far beyond consumer electronics.
Companies use AI services to screen job applicants. Court reporters use speech recognition tools to produce records of depositions and trial proceedings. Physicians use software by Nuance and Suki to dictate clinical notes. If you have a physical impairment, you might use speech recognition software to navigate a web browser. YouTube uses it to create automatic captions, whose malaprops inspired a parody series called Caption Fail. I used an automated transcription service to cross-check notes for this story.
For all its critics — and it has its share — speech recognition software has made dramatic progress in recent years. In 2017, at its annual developer conference, Google I/O, Google announced a word error rate of just 4.9 percent, down from 8.5 percent in 2016.
Yet for many Black users, the speech-to-text services of the world’s biggest tech companies appear to be fundamentally flawed.
“We think the disparity is largely due to the lack of diverse training data.”
A recent Stanford University study found the speech-to-text services used by Amazon, IBM, Google, Microsoft and Apple for batch transcriptions misidentified the words of Black speakers at nearly double the rate of white speakers. The study compared transcription results for 2,000 voice samples from recorded interviews with Black and white Americans.
The error rate for Black speakers was 35 percent, compared to a 19 percent error rate for white speakers.
“We think the disparity is largely due to the lack of diverse training data,” said Allison Koeneke, one of the study’s authors and a doctoral candidate in Stanford’s computational and mathematical engineering department.
Miriam Vogel, president and CEO of Equal AI, an organization that helps companies and policymakers address bias in AI, said the findings, published in the journal Proceedings of the National Academy of Sciences, could have far-reaching implications for Black speakers using these systems — from being shut out of automated ordering or delivery systems to being unfairly evaluated in hiring and promotion decisions.
“A lot of companies are using AI for key functions like hiring and firing, and don’t realize they could be spreading the discrimination; doubling down,” said Vogel, who formerly served as acting director of justice and regulatory affairs at the White House and led President Barack Obama’s Equal Pay Task Force. “In particular, I’ve talked to a lot of law firms and they want to address their diversity problems. And so they’re using AI [in the hiring process] with this concept that it’s neutral.”
But that’s far from the truth, Koeneke said. Few people besides the engineers and developers who create these systems know how they work, because the systems are proprietary. And the underrepresentation of Black speakers in training data, which likely contributes to the problem, goes back to the early days of speech recognition as a field.
“A lot of companies are using AI for key functions like hiring and firing, and don’t realize they could be spreading the discrimination; doubling down.”
Early voice corpora, such as Switchboard, Koeneke said, were largely derived from recorded phone calls and skewed heavily toward white speakers in the middle of the country. When these data sets were adopted by large tech firms as they built out their automated speech-recognition engines, the biases carried over.
Common Speech Patterns Are Underrepresented in Training Sets
Trevor Cox, a professor of acoustical engineering at the University of Salford and author of Now You’re Talking, said that even today, voice AI struggles with regional dialects, the dialects of non-native English speakers and women’s voices.
“We’ve seen this in machine learning in lots of areas, where, essentially, if you don’t give the data to the machine learning engine to learn, then it doesn’t learn it,” Cox said. “When Siri came out, it really struggled with Scottish accents because it didn’t have enough Scottish accents in its training data.”
More important for the Stanford study is the impact of African American Vernacular English, a dialect spoken by some Black speakers in the United States. AAVE is underrepresented in voice AI training sets. This lack of representation, Koeneke said, means that many of its linguistic features are poorly understood by speech recognition engines and thus poorly reproduced in transcripts.
(It is worth noting that AAVE is only one factor at play. Gender is another: The study found that voice AI was significantly less accurate for Black men than for Black women.)
AAVE traces its legacy to enslaved persons, said John Baugh, a professor of linguistics and African American studies at Washington University in St. Louis. In many cases, they were separated from other speakers of their native languages by oppressors to limit the likelihood of uprisings.
Baugh, whose research focuses on linguistic discrimination and profiling, peer-reviewed the Stanford study.
“No indigenous African language survived intact. That set the stage for the kind of stereotyping we see today,” Baugh said.
That linguistic isolation laid the foundation for AAVE, Baugh explained, but other factors contributed too. Early American legal prohibitions on schooling played a role, as did exposure to the speech patterns of European-born indentured servants working on plantations. Later, racial segregation in cities like Philadelphia, Baltimore, Detroit and Chicago contributed to its evolution.
Experts hold differing theories about the origins of African American Vernacular English. Some linguists point to roots in British regional dialects and its similarities with English Creole. However, there is broad consensus among linguists that it has phonological, grammatical and syntactical differences from the language referred to variously as standard English or Mainstream American English (MAE).
A key feature of African American Vernacular English, Baugh said, is “camouflage construction,” a term coined by linguist Arthur Spears to reflect the shifting meaning and grammatical function of words based on their pronunciation and context.
For example, if someone asks, “Is your sister still married?” Baugh told me, the same response, “She been married,” can mean two different things. If the word “been” is left unstressed, it is commonly interpreted to mean she was married and no longer is. The same phrase with strong emphasis on “been” is commonly interpreted to mean she has been married for a long time.
This linguistic trait likely has a historical basis: “English is not a tone language. African languages are tone languages,” Baugh said.
Uses of ‘To Be’
Another key distinction between African American Vernacular English and MAE is the use of the verb form “to be.” In AAVE, the “invariant habitual” form of the word used in phrases such as “they be happy” means “habitually happy, happy all the time,” Baugh said.
This is distinct from expressions like “they happy” or “they are happy,” which reflect a momentary state. Baugh speculates that the habitual verb form can be traced to an African grammatical legacy and the influence of the brogue of Irish overseers.
Voice Recognition Software Doesn’t Capture These Nuances
To understand how this plays out in transcription algorithms, it is helpful to look at an example from the Stanford study. A transcribed audio snippet of a 67-year-old Black man from Washington, D.C., who speaks African American Vernacular English, is characteristic of the types of errors these systems make:
“With seven braids, He’d give me a nickname Snake cause she said was sneaking. You know. Me sit in one place and she China, man, I’m staying someplace.”
Here’s what the man actually said:
“In second grade, teacher gave me nickname, Snake, cause, well, she said I was sneaky. You know I be sitting in one place and, she turn around, I’m sitting someplace else.”
The Acoustic Model Vs. The Language Model
Koeneke said that, although the voice recognition systems of the five software companies studied are proprietary and opaque, making any theory hard to verify, the disparities are most likely due to the limitations of the “acoustic model.” This model detects variations in intonation and prosody: the rhythms, syllable stress, and rising and falling pitches that make a Boston accent different from a Chicago accent.
This is distinct from the “language model,” which focuses on grammar and sentence structure.
Mistaking “second grade” for “seven braids” or “sitting” for “staying” appears to exemplify the acoustic model’s failure to recognize subtle pronunciation differences, as described by Baugh.
“From the days of Bell Labs, a specific voice frequency band was chosen that still governs how our phones and microphones work.”
The automated transcription service Rev reports on its website that word error rates are calculated by a formula that counts substitutions (words that are replaced), insertions (words added that weren’t said) and deletions (words left out completely). The number of variables speaks to the complexity of faithfully interpreting and reproducing speech — particularly conversational speech, which is highly unpredictable. A podcast is much more difficult to transcribe than basic commands to a digital assistant.
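The formula Rev describes is the standard word error rate, computed from a word-level edit-distance alignment. Here is a minimal sketch in Python; the function name and the sample phrases (drawn from the study’s example) are illustrative, not Rev’s actual code:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count,
    found via the standard Levenshtein (edit-distance) alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution (or match)
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# Four substitutions plus one insertion over seven reference words:
print(word_error_rate("in second grade teacher gave me nickname",
                      "with seven braids he gave me a nickname"))  # 5/7, about 0.71
```

Note that a single garbled phrase can push the rate sharply upward, which is why conversational speech scores so much worse than short, predictable commands.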
Another part of the problem may lie in the recording technology used to collect the data. “From the days of Bell Labs,” writes Joan Palmiter Bajorek, who is head of conversational research and strategy at the Australian voice and digital agency VERSA, “a specific voice frequency band was chosen that still governs how our phones and microphones work. Thus, the quality of the hardware and whether you’re using a 180-degree microphone does matter significantly in accuracy rates.”
In addition, “physical factors (women are typically shorter and have smaller vocal tracts) play a part in voice data,” she wrote in an email. “However, this does not explain the disparities completely. If your data set does not contain diverse demographics, it will most likely not perform well for diverse demographics.” And “if the data sets are incredibly biased, those discrepancies make it perform terribly for data it has not seen much of before, if at all,” she wrote.
There can be problems on the language side of the model too. After the acoustic model makes its best guess at what a speaker is saying by parsing the audio file into phonemes, Cox said, the language model attempts to correct incongruities and fill in the gaps through linguistic associations. The machine learning applications that do this are likely to contain societal biases.
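The gap-filling behavior Cox describes can be illustrated with a toy bigram language model. Everything below — the corpus, the function, the candidate phrases — is invented for illustration, but the mechanics are the same ones that let a skewed training corpus steer the tie-break between two near-identical acoustic guesses:

```python
from collections import Counter

# A tiny, deliberately skewed training corpus (assumed data). Real systems
# train on billions of words; the same mechanics amplify whatever biases
# that text contains.
corpus = ("i am sitting in one place . i am sitting someplace else . "
          "she is staying home . i am staying here .").split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def score(words):
    """Probability of a word sequence under the bigram model (no smoothing)."""
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

# Two near-tied acoustic guesses; the language model breaks the tie
# toward whichever phrasing its training text contained.
print(score("i am sitting someplace".split()))  # about 0.33
print(score("i am staying someplace".split()))  # 0.0 -- never seen in the corpus
```

A phrasing the model has never seen scores zero here, which is the toy version of how underrepresented dialects get “corrected” into more familiar but wrong words.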
As Cox writes in the Harvard Business Review, “type the Turkish ‘O bir hemşire. O bir doctor’ into Google Translate and you’ll find ‘She is a nurse’ and ‘He’s a doctor.’ Despite ‘o’ being a gender-neutral third-person pronoun in Turkish, the presumption that a doctor is male and a nurse is female arises because the data used to train the translation algorithm is skewed by the gender bias in medical jobs.”
A similar phenomenon could be particularly problematic to Black Americans in the U.S., he writes, with a University of Bath study “showing that, in typical data used for machine learning, African American names are used more often alongside unpleasant words” than European American names.
He points out that the disproportionate number of African Americans incarcerated in the U.S., and the volume of news reports surrounding them, makes for a dangerous cocktail when it comes to machine learning training data, which historically has been culled from audiobooks, public talks, phone calls and YouTube videos.
And the structural disparities embedded in these systems could actually grow worse, if left unaddressed, Koeneke said: “If you think about Google’s products, they’re collecting data from their users, but, of course, if only white users are able to successfully use Google Voice or Google Assistant, then they’ll be more likely to continue using the product and generating more training data for Google.”
A Self-Fulfilling Feedback Loop
The bias, in effect, could create a self-fulfilling feedback loop, in which machine learning applications become more adept at recognizing and accurately representing certain end users, to the exclusion of others, said Miguel Jette, director of speech R&D at the transcription service Rev.
“So, if 80 percent of your users are men and 20 percent are women, then the test set often represents that split. Now, when you use that data set to tune your system, you might, without noticing, bias your system toward men because it moves the needle on your test set,” Jette said.
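Jette’s 80/20 scenario can be sketched with back-of-the-envelope arithmetic. All the error rates below are hypothetical, but they show how a skewed test set can report “progress” while the underrepresented group actually gets worse:

```python
def overall_wer(men_share, men_wer, women_wer):
    """Aggregate word error rate when men_share of the test set is male speech."""
    return men_share * men_wer + (1 - men_share) * women_wer

# Hypothetical per-group error rates on an 80/20 test set vs. a balanced one:
print(overall_wer(0.80, 0.10, 0.30))  # about 0.14 on the skewed test set
print(overall_wer(0.50, 0.10, 0.30))  # about 0.20 on a balanced one

# A tuning change that helps men slightly but hurts women more
# still "moves the needle" on the skewed metric...
print(overall_wer(0.80, 0.08, 0.34))  # about 0.132 -- looks like an improvement
# ...while a balanced test set reveals the regression.
print(overall_wer(0.50, 0.08, 0.34))  # about 0.21 -- actually worse
```

This is the quiet mechanism behind the bias Jette describes: nobody has to intend the skew, the aggregate number just rewards it.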
Without legal guardrails and social pressure, Cox said, large, multinational companies are unlikely to correct the imbalance on their own, as the market provides little incentive.
“If they have to decide where to put resources, they’ll put them in making sure it works for China, for example,” he said. “It’s a much bigger market. So where there are big markets for their products is where things will tend to work well.”
“If only white users are able to successfully use Google Voice or Google Assistant, then they’ll be more likely to continue using the product and generating more training data for Google.”
This is a phenomenon Dwayne Samuels, CEO of Samelogic, witnessed for himself. His company spent years developing video transcription tools for clients such as Nielsen, which used them to collect data from focus groups of English speakers in the Caribbean and Europe, among other places.
“At first, we wrote a lot of our own models, but it’s a very expensive thing to actually do. So we turned to Microsoft Project Oxford, IBM services and Amazon Transcribe. And we’ve found that, for the most part, these worked for the general population, I would say for 70 percent of people they work well.”
For the other 30 percent, however, largely speakers who lived outside the United States, the results were less impressive.
The effectiveness of speech recognition software in capturing regional dialects is an understudied area Baugh would like to see investigated further. The Stanford study clearly documented that the software makes mistakes and that some programs perform better than others: Microsoft, the most accurate system, had a 27 percent error rate for Black speakers and 15 percent for white speakers, while Apple, the lowest performer, misidentified 45 percent of words from Black speakers and 23 percent from white speakers. But the study has limitations in its scope, he said.
As reported by the New York Times, much of the speech corpus used to train voice assistants was culled from white speakers in Sacramento and northern California; the speech data for Black speakers was collected in “a largely African-American rural community in eastern North Carolina, a midsize city in western New York and Washington, D.C.”
“The study calls attention to this issue in racial terms,” Baugh said, “but voice recognition software might not evaluate all white dialects equally well either. The Stanford researchers simply didn’t test that. I almost want to go back to the researchers and say, ‘I really wish you had, in addition to looking at Black dialect, found white southern samples as well.’ The point I’m emphasizing is I hope the racial divide is not masking another problem.”
Fixing the Problems Will Require Real Investment
If casting a wider research net and drawing back the curtain on privately held algorithms represent significant challenges, correcting the disparities in practice may be even more difficult.
Open-source software libraries like Google’s TensorFlow can help create more representative data sets, Samuels said, but building and testing use cases from scratch is extremely difficult for small or mid-sized companies, which often rely on fee-based, pre-built models. Even then, editing errors in transcripts to help train the software during the “supervised” learning phase can be extremely costly.
In Samelogic’s case, poring over videos frame by frame to reach a 93 percent word accuracy rate — a threshold that brought high confidence that the system could thereafter train itself — was ultimately unsustainable. The company shifted its business to provide a service for product managers to run experiments on SaaS products.
Open-Source Communities Are Taking on the Challenge
Thankfully, brighter days may be on the horizon.
VERSA’s Bajorek told Built In that open-source projects like Artie Bias Corpus: An Open Dataset for Detecting Demographic Bias in Speech Applications and Mozilla Common Voice are working to make data sets more equitable and inclusive. Citing the website of Rachael Tatman, a developer advocate for Rasa who has published extensively on gender and dialect bias in speech recognition software, she said there is a growing body of research and publicly available resources for those eager to learn more about the disparities in common data sets used in the industry.
She also noted that “many companies are working on these problems,” and that, “depending on elections, legislative bodies and who is in power politically, there is great potential for regulations and laws like General Data Protection Regulation and the International Covenant on Civil and Political Rights to drive mandatory changes in this industry.”
Still, there are widely conflicting reports of the accuracy of automated speech recognition systems. In speech sciences, Bajorek wrote, “we measure the performance of a system through word error rate (WER). The best-performing systems have low WER and thus higher accuracy. To use a system reliably, the industry standard is that a system must have a 10 percent WER or lower, which means it’s performing well 90 percent or more of the time.”
The significant discrepancy between Google’s 4.9 percent figure, as reported at the 2017 I/O developer conference, and the average figures reported in the Stanford study raises questions about differences in analysis methods and the demographics of the user pool Google used to test its error rate.
“If our systems were specifically designed for a Spanish-English bilingual eight-year-old girl,” Bajorek noted, “they would be radically different from the speech recognition systems we have today. This makes you wonder who is on the development and design teams of these systems. Having diversity on the building side is crucial. Recruiting and retaining diverse talent is paramount to fixing this problem — and that’s systemic across tech.”
“If our systems were specifically designed for a Spanish-English bilingual eight-year-old girl, they would be radically different from the speech recognition systems we have today.”
Vogel said legal pressure could nudge companies toward better practices. She cites Ousmane Bah’s $1 billion lawsuit against Apple over a false arrest he alleges resulted from Apple’s face recognition system as the kind of “legal liability we can expect to see in the coming years, as lawyers are clued into the liabilities that could be embedded with these biases.”
Microsoft’s 10-K filing to the Securities and Exchange Commission in 2018, as reported in Quartz, is another signal of the business risks companies may face if such disparities are left unaddressed. Here’s a portion of the report:
“AI algorithms may be flawed. Datasets may be insufficient or contain biased information. Inappropriate or controversial data practices by Microsoft or others could impair the acceptance of AI solutions. These deficiencies could undermine the decisions, predictions, or analysis AI applications produce, subjecting us to competitive harm, legal liability, and brand or reputational harm.”
Koeneke said she hopes the public nature of the study will inspire action on the part of the companies studied and the tech community at large.
“Speaking on behalf of the authors of our paper,” she said, “we’re hoping that, not only do we end up collecting more diverse training data to help these pipelines, we also hope these companies will be more transparent in showing us, year over year, the amount of their improvement with regard to diversity and speech recognition systems. Some subset of them seem to be receptive to improving and have read our paper. And so that is a positive first step.”