How Does Your Phone Know Which Calls Are Spam?
If you have a phone in the United States, you’ve probably gotten a call from Susie about your car’s extended warranty. Or Carol who needs to tell you about recent changes that will impact your student loans. Or maybe even that guy who calls to talk to Fredrick (or Carl or Santiago) about donating to a police (or firefighter) charity, but maybe you can help him instead.
These robocalled recorded voices and the scams they pitch aren’t alone in their onslaught against our phones. Scam calls are ubiquitous.
In 2020, one in five U.S. mobile phone users received three or more scam calls a day, and an estimated 3 to 5 billion robocalls are placed every month. As a result, almost 90 percent of calls from unknown numbers go unanswered due to low user trust.
How Is Machine Learning Used to Fight Spam Calls?
The increasingly common “spam likely” message that pops up on your phone when it rings is part of the ongoing battle against such calls. These warnings are the result of machine learning efforts deployed by voice service providers, device companies and third-party app makers. Not only can this warn users before they pick up a call, but it can also help catch the scammers.
Using Machine Learning to Generate ‘Spam Likely’ Warnings
When your phone’s caller ID says “spam likely,” that is based on an analytics engine used by the carrier, said Mike Rudolph, CTO of YouMail, a third-party call protection provider that tracks and addresses robocalls. The big three carriers all partner with different analytics engine vendors: AT&T with Hiya, Verizon with TNS and T-Mobile with First Orion.
“All three of those guys have used machine learning based upon the data set they operate from in order to give you that ‘spam likely’ indication on those three mobile operators,” Rudolph said.
The data sets carriers use for this process come from call detail records. Calls made over the phone or via voice over internet protocol systems generate call detail records, which are logged by voice service providers (also called carriers) and telephone exchanges (also known as switches). Call detail records contain basic metadata about the call like call origin and destination, type of media (audio, SMS and so on), call duration, and whether or not the call connected.
Analytics engine vendors typically use behavioral analytics that examine two signals to identify suspicious callers: reach (how many people a particular number is calling) and frequency (how many calls it makes in a given period of time).
Rudolph gave an example of a new number suddenly making tens of thousands of calls within a network around 9 a.m. on certain days.
“The behavioral analytics have been trained that a number it hasn’t seen before that makes 50,000 calls at 9 a.m. on a Monday is suspicious,” he said. “It will be marked as ‘spam likely.’”
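To make reach and frequency concrete, here is a minimal Python sketch that computes both from simplified call detail records and flags a previously unseen number that places 50,000 calls in one hour. The record fields, the function names and the fixed thresholds are all assumptions made for illustration. A real analytics engine would learn its decision rule from data rather than rely on hand-set cutoffs.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CallRecord:
    """Simplified call detail record (hypothetical field names)."""
    origin: str
    destination: str
    start: datetime
    duration_s: int
    connected: bool


def reach_and_frequency(records, window_start, window):
    """Return {origin: (reach, frequency)} for calls that start inside the window."""
    window_end = window_start + window
    destinations = defaultdict(set)
    counts = defaultdict(int)
    for r in records:
        if window_start <= r.start < window_end:
            destinations[r.origin].add(r.destination)  # reach: distinct callees
            counts[r.origin] += 1                      # frequency: total calls
    return {origin: (len(destinations[origin]), counts[origin]) for origin in counts}


def flag_spam_likely(stats, known_numbers, max_reach=1000, max_calls=5000):
    """Flag unseen numbers whose reach or frequency exceeds an assumed threshold."""
    return sorted(
        origin
        for origin, (reach, freq) in stats.items()
        if origin not in known_numbers and (reach > max_reach or freq > max_calls)
    )


# A new number places 50,000 calls to distinct destinations on a Monday at 9 a.m.
monday_9am = datetime(2022, 1, 3, 9, 0)
records = [
    CallRecord("+15550100", f"+1555{n:07d}", monday_9am + timedelta(seconds=n % 3600), 0, False)
    for n in range(50_000)
]
# A number the network has seen before places one ordinary call.
records.append(CallRecord("+15550199", "+15550001", monday_9am, 120, True))

stats = reach_and_frequency(records, monday_9am, timedelta(hours=1))
flagged = flag_spam_likely(stats, known_numbers={"+15550199"})
print(flagged)
```

Only the new high-volume number ends up in the flagged list. The familiar number with a single call does not.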
In addition to the data in call detail records, phones offer native tools to help identify and mark spam calls. These tools provide another data stream that can be used in machine learning processes to identify potential spam calls. Apple, for example, has its “silence unknown numbers” feature on phones running iOS 13 and later operating systems. The Google Phone app for Android similarly includes caller ID and spam protection options that allow users to mark calls as spam.
Carriers similarly have their own systems: T-Mobile has ScamShield powered by First Orion, Verizon has Call Filter powered by TNS’ Call Guardian and AT&T has Call Protect powered by Hiya. Third-party apps like YouMail, RoboKiller, CallApp and those put out by Hiya, TNS and First Orion also allow users to mark calls as spam.
“An entry such as this gets added to the database as a spam call entry along with other regular calls,” said Albar Wahab, a data scientist trainee at Data Science Dojo. He said feature engineering can be used to select the best indicators of spam calls. Then traditional machine learning classification algorithms, such as support vector machines, can be applied to predict whether a future incoming call is potential spam. Deep learning algorithms like convolutional neural networks and long short-term memory (LSTM) networks can also be used to effectively automate the feature engineering step.
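As a rough illustration of the classification step Wahab describes, the sketch below trains a linear support vector machine on a handful of labeled callers using plain subgradient descent on the hinge loss. Everything in it is assumed for the example: the three engineered features (scaled call rate, share of very short calls, share of user spam reports), the toy training rows and the hyperparameters. A production system would more likely use an established library and far more data.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.01, epochs=2000):
    """Fit a linear SVM by per-sample subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks the weights on every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            # The hinge term only contributes when the margin is violated.
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Label a caller from its feature vector."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "spam" if score >= 0 else "not spam"


# Hypothetical features per caller, each scaled to [0, 1]:
# (call rate, share of very short calls, share of user spam reports)
samples = [
    (0.90, 0.80, 0.70), (0.80, 0.90, 0.60), (0.95, 0.70, 0.90),  # marked as spam
    (0.10, 0.20, 0.00), (0.20, 0.10, 0.05), (0.05, 0.30, 0.00),  # regular calls
]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
print(predict(w, b, (0.90, 0.90, 0.80)))
print(predict(w, b, (0.05, 0.10, 0.00)))
```

The first query resembles the callers users marked as spam and the second resembles the regular ones, so the trained model should label them accordingly.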
Other Ways to Identify Spam Calls and Those Who Allow Them
While voice service providers are limited to using the data in call detail records to identify potential spam calls due to privacy laws, third-party apps that users opt into can access more information about calls. YouMail, for instance, uses an audio fingerprint system to analyze the content of a call to identify known and potential scam robocalls without anyone actually listening to the call.
“We are 100 percent based on the audio in calls and we do nothing related to reach or frequency of calls,” Rudolph said. “For us, because we are an over the top information service, we can train machine learning based upon what the call said. That’s a completely different machine learning.”
YouMail, Rudolph explained, takes the audio of a call and turns it into an image using the fast Fourier transform, or FFT, and the constant-Q transform, or CQT. The resulting image is the audio fingerprint of a call. Using both supervised and unsupervised machine learning algorithms, YouMail plots the auditory differences between sample calls’ fingerprints and those of known scam calls. The less auditory difference between a sample or ongoing call’s fingerprint and that of known scam calls, the more likely that call is to be a scam.
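The general idea of comparing fingerprints can be sketched in a few lines of Python. This is not YouMail's system. It is a simplified stand-in that uses a naive discrete Fourier transform to build a small spectrogram-like fingerprint and then measures Euclidean distance between fingerprints. The function names, the frame size and the synthetic sine-wave "calls" are assumptions, and the constant-Q transform is left out entirely.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the first N/2 DFT bins of one frame (naive O(N^2) transform)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def fingerprint(samples, frame_size=64):
    """One normalized spectrum per non-overlapping frame, like rows of an image."""
    rows = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        spectrum = magnitude_spectrum(samples[start:start + frame_size])
        total = sum(spectrum) or 1.0  # avoid dividing by zero on silence
        rows.append([m / total for m in spectrum])
    return rows


def distance(fp_a, fp_b):
    """Euclidean distance between two fingerprints of the same shape."""
    return math.sqrt(
        sum((a - b) ** 2 for row_a, row_b in zip(fp_a, fp_b) for a, b in zip(row_a, row_b))
    )


def tone(freq_bin, n=256, frame_size=64, amplitude=1.0):
    """Synthetic stand-in for call audio: a sine wave centered on one DFT bin."""
    return [amplitude * math.sin(2 * math.pi * freq_bin * t / frame_size) for t in range(n)]


known_scam = fingerprint(tone(5))
same_recording_quieter = fingerprint(tone(5, amplitude=0.5))
different_call = fingerprint(tone(12))

d_same = distance(known_scam, same_recording_quieter)
d_diff = distance(known_scam, different_call)
print(d_same < d_diff)
```

Because each frame is normalized, the quieter copy of the same recording lands almost on top of the known fingerprint, while the unrelated tone sits much farther away. A smaller distance means a more likely match, as the article describes.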
The audio fingerprints can also be used to identify potential new scams as they happen, either based on a new cluster of very similar content or because of the content itself.
“For example, our machine learning knows some things that are binary,” Rudolph said. “If you get a call that says it is the [Internal Revenue Service] or the [Social Security Administration], that’s unequivocally going to be a fraudster calling you.”
The ability to identify scam calls in progress using audio fingerprints also allows for faster reporting to potentially identify bad actors or at least the voice service provider that carried the call.
When YouMail encounters a call that matches the audio fingerprints of known scam calls, it can be sent to the Industry Traceback Group within seconds of being identified, Rudolph said. The Industry Traceback Group can then track the scam call back to the provider that enabled the call. Because of the TRACED Act, which was signed into law in 2019, voice service providers are required to shut down accounts that send out unlawful calls.
Cutting Off the Flow of Data to Spammers
Just like those who fight scam calls with machine learning thrive on data, so too do the scammers. Though scam call reduction is not generally the purpose of most data privacy apps or services, reducing the publicly available data that scammers access can have the side effect of reducing scam calls as well.
“One thing that we tried to do at Kanary is identify the data sources that spammers use and then remove the data from there,” said Rachel Vrabec, founder and CEO of Kanary, a data privacy service. By removing their phone numbers and other personal information from public sources, customers are made less searchable. Being less searchable then makes it harder for robocalling scammers to identify live numbers to call, Vrabec said.
“When you look at the supply chain of these phone numbers and how they end up in spammers’ arsenals to use, you don’t want to be like the first number on all their lists,” she said. “The goal is to help you keep your phone number more private.”
Spoofing Poses Challenges to Data Collection
While not all spam calls are robocalls, and not all robocalls are spam, there can be a lot of overlap. Increasingly, robocalls are spoofed, meaning the number that appears on your caller ID is not the actual number from which the call originated. While call spoofing can be done for legitimate reasons — like when a doctor calls you back from their personal phone but the office’s number is displayed on your caller ID to protect the doctor’s privacy — when scammers use spoofed robocalls, it’s to avoid being detected and tracked down.
“Offshore, and even onshore, less desirable-type companies you wouldn’t want to work with don’t want you to find out who they are, and they don’t want you to call them back on their real phone number, so they spoof a phone number,” said Brian Podolak, CEO at Vocodia, an AI sales and customer service platform.
Spoofing is a growing part of robo-scammers’ arsenal, and this can dull the edge that machine learning puts on scam detection efforts. The short version, as Omer Khan, CTO at Vocodia, put it, is that machine learning suffers from the “garbage in, garbage out” issue.
Spoofed numbers can result in a lot of noise in a spam-detection machine learning model, said Vrabec. This could result in false signals.
“I could use your phone number and start spamming people with it,” she said. “If you started to collect the wrong data about that number, you could easily mess up somebody’s landline connection and deliverability of those calls.”
A ‘Fractured Environment’
Complicating the use of data to identify spam calls is the current nature of the telephony landscape. There are so many databases of call information being collected by different groups — the voice service providers, device companies, third-party apps, and even countries (think the National Do Not Call Registry) and states — all with slightly different systems of spam detection.
“Everybody’s doing it their own way and thinks that they have the better mousetrap,” Podolak said.
While there are some public registries of information on numbers, most of the carrier- or device-level registries do not interact. Both Khan and Podolak said that would have to change for bigger advances to be made in the use of machine learning against spam calls.
“The data has to be centralized somewhere, in my opinion, for it to be successful,” said Podolak. “Otherwise, you’re going to have what you have right now, which is this fractured environment.” If he wanted to get his number registered as legitimate to make sure it doesn’t show up as “spam likely,” he said, he would have to go to about a dozen different groups to do so at the moment.
Khan said that he thought such a combined registry could not be done by a conventional carrier or device company. Instead, ownership of such a centralized registry would have to fall on a body like the Federal Communications Commission. Barring that, he noted that “there are already initiatives and for-profit ventures who are very interested in standardizing this.”
It would be possible for spam detection efforts using machine learning to be more accurate, he added, but it would take cooperation across the fragmented landscape.
“The companies and private entities need to be able to talk to each other to share that data and enforce that standard,” Khan said.
Maybe if that happened, we’d all get fewer calls from Susie about our extended warranty.