If you have a phone in the United States, you’ve probably gotten a call from a recorded voice pitching you a product to buy or a cause to donate to or a free trial to sign up for.
Spam and scam calls are ubiquitous. First Orion estimated that approximately 110 billion scam calls were placed in 2021 alone, which may help explain why 90 percent of calls from unknown numbers go unanswered.
What Are Potential Spam or Scam Likely Calls?
The increasingly common spam warning message that pops up on your phone when it rings is part of the ongoing battle against such calls. These warnings are the result of machine learning efforts deployed by voice service providers, device companies and third-party app makers. Not only can the warnings alert users before they pick up a call, but the underlying detection can also help catch the scammers.
How Machine Learning Generates Spam Call Warnings
When your phone’s caller ID says “Spam Risk” or “Scam Likely,” that is based on a machine learning analytics engine used by the carrier, Mike Rudolph, chief technology officer at YouMail, said. The big three carriers all partner with different analytics engine vendors: AT&T with Hiya, Verizon with TNS and T-Mobile with First Orion.
“All three of those guys have used machine learning based upon the data set they operate from in order to give you that ‘spam risk’ indication on those three mobile operators,” Rudolph said.
How Does Machine Learning Fight Spam Calls?
The data sets carriers use for this process come from call detail records. Calls made over the phone or via voice over internet protocol systems generate call detail records, which are logged by voice service providers (also called carriers) and telephone exchanges (also known as switches). Call detail records contain basic metadata about the call, such as its origin and destination, the type of media (audio, SMS and so on), its duration and whether the call connected.
To identify suspicious callers, analytics engine vendors typically rely on behavioral analytics that examine two signals: reach (how many people a particular number is calling) and frequency (how many calls it makes in a given period of time).
Rudolph gave an example of a new number suddenly making tens of thousands of calls within a network around 9 a.m. on certain days.
“The behavioral analytics have been trained that a number it hasn’t seen before that makes 50,000 calls at 9 a.m. on a Monday is suspicious,” he said. “It will be marked as ‘spam likely.’”
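The reach and frequency checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CallRecord` fields are a simplified, hypothetical version of a call detail record, and the thresholds are made-up values, not figures used by any carrier or analytics vendor.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CallRecord:
    """Hypothetical, simplified call detail record (metadata only)."""
    origin: str        # calling number
    destination: str   # called number
    start: datetime    # when the call was placed
    duration_s: int    # call length in seconds
    connected: bool    # whether the call was answered


def reach_and_frequency(records, window_start, window):
    """Return {origin: (reach, frequency)} for calls placed inside the window."""
    window_end = window_start + window
    destinations = defaultdict(set)
    counts = defaultdict(int)
    for rec in records:
        if window_start <= rec.start < window_end:
            destinations[rec.origin].add(rec.destination)  # reach: distinct people called
            counts[rec.origin] += 1                        # frequency: calls in the window
    return {origin: (len(destinations[origin]), counts[origin]) for origin in counts}


def flag_suspicious(records, window_start, window, known_numbers,
                    min_reach=1000, min_calls=1000):
    """Flag previously unseen numbers whose reach and frequency exceed the thresholds.

    The default thresholds are illustrative assumptions only.
    """
    stats = reach_and_frequency(records, window_start, window)
    return sorted(
        origin for origin, (reach, calls) in stats.items()
        if origin not in known_numbers and reach >= min_reach and calls >= min_calls
    )


# Usage: a new number places 5,000 calls in one hour on a Monday morning.
monday_9am = datetime(2022, 1, 3, 9, 0)
burst = [
    CallRecord("+15550100", f"+1555{n:07d}", monday_9am + timedelta(seconds=n % 3600), 5, False)
    for n in range(5000)
]
normal = [CallRecord("+15550199", "+15550123", monday_9am, 120, True)]
flagged = flag_suspicious(burst + normal, monday_9am, timedelta(hours=1),
                          known_numbers={"+15550199"})
```

A production engine would learn such thresholds from historical data rather than hard-code them, which is where the machine learning comes in.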
In addition to the data in call detail records, phones offer native tools to help identify and mark spam calls, which provides another data stream that can be used in machine learning processes to identify potential spam calls. Apple, for example, has its “Silence Unknown Callers” feature on phones running iOS 13 and later operating systems. The Google Phone app for Android similarly includes caller ID and spam protection options that allow users to mark calls as spam.
Carriers similarly have their own systems: T-Mobile has ScamShield powered by First Orion, Verizon has Call Filter powered by TNS’ Call Guardian and AT&T has Call Protect powered by Hiya. Third-party apps like YouMail, RoboKiller, CallApp and those put out by Hiya, TNS and First Orion also allow users to mark calls as spam.
“An entry such as this gets added to the database as a spam call entry along with other regular calls,” Albar Wahab, a data scientist trainee at Data Science Dojo, said. Feature engineering can be used to select the best indicators of spam calls, he added. Then traditional machine learning classification algorithms, such as support vector machines, can be applied to predict whether a future incoming call is potential spam. Deep learning algorithms like convolutional neural networks and long short-term memory networks can also be used to effectively automate the feature engineering step.
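To make the classification step concrete, here is a minimal sketch of a linear support vector machine trained on hand-engineered features. Everything in it is an assumption for illustration: the three features, the toy training rows and the hyperparameters are invented, and the model is trained with plain subgradient descent on the hinge loss rather than with a production library.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Train a linear SVM by subgradient descent on the L2-regularized hinge loss.

    X is a list of feature vectors; y holds labels of +1 (spam) or -1 (legitimate).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: move toward the example.
                w = [wj - lr * (reg * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: apply only the regularization shrink.
                w = [wj - lr * reg * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 for likely spam, -1 for likely legitimate."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical engineered features per calling number, each scaled to [0, 1]:
# [share of calls under 10 seconds, share of calls unanswered, share reported as spam]
X_train = [
    [0.90, 0.80, 0.70], [0.80, 0.90, 0.60], [0.95, 0.70, 0.90],  # labeled spam
    [0.10, 0.20, 0.00], [0.20, 0.10, 0.05], [0.15, 0.30, 0.00],  # labeled legitimate
]
y_train = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X_train, y_train)
spam_guess = predict(w, b, [0.95, 0.90, 0.90])   # looks like the spam examples
legit_guess = predict(w, b, [0.10, 0.10, 0.00])  # looks like the legitimate examples
```

User spam reports supply the labels here, which is why the marking tools in phones and apps matter: without labeled entries, a supervised classifier has nothing to learn from.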
How Do Spoofed Calls and Robocalls Work?
Not all spam calls are robocalls, and not all robocalls are spam, though there can be a lot of overlap. When a robocall is carried out for spamming purposes, these calls are usually spoofed, meaning the number that appears on your caller ID is not the actual number from which the call originated.
What Are Spoofed Numbers and Calls?
While call spoofing can be done for legitimate reasons — like when a doctor calls you back from their personal phone but the office’s number is displayed on your caller ID to protect the doctor’s privacy — when scammers use spoofed robocalls, it’s to avoid being detected and tracked down.
“Offshore, and even onshore, less desirable-type companies you wouldn’t want to work with don’t want you to find out who they are, and they don’t want you to call them back on their real phone number, so they spoof a phone number,” said Brian Podolak, CEO at Vocodia, an AI sales and customer service platform.
Spoofing is a growing part of robo-scammers’ arsenals, and this can dull the edge that machine learning puts on scam detection efforts. The short version, as Omer Khan, chief technology officer at Vocodia, put it, is that machine learning suffers from the “garbage in, garbage out” issue.
Spoofed numbers can add a lot of noise to a spam-detection machine learning model and produce false signals, said Rachel Vrabec, founder and CEO of Kanary, a data privacy service.
“I could use your phone number and start spamming people with it,” she added. “If you started to collect the wrong data about that number, you could easily mess up somebody’s landline connection and deliverability of those calls.”
Apps to Identify and Block Spam Calls
While voice service providers are limited to using the data in call detail records to identify potential spam calls due to privacy laws, third-party apps and services can let users access more information about calls.
Audio Fingerprinting Apps
YouMail, a robocall-blocking service, uses an audio fingerprint system to analyze the content of a call to identify known and scam-likely robocalls without anyone actually listening to the call.
“We are 100 percent based on the audio in calls and we do nothing related to reach or frequency of calls,” Rudolph said. “For us, because we are an over the top information service, we can train machine learning based upon what the call said. That’s a completely different machine learning.”
YouMail, Rudolph explained, takes the audio of calls and turns it into images using the fast Fourier transform, or FFT, and the constant-Q transform, or CQT. The resulting image is the audio fingerprint of a call. Using both supervised and unsupervised machine learning algorithms, YouMail plots the auditory differences between sample calls’ fingerprints and those of known scam calls. The smaller the auditory difference between a sample or ongoing call’s fingerprint and that of known scam calls, the more likely that call is to be a scam.
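The general idea of comparing spectral fingerprints can be sketched as follows. This is not YouMail's pipeline, whose details are not public in this article. It is a toy under several assumptions: synthetic sine tones stand in for call audio, a naive discrete Fourier transform replaces an optimized FFT (it yields the same values, only more slowly), the constant-Q transform is left out, and plain Euclidean distance stands in for whatever learned comparison a real system would use.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude spectrum of one frame via a naive DFT (an FFT gives the same values faster)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def fingerprint(samples, frame_size=64):
    """Flattened spectrogram: frequency magnitudes for each non-overlapping frame,
    scaled to unit length so that loudness does not affect the comparison."""
    rows = [
        dft_magnitudes(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    flat = [value for row in rows for value in row]
    norm = math.sqrt(sum(value * value for value in flat)) or 1.0
    return [value / norm for value in flat]


def distance(fp_a, fp_b):
    """Euclidean distance between two equal-length fingerprints (0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp_a, fp_b)))


def tone(freq_hz, n_samples=256, rate=8000):
    """Synthetic stand-in for call audio: a pure sine tone."""
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n_samples)]


known_scam = fingerprint(tone(500))
near_copy = fingerprint([0.9 * s for s in tone(500)])  # same content, played quieter
other_call = fingerprint(tone(1500))                   # different content

d_copy = distance(known_scam, near_copy)
d_other = distance(known_scam, other_call)
```

In this toy, the quieter copy sits almost on top of the known fingerprint while the unrelated tone lands far away, mirroring the rule that a smaller difference means a more likely match.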
The audio fingerprints can also be used to identify potential new scams as they happen, either based on a new cluster of very similar content or because of the content itself.
“For example, our machine learning knows some things that are binary,” Rudolph said. “If you get a call that says it is the [Internal Revenue Service] or the [Social Security Administration], that’s unequivocally going to be a fraudster calling you.”
The ability to identify scam likely calls in progress using audio fingerprints also allows for faster reporting to potentially identify bad actors or at least the voice service provider that carried the call.
When YouMail encounters a call that matches the audio fingerprints of known scam calls, it can be sent to the Industry Traceback Group within seconds of being identified, Rudolph said. The Industry Traceback Group can then track the scam call back to the provider that enabled the call. Because of the TRACED Act, which was signed into law in 2019, voice service providers are required to shut down accounts that send out unlawful calls.
Data Privacy Apps
Just as those who fight spam calls with machine learning thrive on data, so do the scammers. Personal information, especially phone numbers, can easily circulate in obscure online databases without the owner’s knowledge. These corners of the web, usually populated through data breaches and data sales, give scammers accessible troves of phone numbers.
Scam call reduction is not generally the purpose of most data privacy apps or services, but reducing the publicly available data that scammers access can have the side effect of reducing scam-likely calls as well.
“One thing that we tried to do at Kanary is identify the data sources that spammers use and then remove the data from there,” said Rachel Vrabec, founder and CEO of Kanary, a data privacy service. By removing their phone numbers and other personal information from public sources, customers are made less searchable. Being less searchable then makes it harder for robocalling scammers to identify live numbers to call, Vrabec said.
“When you look at the supply chain of these phone numbers and how they end up in spammers’ arsenals to use, you don’t want to be like the first number on all their lists,” Vrabec added. “The goal is to help you keep your phone number more private.”
The Future of Spam Call Detection
Complicating the use of data to identify spam calls is the current nature of the telephony landscape. There are so many databases of call information being collected by different groups — the voice service providers, device companies, third-party apps, states and even countries (think the National Do Not Call Registry) — all with slightly different systems of spam detection.
“Everybody’s doing it their own way and thinks that they have the better mousetrap,” Podolak said.
While there are some public registries of information on numbers, most of the carrier- or device-level registries do not interact. Both Khan and Podolak said that would have to change for bigger advances to be made in the use of machine learning against spam calls.
“The data has to be centralized somewhere, in my opinion, for it to be successful,” said Podolak. “Otherwise, you’re going to have what you have right now, which is this fractured environment.” If he wanted to get his number registered as legitimate to make sure it doesn’t show up as “spam likely,” he said, he would have to go to about a dozen different groups to do so at the moment.
Khan believes such a combined registry could not be done by a conventional carrier or device company. Instead, ownership of such a centralized registry would have to fall on a body like the Federal Communications Commission. Barring that, he noted that “there are already initiatives and for-profit ventures who are very interested in standardizing this.”
It would be possible for spam detection efforts using machine learning to be more accurate, he added, but it would take cooperation across the fragmented landscape.
“The companies and private entities need to be able to talk to each other to share that data and enforce that standard,” Khan said.
Maybe if that happened, we’d all get fewer mysterious calls about our extended warranty.