It’s estimated that the AI market will reach nearly $4 trillion by the end of 2022, and one huge contributor to this growth is smart cloud communications. Communications companies are investing in intelligent solutions to boost the efficacy and efficiency of telephony applications such as call tracking, contact centers, chatbots, and more. One of the most promising intelligent solutions for cloud communications is automatic speech recognition technology.



What Is Automatic Speech Recognition?

Automatic speech recognition, or ASR, uses deep learning, machine learning, and/or artificial intelligence technology to convert human speech into readable text. Recent advances in deep learning and related fields have significantly increased the accuracy and usability of ASR systems, making them more useful and accelerating their integration into our daily lives. We see ASR applications in everything from social media captions to podcast transcriptions to media monitoring.

Speech-to-text APIs work to accelerate innovation in the communications industry through two main avenues: audio transcription and audio intelligence. 


Audio Transcription

Speech-to-text transcription via intelligent APIs lets companies convert audio and/or video files to a text file quickly and with human-level accuracy. This can be done either asynchronously, where previously recorded audio or video streams are transcribed after the fact, or in real time, where the API streams transcripts within a few milliseconds of each word spoken.

In addition, some APIs offer other features that boost both the accuracy and readability of transcripts. These include speaker diarization, or speaker labels, which automatically detects the number of speakers in each audio or video stream and assigns each word or text segment to the corresponding speaker. Custom vocabulary, in which a user adds a list of terms and/or spellings unique to the company or industry, also significantly increases a transcript’s accuracy.
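To make speaker labels concrete, here is a minimal sketch of how word-level results might be merged into readable speaker turns. The `Word` structure and its field names are assumptions made for illustration, not the output format of any particular speech-to-text API.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Word:
    text: str
    speaker: str   # hypothetical speaker label, e.g. "A" or "B"
    start_ms: int  # hypothetical word start time in milliseconds


def group_by_speaker(words):
    """Merge consecutive words from the same speaker into one utterance."""
    utterances = []
    for speaker, run in groupby(words, key=lambda w: w.speaker):
        run = list(run)
        utterances.append({
            "speaker": speaker,
            "start_ms": run[0].start_ms,
            "text": " ".join(w.text for w in run),
        })
    return utterances


# Made-up sample data standing in for diarized transcription output.
words = [
    Word("Thanks", "A", 0),
    Word("for", "A", 310),
    Word("calling.", "A", 480),
    Word("Hi,", "B", 1200),
    Word("I", "B", 1450),
    Word("need", "B", 1560),
    Word("help.", "B", 1800),
    Word("Sure.", "A", 2600),
]

for u in group_by_speaker(words):
    print(f"[{u['start_ms']:>5} ms] Speaker {u['speaker']}: {u['text']}")
```

Grouping only consecutive words keeps the turn-by-turn order of the conversation, which is usually what a reader of a call transcript wants.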

Additional accuracy-boosting features include:

  • Filler words (uh, um, etc.) recognition
  • Automatic punctuation and casing
  • Automatically breaking down transcriptions into paragraphs and sentences
  • Profanity filtering
  • Word search

And more. 
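A few of these features are simple enough to sketch as post-processing on a finished transcript. The example below is a rough illustration of filler-word removal, profanity masking, and word search. The word lists and function names are hypothetical, and a production API would likely handle these steps inside the recognition model rather than with string matching.

```python
import re

FILLERS = {"uh", "um", "er", "hmm"}  # illustrative list, not exhaustive
BLOCKLIST = {"darn"}                 # stand-in for a real profanity list


def normalize(token):
    """Lowercase a token and strip surrounding punctuation for matching."""
    return re.sub(r"^\W+|\W+$", "", token.lower())


def clean_transcript(text):
    """Drop filler words and mask blocklisted words with asterisks."""
    kept = []
    for token in text.split():
        key = normalize(token)
        if key in FILLERS:
            continue
        if key in BLOCKLIST:
            token = re.sub(r"\w", "*", token)
        kept.append(token)
    return " ".join(kept)


def word_search(text, term):
    """Return the word indexes at which `term` appears in the text."""
    return [i for i, tok in enumerate(text.split())
            if normalize(tok) == term.lower()]


raw = "Um, the darn invoice is, uh, late and the invoice total is wrong."
cleaned = clean_transcript(raw)
print(cleaned)                          # the **** invoice is, late and the invoice total is wrong.
print(word_search(cleaned, "invoice"))  # [2, 7]
```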


Audio Intelligence

The real ROI for cloud communications is found in the additional AI-backed features provided by some speech-to-text APIs, referred to as audio intelligence features. These powerful features aid the creation of smarter analytics, giving companies that invest in them a significant competitive advantage.

For example, one SaaS solution provider uses audio intelligence to power its Conversation Intelligence software that provides pay-per-click, SEO, marketing call tracking, and automated insights for phone calls. Not only is the provider itself able to charge more for its intelligent product, it also enables others to optimize their marketing spend and increase ROI on more targeted ad placements.

A call center company uses audio intelligence for what it refers to as predictive behavioral routing of its calls. Predictive behavioral routing “analyzes the speech patterns of callers and matches them up with people in the call center that have ‘compatible’ personality types,” significantly increasing the number of successful calls that take place and, in turn, increasing customer loyalty.

WhatConverts, a lead tracking and reporting company, uses speech-to-text transcription to automatically create accurate call transcripts. Then the company applies audio intelligence to qualify leads, identify quotable leads, and flag leads for follow-up. This automated process speeds up the lead qualification process and increases conversion rates.

Let’s dive deeper into the different audio intelligence features on the market today: 

Current Audio Intelligence Features

  • Sentiment analysis
  • Topic detection
  • Content safety detection
  • PII redaction
  • Summarization
  • Entity detection


Sentiment Analysis

Sentiment analysis detects the sentiment, typically positive, negative, or neutral, of speech segments in an audio or video file. Call centers often use sentiment analysis to gauge how customers and agents feel during their conversations. It can track customer attitudes toward a product, a service, or even the agent, helping companies make more informed marketing decisions, facilitate better agent training, and improve customer satisfaction.
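As a rough sketch of how per-segment sentiment could feed call center analytics, the example below tallies labels per speaker and computes a simple net score. The segment format and the scoring formula are assumptions chosen for illustration. Real APIs return their own structures and often include confidence values.

```python
from collections import Counter, defaultdict

# Hypothetical per-segment output: (speaker, sentiment label).
segments = [
    ("customer", "negative"),
    ("agent", "neutral"),
    ("customer", "negative"),
    ("agent", "positive"),
    ("customer", "neutral"),
    ("customer", "positive"),
]


def sentiment_by_speaker(segments):
    """Count sentiment labels per speaker."""
    counts = defaultdict(Counter)
    for speaker, label in segments:
        counts[speaker][label] += 1
    return counts


def net_score(counter):
    """(positive - negative) / total, ranging from -1.0 to 1.0."""
    total = sum(counter.values())
    return (counter["positive"] - counter["negative"]) / total if total else 0.0


counts = sentiment_by_speaker(segments)
for speaker, counter in counts.items():
    print(speaker, dict(counter), round(net_score(counter), 2))
```

Splitting the tally by speaker matters here: a frustrated customer paired with a consistently positive agent tells a different story than a single blended score for the whole call.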


Topic Detection

Topic detection automatically identifies and labels topics in audio or video files, using the categories defined by the IAB (Interactive Advertising Bureau) Content Taxonomy. With topic detection, communications companies can analyze transcripts more easily and engage more effectively in contextual and behavioral advertising and targeting. This intelligent targeting directly translates into greater lead conversions.


Content Safety Detection

Content safety detection lets users identify and filter audio or video content for sensitive and harmful information such as violence, hate speech, alcohol, drugs, and more. This is especially useful for online content moderation and for vetting advertorial placements. 
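One way a moderation workflow might consume content safety results is to act only on flags above a confidence threshold. The sketch below assumes a hypothetical detector output of label and confidence pairs. The label names, the threshold, and the approve/review/reject policy are all illustrative choices.

```python
BLOCKED_LABELS = {"violence", "hate_speech"}  # rejected outright when confident
REVIEW_LABELS = {"alcohol", "drugs"}          # sent to a human reviewer


def moderate(flags, threshold=0.8):
    """Return 'reject', 'review', or 'approve' for a list of safety flags."""
    decision = "approve"
    for flag in flags:
        if flag["confidence"] < threshold:
            continue  # ignore low-confidence detections
        if flag["label"] in BLOCKED_LABELS:
            return "reject"
        if flag["label"] in REVIEW_LABELS:
            decision = "review"
    return decision


# Made-up detector output for one audio file.
flags = [
    {"label": "alcohol", "confidence": 0.91},
    {"label": "violence", "confidence": 0.42},
]
print(moderate(flags))  # review
```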


Personally Identifiable Information Redaction

Personally identifiable information (PII) redaction automatically identifies and removes PII, such as addresses, social security numbers, and credit card numbers, from a transcription. This helps communications companies better adhere to privacy and security laws or meet internal policy requirements, so that customers can be confident that all data is handled with proper care.
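The basic idea of redaction can be illustrated with pattern matching, as in the sketch below. It is a deliberately simplified stand-in: the two regular expressions only cover US-style social security numbers and 16-digit card numbers written in common formats, and they would not catch addresses or numbers spoken as words, which is why production systems typically rely on trained models.

```python
import re

# Deliberately simple patterns; illustrative only.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def redact(text):
    """Replace each matched span with a bracketed label such as [SSN]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "My social is 123-45-6789 and my card is 4111 1111 1111 1111."
print(redact(sample))  # My social is [SSN] and my card is [CREDIT_CARD].
```

Replacing the value with a label rather than deleting it keeps the transcript readable and records what kind of information was removed.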



Summarization

Summarization breaks audio or video file transcripts into logical chapters (for example, when the conversation changes topics) and then automatically generates a summary for each of these chapters, sort of like the CliffsNotes of a transcription. For call centers, this makes phone calls easier to navigate and simplifies QA when needed. Virtual meeting platforms use summarization to produce more digestible meeting summaries for postmortem discussions and analytical applications.


Entity Detection

Entity detection locates and classifies entities within a transcription text. For example, Seattle is an entity that would be classified as a location. Communications platforms use entity detection to automatically populate certain fields, categorize and analyze conversations, and improve customer response time. Voice bots use entity detection to trigger actions that automate and personalize interactions based on a specific entity detected, such as an individual or company name. 
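To show how detected entities might populate fields automatically, here is a minimal sketch that uses a small lookup table in place of a trained entity model. The gazetteer entries, the entity types, and the record layout are hypothetical.

```python
import re

# Toy gazetteer standing in for a trained entity model (hypothetical data).
GAZETTEER = {
    "seattle": "location",
    "acme corp": "organization",
    "dana lopez": "person",
}


def detect_entities(text):
    """Return (entity text, entity type) pairs in order of appearance."""
    found = []
    for name, etype in GAZETTEER.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            found.append((m.start(), m.group(0), etype))
    return [(entity, etype) for _, entity, etype in sorted(found)]


def populate_fields(entities):
    """Fill a CRM-style record with the first entity of each type."""
    record = {}
    for entity, etype in entities:
        record.setdefault(etype, entity)
    return record


transcript = "Hi, this is Dana Lopez from Acme Corp, calling from Seattle."
entities = detect_entities(transcript)
print(populate_fields(entities))
```

A voice bot could use the same record to trigger actions, for example looking up an account as soon as an organization name is detected.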


The Future of Cloud Communications

Both speech-to-text transcription and audio intelligence features are promising areas of investment for cloud communications companies looking to drive innovation in the field, maximize ROI, and secure a competitive position.

Additional AI-backed ASR features in the works will only spur this innovation further. Look for features such as emotion detection, which will let companies analyze more specific emotions like anger, elation, frustration, or satisfaction in a transcription text, as well as intent recognition and more to aid analytical power and further advance the industry.

Read More From Dylan Fox on Built In’s Expert Contributor Network 3 Biggest Mistakes to Avoid When Hiring AI and ML Engineers
