On the technical track at All Things RTC, Scott Stephenson, CEO of DeepGram, gave an in-depth presentation about next-gen speech recognition, providing an overview of the past, present, and future of automatic speech recognition. In other words, teaching machines to do something humans already do all the time.
As a company, DeepGram serves enterprise clients with customized deep-learning-based speech recognition, transcription, and multi-channel analysis at massive scale. Their focus is real-time speech recognition in corporate environments, such as meetings and business phone calls, often with lower-fidelity audio.
Speech: Where It All Begins
As Scott pointed out, speech is the most natural form of communication. The main problem in understanding it is separating the signal from background interference. The way machines achieve this has evolved greatly over the past few decades.
The Past of Speech Recognition
Back in the 1980s, the first practical acoustic models were established: audio input was used to compute candidate words based on a previously established vocabulary and known speech patterns. The main problem was the extensive word matrices involved, which made the process computationally intractable far too quickly. Simply put, there were too many options requiring too much processing for the technology available at the time.
Through the 1990s and 2000s, Dragon Voice became known as a promising dictation platform, but one that required intensive training by each individual user. It never replaced typing as a form of input and couldn't be scaled across multiple users or customer bases, so it remained limited in scope and performance.
The Present of Speech Recognition
Speech recognition algorithms continued to evolve to the present day, with smart homes full of devices that respond to wake words such as "Hey Google" or "Alexa." While a great leap forward in performance and functionality, these devices are still based on limited vocabularies and lack the general ability to decipher context. They're helpful in simple use cases, but they require hard-coded performance parameters, adapt poorly to different users and situations, and struggle to filter out background noise.
The Future of Speech Recognition
Scott predicted that the future of speech recognition will be fueled by better hardware and machine learning organization, resulting in more powerful deep learning models that can be highly customized to the user (or company).
These deep neural networks, such as those employed by DeepGram, will involve active training and learning, with models understanding jargon and getting better at identifying disparate audio sources. It will essentially become “facial recognition” for words.
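To make the deep-learning approach concrete: modern acoustic models typically emit one label distribution per short audio frame, and a decoding step collapses those frames into text. Below is a minimal, illustrative sketch of greedy CTC-style decoding — a common technique in deep speech recognition generally, not a description of DeepGram's actual pipeline. The blank symbol and frame labels are made up for illustration.

```python
# Greedy CTC-style decoding sketch (illustrative, not DeepGram's pipeline).
# A deep acoustic model outputs one most-likely label per audio frame;
# decoding collapses consecutive repeats and drops the "blank" symbol
# to recover the transcript.

BLANK = "_"  # reserved CTC blank symbol (assumption for this sketch)

def ctc_greedy_decode(frame_labels):
    """Collapse a per-frame label sequence into a transcript."""
    out = []
    prev = None
    for label in frame_labels:
        # Keep a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# Example: ten audio frames, one argmax label each.
frames = ["_", "h", "h", "_", "e", "l", "l", "_", "l", "o"]
print(ctc_greedy_decode(frames))  # → "hello"
```

The blank symbol is what lets the model represent genuinely repeated letters (like the double "l" above) as distinct from one letter stretched across several frames.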
This training also involves showing input to humans, who then verify the output of the audio recognition system. Eventually, these systems will be used not only for tasks like call transcription, but also for speaker identification, as well as labeling emotional context and other more ephemeral outputs.
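The "facial recognition for words" analogy extends naturally to speaker identification: each enrolled speaker is represented by a fixed-length voice embedding, and a new utterance is matched to the closest enrolled embedding. The sketch below assumes such embeddings already exist (the vectors and speaker names are invented for illustration) and shows only the matching step via cosine similarity.

```python
# Embedding-based speaker identification sketch. Assumes an upstream
# model has already produced fixed-length voice embeddings; the vectors
# below are invented purely for illustration.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify(utterance_emb, enrolled):
    """Return the enrolled speaker whose embedding is most similar."""
    return max(enrolled, key=lambda name: cosine(utterance_emb, enrolled[name]))

# Two enrolled speakers with (made-up) 3-dimensional voice embeddings.
enrolled = {
    "alice": [0.9, 0.1, 0.0],
    "bob":   [0.1, 0.8, 0.2],
}
print(identify([0.85, 0.2, 0.05], enrolled))  # → "alice"
```

Real systems use much higher-dimensional embeddings and a similarity threshold to reject unknown speakers, but the matching principle is the same.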