“Hey, Siri, what’re the top priorities on my schedule today?”
Today, we may think of this as a relatively simple request for our handy voice assistants to accomplish. And yet, it’s the culmination of decades of progress in Automatic Speech Recognition (ASR) technology.
So, what is ASR? At its essence, ASR involves teaching machines to listen and understand our words, then act. But how did we even reach the point where a phrase like “OK Google” could unlock a world of capabilities?
From Thomas Edison’s dictation machine to Siri’s casual chat, the history of ASR is a tale of innovation that has transformed not just the gadgets in our pockets but how we function in modern society.
Let’s trace the steps that brought voice commands to life.
1800s: The Origins of Dictation
- 1879 – Inventor Thomas Edison unveils the world’s first dictation machine
1950s: The Infancy of Speech Recognition Technology
- 1952 – Bell Laboratories introduces “Audrey,” a machine that recognizes the voice of its developer, K. H. Davis, identifying spoken digits from 0 to 9 with an accuracy rate exceeding 90%.
1960s: From Digits to Spoken Words
- 1962 – IBM showcases its “Shoebox” machine, which understands 16 spoken English words, including the digits 0 through 9.
1970s: From Words to Complete Sentences
- 1971-1976 – The U.S. Department of Defense runs the DARPA Speech Understanding Research (SUR) initiative to advance voice recognition technologies with potential uses in both military and civilian contexts.
- 1976 – Researchers at Carnegie Mellon University develop “Harpy,” an advanced ASR system that grew out of the DARPA initiative. It can comprehend 1,011 words and recognize some complete sentences, a significant advance for the field.
1980s: From a Few Hundred Words to Several Thousand | The Decade of HMM
The ’80s were a pivotal decade for ASR technologies. The adoption of a statistical method known as the Hidden Markov Model (HMM) was an inflection point: it revolutionized language modeling, improved accuracy, and laid the groundwork for more sophisticated ASR.
Unlike earlier approaches that relied solely on word recognition and sound patterns, HMMs added a probabilistic dimension by estimating which phonemes are most likely to follow a given phoneme.
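To make the idea of "the most probable phoneme to follow a given phoneme" concrete, here is a minimal sketch of just the transition part of such a model. It is an illustration under stated assumptions, not how Tangora or any historical system was built: the toy phoneme sequences, the ARPAbet-style symbols, and the function names `estimate_transitions` and `most_probable_next` are all hypothetical. A full HMM would also model emission probabilities that link each hidden phoneme state to the observed audio features.

```python
from collections import Counter, defaultdict

# Toy phoneme transcriptions (ARPAbet-style symbols). Illustrative data only.
TRAINING_SEQUENCES = [
    ["HH", "EH", "L", "OW"],  # "hello"
    ["Y", "EH", "L", "OW"],   # "yellow"
    ["F", "EH", "L", "OW"],   # "fellow"
    ["T", "EH", "L"],         # "tell"
    ["B", "EH", "D"],         # "bed"
]


def estimate_transitions(sequences):
    """Estimate P(next phoneme | current phoneme) from adjacent-pair counts."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    transitions = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        transitions[current] = {p: c / total for p, c in followers.items()}
    return transitions


def most_probable_next(transitions, phoneme):
    """Return (phoneme, probability) for the likeliest successor, or None."""
    followers = transitions.get(phoneme)
    if not followers:
        return None
    return max(followers.items(), key=lambda item: item[1])


if __name__ == "__main__":
    transitions = estimate_transitions(TRAINING_SEQUENCES)
    print(most_probable_next(transitions, "EH"))
```

In this toy data, "EH" is followed by "L" four times out of five, so the model prefers "L" after "EH". A recognizer can combine preferences like this with the acoustic evidence to pick between similar-sounding candidates.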
- Mid ’80s – IBM introduces “Tangora,” an HMM-based voice-activated typewriter with a 20,000-word spoken vocabulary.
- 1987 – Worlds of Wonder releases Julie: the world’s most intelligent doll. It is the first “fully interactive toy,” thanks to a DSP chip that allows it to respond to and generate basic speech.
1990s: Microprocessors Change the ASR Game
Before the ’90s, ASR systems relied on discrete dictation, which required the speaker to pause after every word so that the system could recognize it accurately. In the 1990s, faster microprocessors changed the field by enabling quicker and more accurate speech pattern recognition.
- 1990 – “Dragon Dictate” makes history as the world’s first speech recognition software tailored for consumer use, revolutionizing how individuals interact with computers and opening the door to a new era of speech-enabled applications.
- 1997 – Dragon releases “Dragon NaturallySpeaking,” the first continuous speech recognition product, capable of understanding speech at up to 100 words per minute.
2000s: Speech Recognition Technology Becomes Faster and More Accurate
While the technology evolved continuously over the decade, the most significant milestone was the introduction of Google’s Voice Search app, which exposed tens of millions of users to speech recognition technology. It also let Google gather petabytes of voice data that could be used to advance the technology and improve its predictions.
2010s: The Digital Assistant Explosion
The 2010s witnessed a rapid rise in smartphone market penetration. At the start of the decade, just 20% of the U.S. population owned a smartphone. Within a few short years, smartphones became an indispensable part of daily life, and by 2020, 72.2% of the population had one in their pocket. [1]
This fueled a surge in speech recognition software and apps, sparked by the release of smart speakers and digital assistants such as Siri and Alexa.
- 2011 – Apple launches Siri, the world’s first intelligent digital assistant on a phone.
- 2017 – Google’s machine learning algorithms achieve a 95% word accuracy rate for English, a level comparable to human performance.
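For readers wondering what a figure like "95% word accuracy" measures, the sketch below shows one common way to compute it: word error rate (WER), the word-level edit distance between a reference transcript and the recognizer's output, divided by the reference length. The source does not say how Google measured its figure, so treating word accuracy as 1 minus WER is an assumption here, and the function name `word_error_rate` and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from output
            insertion = current[j - 1] + 1  # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    reference = "what are the top priorities on my schedule today"
    hypothesis = "what are the top priority on my schedule today"
    wer = word_error_rate(reference, hypothesis)
    print(f"WER: {wer:.3f}, word accuracy (assumed 1 - WER): {1 - wer:.3f}")
```

Under this definition, one wrong word in a 20-word reference gives a WER of 0.05, which corresponds to 95% word accuracy.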
2020s: AI and ASR
The 2010s may have been defined by the rise of digital assistants, but the 2020s are shaping up to be the decade where AI’s influence on ASR technology becomes truly transformative, in terms of both acoustics and semantics. Some of the notable advancements include: [2, 3, 4]
- Optimization Techniques – By harnessing innovations such as Faster-whisper and NVIDIA-wav2vec2, the ASR industry has been able to significantly reduce both training and inference times while making ASR tech more accessible and deployable.
- Generative AI – Generative AI is heralding a revolution in human-digital interaction, employing avatars, Textless NLP, and innovative models like VALL-E for direct audio processing, voice cloning, and flexible, context-aware applications.
- Conversational AI – Conversational AI is rapidly advancing with personal assistants like Alexa and Siri, evolving from text-based systems to sophisticated voice-based interfaces, with a focus on interoperability, nuanced communication, support for diverse accents, and multi-task learning frameworks for a wide array of spoken language tasks.
- Global reach with multilingual ASR – With the introduction of multilingual speech recognition systems, companies are now making their applications and services available to a global audience.
- Enhanced accessibility through automated captions – Live video content is now more accessible and inclusive than ever thanks to automated captioning.
- AI-driven accuracy enhancements – AI continues to drive unprecedented levels of accuracy in advanced speech recognition technology. Through continuous learning and adaptation, ASR software is becoming more intuitive and responsive, paving the way for future innovations.
The Future of ASR Technologies
The evolution of AI technologies like machine learning (ML), deep learning (DL), natural language processing (NLP), neural networks, and ASR has accelerated exponentially in recent years. This rapid growth is pushing the boundaries of speech recognition and is poised to continue its transformational influence at an unprecedented pace over the coming decade.
With many practical applications in the legal industry—from contract review and negotiation to litigation prediction and analytics, legal research, transcription and more—speech recognition and artificial intelligence in law firms will continue to gain adoption.
1. Statista. Smart Phone Penetration Rate as Share of the Population in the US. https://www.statista.com/statistics/201183/forecast-of-smartphone-penetration-in-the-us/
2. Towards Data Science. Overcoming Automatic Speech Recognition Challenges: The Next Frontier. https://towardsdatascience.com/overcoming-automatic-speech-recognition-challenges-the-next-frontier-e26c31d643cc
3. NVIDIA. Essential Guide To Automatic Speech Recognition Technology. https://developer.nvidia.com/blog/essential-guide-to-automatic-speech-recognition-technology/
4. Customerzone360. AI Is Driving Greater Accuracy in Advanced Speech Recognition. https://www.customerzone360.com/topics/customer/articles/455823-ai-driving-greater-accuracy-advanced-speech-recognition.htm#