Speech Recognition - Ready for Prime Time?
by Jarred Walton on April 21, 2006 9:00 AM EST
Accuracy Testing
To keep this article coherent, I decided to cut back on the number of test results reported. I started doing some comparisons of trained versus untrained installations, but an untrained installation is really a temporary state, since the software learns as you use it. I have a Dragon installation that I've been using for a while, so that side of the equation is covered. I haven't used Microsoft's speech recognition package nearly as much, but I wanted to make sure I gave it a reasonable chance, so I went through additional training sessions with Office 2003. I also opened several of my articles and had the speech engine learn from their content.
One major advantage of DNS is that it will scan your My Documents folder when you first configure it, and as far as I can tell it adds most of the words in your text documents to its recognition engine. Microsoft Office's speech tool can do this as well, but you have to do it manually, one document at a time. I wanted to be fair to both products, but eventually my patience with Microsoft Office 2003 ran out, so it's not as "trained" as DNS 8.
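To give a rough idea of what that kind of document scan amounts to, here is a conceptual sketch of harvesting candidate vocabulary from a folder of text files. This is purely an illustration of the idea - not how DNS actually implements the feature - and the folder path is a placeholder:

```cpp
// Illustrative sketch only: collect word frequencies from plain-text files in a
// folder, the sort of pass a recognizer could use to seed its custom vocabulary.
// Not DNS's actual implementation; the path below is a placeholder.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <cctype>

int main()
{
    std::map<std::string, int> frequency;

    for (const auto& entry : std::filesystem::directory_iterator("C:/Users/Example/Documents"))
    {
        if (entry.path().extension() != ".txt")
            continue;

        std::ifstream file(entry.path());
        std::string token;
        while (file >> token)
        {
            // Normalize to lowercase and strip punctuation/digits.
            std::string word;
            for (char c : token)
                if (std::isalpha(static_cast<unsigned char>(c)))
                    word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            if (!word.empty())
                ++frequency[word];
        }
    }

    // Words that show up repeatedly are good candidates for the custom vocabulary.
    for (const auto& [word, count] : frequency)
        if (count >= 5)
            std::cout << word << " (" << count << ")\n";
    return 0;
}
```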
Both Dragon and Microsoft Office let you trade recognition speed against accuracy, so I tested performance and accuracy at numerous settings. For Dragon, there are essentially six settings, ranging from minimum accuracy to maximum accuracy. The slider can be adjusted in smaller increments, but if you click in the slider bar it will jump between six positions, with each one bringing a moderate change in performance and possibly a change in accuracy.
I tested at all six settings, but I'm only going to report the minimum, medium, and maximum accuracy scores in the charts. Dragon also has the ability to transcribe a recording directly from a WAV file at maximum speed, so I'll include a separate chart for that. Microsoft's speech engine also has a linear slider, but I limited testing to the minimum, middle, and maximum accuracy values. If you would like to see the other test results, the text is available in this Zip file (1 MB).
At the request of some readers, I have also made the MP3 files available for download. (Don't make fun of my voice recordings without making some of your own, though!)
Precise Dictation (5.3MB)
Natural/Rapid Dictation (4.4 MB)
All of these tests were performed on the X2 system with the "trained" speech profiles. I would like to try to train Microsoft's tool more, but it just doesn't have a very intuitive interface. When you say a word or phrase that DNS doesn't recognize, you simply say "spell that" and provide the correct spelling. In most instances, that will allow DNS to recognize the word(s) in the future. This is particularly useful for names of family/friends/associates/etc. Acronyms can also be trained in this manner, but many acronyms sound similar to other standard words, and they definitely cause recognition difficulties. For example, "Athlon X2" still often comes out as "Athlon axe two" and "SATA" (pronounced, not spelled out) is still recognized as "say to" or "say that".
My experience with Microsoft's speech tool is that it is best used for rough drafts and that you shouldn't worry about correcting errors initially. Once you've got the basic text in place, you should go through and manually edit the errors. That's basically what Microsoft's training wizard tells you as well, so right away their goals seem less ambitious - and thus their market is also more limited. Luckily, the text being dictated here isn't as complex as that in some of my articles, so Microsoft does pretty well.
Dictation Accuracy
Both packages clearly meet the 90% or higher accuracy claims with practiced dictation. Once you get above 90%, though, every additional point of accuracy becomes exponentially more difficult to acquire. With that in mind, the 96% accuracy achieved is impressive. The more specialized your dictation, the higher your chance of errors, but for general language both are capable. Somewhat interesting is that the maximum accuracy settings don't actually improve things in all cases. The lowest accuracy setting usually does the worst, but everything above the Medium setting (the default) gets both better and worse - some phrases are corrected, and others suddenly get misinterpreted.
The final thing to consider is that in all cases the computer is able to keep up with the user - though maximum accuracy on DNS barely manages to do so. The sound file being dictated here is 9:21 in length and contains 1181 words. At that rate, the software is handling 126 wpm, which is far faster than most people can type. If you're one of the "hunt and peck" crowd, and you find yourself in a situation where you have to do a lot more typing, you might seriously consider trying speech recognition.
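For reference, the arithmetic behind those figures is straightforward. The sketch below uses the word count and running time quoted above, while the error count is a hypothetical placeholder chosen to illustrate a roughly 96% result rather than a measured value:

```cpp
// Back-of-the-envelope throughput and accuracy math for a dictation run.
// The word count and length match the recording described above; the error
// count is a hypothetical placeholder, not a measured result.
#include <iostream>

int main()
{
    const double words   = 1181.0;              // words in the dictated passage
    const double minutes = 9.0 + 21.0 / 60.0;   // 9:21 recording length
    const double errors  = 47.0;                // hypothetical number of misses

    const double wpm      = words / minutes;                // ~126 words per minute
    const double accuracy = 100.0 * (1.0 - errors / words); // ~96% word accuracy

    std::cout << "Throughput: " << wpm << " wpm\n"
              << "Accuracy:   " << accuracy << "%\n";
    return 0;
}
```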
Transcription Accuracy
Perhaps the fact that the transcription mode doesn't have to deal with commands and real-time interaction with the user helps improve accuracy. It may also be that reading a WAV file directly, as opposed to hearing it through a microphone, helps accuracy. Regardless, it's clear that the transcription mode offers better accuracy than any of the dictation modes. Measured by error count, transcribing a file is "100% more accurate" than dictating - it produces roughly half as many errors.
Realistically, transcription mode is only useful if you plan on dictating into a recording device while you're away from your computer. Otherwise, you simply spend time dictating a recording, have Dragon transcribe it, and then check for errors. The quality of your recording will also play a role, so if you're using a small portable music device with a tiny microphone, or if you're recording in a noisy environment, it's unlikely that you actually get better accuracy rates compared to sitting in front of a computer dictating into a headset.
There's also some question of how good the transcription mode would be at handling something like the minutes of a meeting, where you have numerous voices, accents, males and females, etc. Still, while you may not use the transcribe mode all that often, we would rather have it than not. Microsoft's speech SDK looks like it has the necessary hooks to allow transcription of a WAV file, but at present we were unable to find any utilities that take advantage of this feature.
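For the curious, a bare-bones transcription front end for Microsoft's engine wouldn't take much code. The sketch below is my own rough illustration against the SAPI 5.1 SDK - not a tested utility, and the file name is a placeholder - binding a WAV file to an in-process recognizer, activating a dictation grammar, and printing whatever the engine recognizes:

```cpp
// Rough sketch: transcribing a WAV file with the Microsoft Speech API (SAPI 5.x).
// Assumes the SAPI 5.1 SDK and a dictation-capable engine are installed; the
// file name is a placeholder and error handling is omitted for brevity.
#include <atlbase.h>
#include <sapi.h>
#include <sphelper.h>
#include <iostream>

int main()
{
    ::CoInitialize(NULL);
    {
        CComPtr<ISpRecognizer>  recognizer;
        CComPtr<ISpStream>      stream;
        CComPtr<ISpRecoContext> context;
        CComPtr<ISpRecoGrammar> grammar;

        // Use an in-process recognizer and feed it the WAV file instead of a microphone.
        recognizer.CoCreateInstance(CLSID_SpInprocRecognizer);
        SPBindToFile(L"dictation.wav", SPFM_OPEN_READONLY, &stream);
        recognizer->SetInput(stream, TRUE);

        // Ask for recognition and end-of-stream events, and activate plain dictation.
        recognizer->CreateRecoContext(&context);
        context->SetNotifyWin32Event();
        context->SetInterest(SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_END_SR_STREAM),
                             SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_END_SR_STREAM));
        context->CreateGrammar(0, &grammar);
        grammar->LoadDictation(NULL, SPLO_STATIC);
        grammar->SetDictationState(SPRS_ACTIVE);

        // Pull recognition results until the audio stream runs out.
        bool done = false;
        while (!done && context->WaitForNotifyEvent(INFINITE) == S_OK)
        {
            CSpEvent evt;
            while (evt.GetFrom(context) == S_OK)
            {
                if (evt.eEventId == SPEI_RECOGNITION)
                {
                    CSpDynamicString text;
                    evt.RecoResult()->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE,
                                              TRUE, &text, NULL);
                    std::wcout << (WCHAR*)text << L" ";
                }
                else if (evt.eEventId == SPEI_END_SR_STREAM)
                {
                    done = true;
                }
            }
        }
    }
    ::CoUninitialize();
    return 0;
}
```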
38 Comments
Googer - Saturday, April 22, 2006 - link
BMW 7 Series speech recognition is about 50-75% accurate (my guess), and some users have more luck with it than others.
Googer - Friday, April 21, 2006 - link
I think you should re-benchmark these on a system that is not overclocked. Overclocking may have contributed to erroneous test results. It is possible that some of the benchmarks could have been better on a normal system. Also, I am surprised this was not tested on an Intel system. Perhaps one of the programs may benefit from the NetBurst architecture, with or without dual core. Also, I would love to download the Dictation and Normal Voice WAV files, so I can understand the difference between them. Thanks for the article, it came at the perfect time; someone who is handicapped was asking me about this last night.
JarredWalton - Friday, April 21, 2006 - link
I'll see about putting up some MP3s of the wave files -- of course, that will open the door for all of you to make fun of how I speak. LOL
In case this wasn't entirely clear in the article, this was all done on my system that I use every day for work. It's overclocked, and it's been that way for six months. I run stress tests (Folding at Home -- on both cores) all the time. I would be very surprised if the overclock has done anything to affect accuracy, especially considering that I did run some tests on a couple of other systems that were not overclocked; I left them out of the article because they would have taken more time to include and didn't give me any new information.
It's pretty obvious that neither of these algorithms benefits from multiple processing cores -- HyperThreading, dual core, SMP, whatever. I also wasn't sure how much interest there would be in this topic, but if a lot of people want to know how this runs on Intel systems, I could go back and look at one. One thing worth noting is that SysMark 2004 does include Dragon NaturallySpeaking version 6.5 as one of the tests. Of course, the results are buried in the composite scores.
JarredWalton - Friday, April 21, 2006 - link
MP3 links available: http://www.anandtech.com/multimedia/showdoc.aspx?i...
Note that DNS only uses WAV files (AFAICT), but uploading 45MB WAV files seems pointless. Convert the MP3s back to WAV if you want to try them with Dragon.
Googer - Saturday, April 22, 2006 - link
Excellent job on the dictation/WAV files; you are a very good reader and have a nice, clear, and concise voice. (thumbs up)
stelleg151 - Friday, April 21, 2006 - link
Cool article. I hope that voice recognition continues to improve, for I think it could be incredibly useful in areas like HTPC, or, as you said, messaging while doing other things (gaming).
Zerhyn - Friday, April 21, 2006 - link
Have you ever tried out speech recognition and been underwhelmed? Do you yearn to play the role of Scotty and call out...?
PrinceGaz - Friday, April 21, 2006 - link
Yes, that was the first thing I noticed before I even started reading the article. Maybe they used speech-recognition software to enter that. I think they should have an editor (or at least let another contributor read what others have written) who has to approve an article before it goes live, as the current number of tyops is unforgiveable ;)
JarredWalton - Friday, April 21, 2006 - link
I'm doing my best to catch typos before anything goes live, but after being up all night trying to finish off this article, I went to post and realized I didn't have a title or intro. So, I put one in using Dragon, but my diction goes to pot when I'm tired, as does my eyesight and proofing ability. One typo in a 44-word intro (I didn't proof/edit it at all) isn't too bad for the software. Bad for me? Maybe, but mistakes do happpen. :)
johnsonx - Friday, April 21, 2006 - link
One nice thing about Dragon, despite the high CPU utilization shown in the article, is that it will run quite happily on very lowly systems. I have a customer who uses it all day long on Pentium III 850s with only 512MB of RAM (the max for those particular systems). The heaviest user there recently upgraded to a low-end Sempron64 with a gig of RAM, and he says the overall system is far more responsive (of course), but Dragon's operation isn't radically better; it worked great on the PIII, and it works great now.