Speech Recognition - Ready for Prime Time?
by Jarred Walton on April 21, 2006 9:00 AM EST
Closing Thoughts
At first glance, both of these speech recognition packages appear pretty reasonable. Dragon is more accurate in transcribe mode, but it also requires more processing time. Both manage to offer better than 90% accuracy, but as stated earlier, that really isn't that great. Having used speech recognition for several months now, I would say that 95% accuracy is the bare minimum you want to achieve, and more is better. If you already have Microsoft Office 2003, the performance offered might be enough to keep you happy. I can't say that I would be happy with it, unfortunately.
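To put those accuracy figures in perspective, here is a quick back-of-the-envelope sketch. It assumes accuracy is measured per word and that every misrecognized word needs one correction; both are simplifying assumptions, and the function name is purely illustrative.

```python
def expected_errors(word_count: int, accuracy: float) -> int:
    """Expected number of misrecognized words at a given per-word accuracy."""
    return round(word_count * (1.0 - accuracy))


# Compare the accuracy levels discussed above for a 1000-word document.
for accuracy in (0.90, 0.95, 0.99):
    errors = expected_errors(1000, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors} corrections per 1000 words")
```

Under those assumptions, going from 90% to 95% accuracy halves the number of corrections, which is why a few percentage points matter so much in daily use.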
It may simply be that I started with Dragon NaturallySpeaking, but so far every time I've tried to use Microsoft's speech tool, I've been frustrated with the interface. Microsoft does appear to do better when you start speaking rapidly, but I generally only speak in a clause or at most a sentence at a time, and the Microsoft speech engine doesn't seem to do as well with that style of delivery.
Going back to my earlier analogy of the mouse wheel, I can't help but feel that it's something of the same thing. Having experienced the way Dragon does things, the current Microsoft interface is a poor substitute. It almost feels as though the utility reached the point where it was "good enough" and has failed to progress from there. The accuracy is fine, but dealing with the errors and training the tool to properly recognize words for future use is unintuitive at best - I didn't even have to crack the manual for DNS until I wanted to get into some specialized commands! That said, Windows Vista and the new Office Live are rumored to have better speech support, so I will definitely look into those in the future.
In case you were wondering, the vast majority of this article was written using Dragon NaturallySpeaking. There are still many errors being made by the software, but there are far more errors being made by the user. Basically, the software works best if you can think and speak in complete phrases/clauses, as that gives the software a better chance to recognize words from context. Any stuttering, slight pauses, slurring, etc. can dramatically impact the accuracy. It's also important to enunciate your words -- it will certainly help improve the accuracy if you can do so.
I find DNS works very well for my purposes, but I'm sure there are people out there that will find it less than ideal. There are also various language packs available, including several dialects of English, but I can't say how well any of them work from personal experience. I don't really feel that I can say a lot about getting the most out of speech recognition from Microsoft, so the remainder of my comments come from my use of Dragon NaturallySpeaking.
Thoughts on Dragon NaturallySpeaking
Note: I've added some additional commentary on the subject based on email conversations.
One of the questions that many people have is what's the best microphone setup to use? I use a Plantronics headset that I picked up online for about $30, and it hasn't given me any cause for concern. The soft padded earphones are definitely a plus if you're going to use your headset for long periods of time. I've also used a Logitech headset, and it worked fine, but it's not as comfortable for extended use. The location of the microphone -- somewhat closer to your ear than to your mouth -- also seems to make it less appropriate for noisy environments. There are nicer microphones out there, including models that feature active noise cancellation (ANC), and they could conceivably help -- especially in noisy environments. So far, I at least have not found them to be necessary for my needs.
As far as sound hardware goes, I've used integrated Realtek ALC655 audio, integrated Creative Live! 24-bit, and a discrete Creative Audigy 2 ZS card. The ALC655 definitely has some static and popping noises, but it didn't seem to affect speech recognition in any way that I could see. The quality of integrated audio varies by motherboard, of course, but I would suggest you try out whatever you currently have first. Whatever sound card/chipset you're using ought to be sufficient; if it's not, you could try a USB sound pod or upgrade to a nicer sound card that has a better signal-to-noise ratio. (My Audigy 2 ZS works great -- that's what was used for recording the sample audio files.)
There's also a recommendation to make sure you're in a quiet environment in order to get best results. I'm not exactly sure what qualifies as quiet, but my living room with an ambient noise level of 50 to 60 dB doesn't appear to present any difficulties. On the other hand, I did try using speech recognition in a data center that had an ambient noise level of over 75 dB. I received a warning that the noise level was too high during the microphone configuration process, but I did have the option to continue. I did so, but within minutes I had to admit defeat. I could either shout at my microphone, or else I could try to get by with less than 50% accuracy rates.
I would say there's a reasonable chance that active noise cancellation microphones could handle such an environment, but I didn't have a need to actually make that investment. In fact, several SRS experts have emailed me and recommended ANC for exactly that sort of situation. It should allow you to use speech recognition even in noisy environments, and it can also help improve accuracy even in quieter environments. If you're serious about using such software, it's worth a look. However, few locations are as noisy as a data center -- trade shows certainly are close -- so most people won't have to worry about excessive noise. Most office environments should be fine, provided you don't mind people overhearing you.
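For a sense of scale on the noise figures quoted above (50 to 60 dB in a living room versus over 75 dB in a data center), keep in mind that decibels are logarithmic. The sketch below converts a dB difference into a ratio of sound intensity. It assumes both readings were taken on the same scale and weighting, which the measurements above do not specify, and the function name is made up for illustration.

```python
def intensity_ratio(level_a_db: float, level_b_db: float) -> float:
    """How many times more intense sound A is than sound B.

    Every 10 dB of difference corresponds to a factor of 10 in intensity.
    """
    return 10 ** ((level_a_db - level_b_db) / 10)


# Data center (75 dB) versus the quiet and loud ends of the living room range.
print(intensity_ratio(75, 50))  # versus 50 dB
print(intensity_ratio(75, 60))  # versus 60 dB
```

In other words, a gap of 15 to 25 dB is not a modest increase; it means the background noise carries tens to hundreds of times more acoustic energy.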
The one area that continues to plague DNS in terms of recognition errors is acronyms. Some do fine, but there are many that continue to be recognized incorrectly, even after multiple attempts to train the software. For example, "Athlon 64 ex 240 800+" is what I get every time I want "Athlon 64 X2 4800+" -- that's when I say "Athlon 64 X2 forty-eight-hundred plus-sign". I can normally get the proper text if I say "Athlon 64 X2 four-thousand eight-hundred plus-sign", but I still frequently get "ask to" or "ax to" instead of "X2". My recent laptop review also generated a lot of "and/end 1710" instead of "M1710", despite my best attempts to train the software. There's a solution for this: creating custom "macros" for acronyms you use a lot. The only difficulty is that you have to spend the time initially to set up the macros, but for long-term use it's definitely recommended.
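The idea behind such macros can be sketched in a few lines. This is not Dragon's actual macro facility or file format; it is only an illustration of the concept, where a spoken trigger phrase that the recognizer handles reliably is expanded into the exact text you want. The trigger phrases and the `expand_macros` helper are hypothetical.

```python
import re

# Hypothetical table: easily recognized trigger phrase -> exact text to insert.
MACROS = {
    "athlon x2 forty-eight hundred": "Athlon 64 X2 4800+",
    "m seventeen ten": "M1710",
}


def expand_macros(text: str, macros: dict = MACROS) -> str:
    """Replace each trigger phrase (case-insensitive, whole words) with its expansion."""
    # Longest triggers first so a longer phrase wins over a shorter one it contains.
    for trigger in sorted(macros, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(trigger) + r"\b", re.IGNORECASE)
        # A function replacement inserts the expansion literally, with no escape handling.
        text = pattern.sub(lambda match, expansion=macros[trigger]: expansion, text)
    return text


print(expand_macros("The M seventeen ten ships with an athlon x2 forty-eight hundred."))
```

The up-front cost is the same as with real macros: each entry has to be set up once, after which the troublesome product name comes out right every time.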
My final comment for now is that the functionality provided by Dragon NaturallySpeaking is far better in Word or DragonPad (the integrated rich text editor that comes with DNS) than in most other applications. If you're using Word or DragonPad, it's relatively simple to go back and correct errors without touching the keyboard. All you have to do is say "select XXX" where XXX is a word or phrase that occurs in the document. DNS will generally select the nearest occurrence of that phrase, and you're presented with a list of choices for correcting it, or you can just speak the new text you want to replace the original text. This is one of those intuitive things I was talking about that Microsoft currently lacks.
There are a few problems with this system. The biggest is that outside of Word and DragonPad, the instant you touch the keyboard or switch to another application, DNS loses all knowledge about any of the previous text. This happens a lot with web browsers and instant messaging clients -- I surf almost exclusively with Firefox, so I can't say whether or not this holds true for Internet Explorer. Another problem is that sometimes the selection gets off by one character, so you end up deleting the last character of the previous word and getting an extra space on the other side. (This only happens outside of Word/DragonPad, as far as I can tell.) I've also had a few occasions where the system goes into "slow motion" when I try to make a correction: the text will start to be selected one character at a time, at a rate of about two characters per second, and then once all the text is selected it has to be unselected again one character at a time. If I'm trying to select a larger phrase, I'm best off just walking away from the computer for a few minutes. (Screaming and yelling at my system doesn't help, unfortunately.) Thankfully, both of those glitches only occur rarely.
Hopefully, some of you will have found this article to be an interesting look at a technology that has continued to improve through the years. It's still not perfect, but speech recognition software has become a regular part of my daily routine. There are certainly people out there who type more than I do, and I would definitely recommend that many of them take a look at speech recognition software. If you happen to be experiencing some RSI/carpal tunnel issues caused by typing, that recommendation increases a hundredfold. I'm certainly no doctor, but the expression "No pain, no gain" isn't always true; some types of pain are your body's way of telling you to knock it off.
If you have any further questions about my experience with speech recognition, send them my way. I don't think I'm a bona fide expert on the subject, but I'll be happy to offer some help if I can.
38 Comments
Googer - Saturday, April 22, 2006 - link
BMW 7 series Speech recognition is about 50-75% accurate (my guess) and some users have more luck with it than others.
Googer - Friday, April 21, 2006 - link
I think you should re-benchmark these on a system that is not overclocked. Overclocking may have contributed to erroneous test results. It is possible that some of the benchmarks could have been better on a normal system. Also I am surprised this was not tested on an Intel system. Perhaps one of the programs may benefit from the NetBurst architecture with or without dual core.
Also I would love to download the Dictation and Normal Voice wav files, so I can understand the difference between them. Thanks for the article, it came at the perfect time; someone who is handicapped was asking me about this last night.
JarredWalton - Friday, April 21, 2006 - link
I'll see about putting up some MP3s of the wave files -- of course, that will open the door for all of you to make fun of how I speak. LOL
In case this wasn't entirely clear in the article, this was all done on my system that I use every day for work. It's overclocked, and it's been that way for six months. I run stress tests (Folding at Home -- on both cores) all the time. I would be very surprised if the overclock has done anything to affect accuracy, especially considering that I did run some tests on a couple other systems that were not overclocked, and basically removed them from this article because they would have simply taken more time to put in the article, and they didn't give me any new information.
It's pretty obvious that neither of these algorithms benefit from multiple processing cores -- HyperThreading, dual core, SMP, whatever. I also wasn't sure how much interest there would be from people in this topic, but if a lot of people want to know how this runs on Intel systems I could go back and look at one. One thing worth noting is that SysMark 2004 does include Dragon NaturallySpeaking version 6.5 as one of the tests. Of course, the results are buried in the composite scores.
JarredWalton - Friday, April 21, 2006 - link
MP3 links available: http://www.anandtech.com/multimedia/showdoc.aspx?i...
Note that DNS only uses WAV files (AFAICT), but uploading 45MB WAV files seems pointless. Convert them to WAVs if you want to try them with Dragon.
Googer - Saturday, April 22, 2006 - link
Excellent job on the dictation/wav files, you are a very good reader and have a nice clear and concise voice. ;ThumbsUP)
stelleg151 - Friday, April 21, 2006 - link
Cool article. I hope that voice recognition continues to improve, for I think it could be incredibly useful for areas like HTPC, or as you said messaging while doing other things (gaming).
Zerhyn - Friday, April 21, 2006 - link
Have you ever tried out speech recognition and been underwhelmed? To you yearn to play the role of Scotty and call out..?
PrinceGaz - Friday, April 21, 2006 - link
Yes, that was the first thing I noticed before I even started reading the article. Maybe they used speech-recognition software to enter that.
I think they should have an editor (or at least let another contributor read what others have written) who has to approve an article before it goes live as the current number of tyops is unforgiveable ;)
JarredWalton - Friday, April 21, 2006 - link
I'm doing my best to catch typos before anything goes live, but after being up all night trying to finish off this article, I went to post and realized I didn't have a title or intro. So, I put one in using Dragon, but my diction goes to put when I'm tired, as does my eyesight and proofing ability. One typo in a 44 word intro (I didn't proof/edit it at all) isn't too bad for the software. Bad for me? Maybe, but mistakes do happpen. :)
johnsonx - Friday, April 21, 2006 - link
One nice thing about Dragon, despite the high CPU utilization shown in the article, is that it will run quite happily with very lowly systems. I have a customer who uses it all day long on PentiumIII-850's with only 512Mb RAM (the max for those particular systems). The heaviest user there recently upgraded to a low-end Sempron64 with a gig of RAM, and he says the overall system is far more responsive (of course), but Dragon's operation isn't radically better; it worked great on the PIII, and works great now.