Speech Recognition - Ready for Prime Time?
by Jarred Walton on April 21, 2006 9:00 AM EST
Closing Thoughts
At first glance, both of these speech recognition packages appear pretty reasonable. Dragon is more accurate in transcribe mode, but it also requires more processing time. Both also manage to offer better than 90% accuracy, but as stated earlier that really isn't that great. Having used speech-recognition for several months now, I would say that 95% accuracy is the bare minimum you want to achieve, and more is better. If you already have Microsoft Office 2003, the performance offered might be enough to keep you happy. I can't say that I would be happy with it, unfortunately.
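To put those percentages in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes accuracy is measured per word and that every misrecognized word needs one correction; the function name `expected_errors` and the 1,000-word document length are illustrative choices, not figures from the testing.

```python
def expected_errors(word_count: int, accuracy: float) -> int:
    """Return the expected number of misrecognized words.

    Assumes per-word accuracy (an assumption for illustration), so the
    error count is simply word_count * (1 - accuracy), rounded.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(word_count * (1.0 - accuracy))


if __name__ == "__main__":
    # Compare a few accuracy levels for a hypothetical 1,000-word article.
    for accuracy in (0.90, 0.95, 0.99):
        errors = expected_errors(1000, accuracy)
        print(f"{accuracy:.0%} accuracy: about {errors} corrections per 1,000 words")
```

Under those assumptions, moving from 90% to 95% halves the number of corrections, which is why a few percentage points matter more than they first appear to.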
It may simply be that I started with Dragon NaturallySpeaking, but so far every time I've tried to use Microsoft's speech tool, I've been frustrated with the interface. Microsoft does appear to do better when you start speaking rapidly, but I generally only speak in a clause or at most a sentence at a time, and the Microsoft speech engine doesn't seem to do as well with that style of delivery.
Going back to my earlier analogy of the mouse wheel, I can't help but feel that it's something of the same thing. Having experienced the way Dragon does things, the current Microsoft interface is a poor substitute. It almost feels as though the utility reached the point where it was "good enough" and has failed to progress from there. The accuracy is fine, but dealing with the errors and training the tool to properly recognize words for future use is unintuitive at best - I didn't even have to crack the manual for DNS until I wanted to get into some specialized commands! That said, Windows Vista and the new Office Live are rumored to have better speech support, so I will definitely look into those in the future.
In case you were wondering, the vast majority of this article was written using Dragon NaturallySpeaking. There are still many errors being made by the software, but there are far more errors being made by the user. Basically, the software works best if you can think and speak in complete phrases/clauses, as that gives the software a better chance to recognize words from context. Any stuttering, slight pauses, slurring, etc. can dramatically impact the accuracy. It's also important to enunciate your words -- it will certainly help improve the accuracy if you can do so.
I find DNS works very well for my purposes, but I'm sure there are people out there that will find it less than ideal. There are also various language packs available, including several dialects of English, but I can't say how well any of them work from personal experience. I don't really feel that I can say a lot about getting the most out of speech recognition from Microsoft, so the remainder of my comments come from my use of Dragon NaturallySpeaking.
Thoughts on Dragon NaturallySpeaking
Note: I've added some additional commentary on the subject based on email conversations.
One of the questions that many people have is what's the best microphone setup to use? I use a Plantronics headset that I picked up online for about $30, and it hasn't given me any cause for concern. The soft padded earphones are definitely a plus if you're going to use your headset for long periods of time. I've also used a Logitech headset, and it worked fine, but it's not as comfortable for extended use. The location of the microphone -- somewhat closer to your ear than to your mouth -- also seems to make it less appropriate for noisy environments. There are nicer microphones out there, including models that feature active noise cancellation (ANC), and they could conceivably help -- especially in noisy environments. So far, I at least have not found them to be necessary for my needs.
As far as sound hardware goes, I've used integrated Realtek ALC655 audio, integrated Creative Live! 24-bit, and a discrete Creative Audigy 2 ZS card. The ALC655 definitely has some static and popping noises, but it didn't seem to affect speech recognition in any way that I could see. The quality of integrated audio varies by motherboard, of course, but I would suggest you try out whatever you currently have first. Whatever sound card/chipset you're using ought to be sufficient; if it's not, you could try a USB sound pod or upgrade to a nicer sound card that has a better signal-to-noise ratio. (My Audigy 2 ZS works great -- that's what was used for recording the sample audio files.)
There's also a recommendation to make sure you're in a quiet environment in order to get best results. I'm not exactly sure what qualifies as quiet, but my living room with an ambient noise level of 50 to 60 dB doesn't appear to present any difficulties. On the other hand, I did try using speech recognition in a data center that had an ambient noise level of over 75 dB. I received a warning that the noise level was too high during the microphone configuration process, but I did have the option to continue. I did so, but within minutes I had to admit defeat. I could either shout at my microphone, or else I could try to get by with less than 50% accuracy rates.
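For a sense of how large that gap is, the sketch below converts a decibel difference into a sound intensity ratio using the standard relationship that every 10 dB corresponds to a tenfold change in intensity. It assumes the figures quoted above are ordinary sound level readings; how they were measured isn't stated, and perceived loudness does not scale the same way, so treat the output as a rough guide only. The function name `intensity_ratio` is illustrative.

```python
def intensity_ratio(louder_db: float, quieter_db: float) -> float:
    """Return how many times more intense the louder level is.

    Uses the standard decibel definition: a difference of 10 dB is a
    factor of 10 in sound intensity (power).
    """
    return 10 ** ((louder_db - quieter_db) / 10)


if __name__ == "__main__":
    # Data center (75 dB) versus the quiet and loud ends of the living room range.
    for living_room_db in (50, 60):
        ratio = intensity_ratio(75, living_room_db)
        print(f"75 dB vs {living_room_db} dB: roughly {ratio:.0f}x the intensity")
```

Even against the noisier end of the living room range, the data center is well over an order of magnitude more intense, which fits with the software's warning during microphone setup.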
I would say there's a reasonable chance that active noise cancellation microphones could handle such an environment, but I didn't have a need to actually make that investment. In fact, several SRS experts have emailed me and recommended ANC for exactly that sort of situation. It should allow you to use speech recognition even in noisy environments, and it can also help improve accuracy even in quieter environments. If you're serious about using such software, it's worth a look. However, few locations are as noisy as a data center -- trade shows certainly are close -- so most people won't have to worry about excessive noise. Most office environments should be fine, provided you don't mind people overhearing you.
The one area that continues to plague DNS in terms of recognition errors is acronyms. Some do fine, but there are many that continue to be recognized incorrectly, even after multiple attempts to train the software. For example, "Athlon 64 ex 240 800+" is what I get every time I want "Athlon 64 X2 4800+" -- that's when I say "Athlon 64 X2 forty-eight-hundred plus-sign". I can normally get the proper text if I say "Athlon 64 X2 four-thousand eight-hundred plus-sign", but I still frequently get "ask to" or "ax to" instead of "X2". My recent laptop review also generated a lot of "and/end 1710" instead of "M1710", despite my best attempts to train the software. There's a solution for this: creating custom "macros" for acronyms you use a lot. The only difficulty is that you have to spend the time initially to set up the macros, but for long-term use it's definitely recommended.
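The idea behind such macros can be illustrated with a simple substitution table. The Python sketch below is not how DNS implements its custom commands, and it doesn't use any DNS interface; it is a hypothetical post-processing step that cleans up dictated text after the fact, using the misrecognitions quoted above. The names `CORRECTIONS` and `apply_corrections` are made up for illustration.

```python
import re

# Hypothetical table: misrecognized text -> intended text.
# Entries come from the examples in the article; this is an
# illustration only, not DNS's actual macro format.
CORRECTIONS = {
    "Athlon 64 ex 240 800+": "Athlon 64 X2 4800+",
    "and 1710": "M1710",
    "end 1710": "M1710",
}


def apply_corrections(text: str, corrections: dict = CORRECTIONS) -> str:
    """Replace known misrecognitions with the intended text."""
    # Try longer phrases first so a long entry wins over a shorter one.
    keys = sorted(corrections, key=len, reverse=True)
    # The lookarounds keep a phrase from matching inside a longer word.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: corrections[m.group(0)], text)


if __name__ == "__main__":
    print(apply_corrections("The Athlon 64 ex 240 800+ is in the end 1710."))
```

Short, common phrases such as "ask to" are deliberately left out of the table: a blanket replacement would also rewrite legitimate uses, which is one reason a spoken macro that inserts the right text in the first place is the safer approach.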
My final comment for now is that the functionality provided by Dragon NaturallySpeaking is far better in Word or DragonPad (the integrated rich text editor that comes with DNS) than in most other applications. If you're using Word or DragonPad, it's relatively simple to go back and correct errors without touching the keyboard. All you have to do is say "select XXX" where XXX is a word or phrase that occurs in the document. DNS will generally select the nearest occurrence of that phrase, and you're presented with a list of choices for correcting it, or you can just speak the new text you want to replace the original text. This is one of those intuitive things I was talking about that Microsoft currently lacks.
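As a rough illustration of what "nearest occurrence" could mean, the Python sketch below picks the match closest to the cursor. This is a guess at the behavior, not DNS's actual algorithm: it assumes distance is counted in characters from the cursor position and that matching is case-sensitive. The function name `find_nearest_occurrence` is hypothetical.

```python
from typing import Optional, Tuple


def find_nearest_occurrence(
    text: str, phrase: str, cursor: int
) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the occurrence of phrase closest to cursor.

    Returns None if the phrase is empty or does not occur in the text.
    """
    if not phrase:
        return None
    best = None  # (distance, start, end) of the closest match so far
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        if start <= cursor <= end:
            distance = 0  # cursor sits inside or at the edge of this match
        else:
            distance = min(abs(cursor - start), abs(cursor - end))
        if best is None or distance < best[0]:
            best = (distance, start, end)
        start = text.find(phrase, start + 1)
    return None if best is None else (best[1], best[2])


if __name__ == "__main__":
    document = "the cat sat on the mat near the door"
    # With the cursor at the end of the text, the last "the" is selected.
    print(find_nearest_occurrence(document, "the", len(document)))
```

Because dictation usually leaves the cursor at the end of the text, a rule like this would tend to select the most recently spoken occurrence, which matches how the correction workflow feels in practice.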
There are a few problems with this system. The biggest is that outside of Word and DragonPad, the instant you touch the keyboard or switch to another application, DNS loses all knowledge about any of the previous text. This happens a lot with web browsers and instant messaging clients -- I surf almost exclusively with Firefox, so I can't say whether or not this holds true for Internet Explorer. Another problem is that sometimes the selection gets off by one character, so you end up deleting the last character of the previous word and getting an extra space on the other side. (This only happens outside of Word/DragonPad, as far as I can tell.) I've also had a few occasions where the system goes into "slow motion" when I try to make a correction: the text will start to be selected one character at a time, at a rate of about two characters per second, and then once all the text is selected it has to be unselected again one character at a time. If I'm trying to select a larger phrase, I'm best off just walking away from the computer for a few minutes. (Screaming and yelling at my system doesn't help, unfortunately.) Thankfully, both of those glitches only occur rarely.
Hopefully, some of you will have found this article to be an interesting look at a technology that has continued to improve through the years. It's still not perfect, but speech recognition software has become a regular part of my daily routine. There are certainly people out there who type more than I do, and I would definitely recommend that many of them take a look at speech recognition software. If you happen to be experiencing some RSI/carpal tunnel issues caused by typing, that recommendation increases a hundredfold. I'm certainly no doctor, but the expression "No pain, no gain" isn't always true; some types of pain are your body's way of telling you to knock it off.
If you have any further questions about my experience with speech recognition, send them my way. I don't think I'm a bona fide expert on the subject, but I'll be happy to offer some help if I can.
38 Comments
FrankyJunior - Sunday, April 30, 2006
For anyone that wants to try Dragon, I just noticed that the preferred version is in the CompUSA ad today for $99. Never would have looked twice at it if I hadn't read this article yesterday.
NullSubroutine - Thursday, April 27, 2006
are we to the day when i say 'computer' and it does what i want, and when i time travel by going around the sun ill be confused when they hand me a mouse and keyboard when wanting to use a computer?
JarredWalton - Thursday, April 27, 2006
Almost. And if you go around the sun *backwards* you can travel through time in the other direction. :D
quanta - Tuesday, April 25, 2006
How about a review based on VoiceBox Technologies (http://www.voicebox.com) products? It was demonstrated on Discovery Channel, and it seems to work without extensive voice training, and it actually _understands_ human speech. The Discovery Channel segment can be found here: http://www.exn.ca/dailyplanet/view.asp?date=3/13/2...
rico - Tuesday, April 25, 2006
Where did you find Dragon Pro for $160? I thought it usually cost about $800. Thanks.
JarredWalton - Tuesday, April 25, 2006
Heh, sorry - got "Preferred" and "Professional" mixed up. I'm not entirely sure what Pro includes, i.e. "Comes with a full set of network deployment tools."
Trying to surf through Nuance's site is a bit tricky, and finding prices takes some effort as well. I think the only difference between Standard and Preferred is the ability to transcribe recordings in Preferred - can anyone confirm for sure? I asked Nuance and didn't get a reply.
Tabah - Sunday, April 23, 2006
Excellent article/review. Here's the question I've been wondering. Personally I use DNS for blogging and generally anything that requires excessive typing. A friend of mine on the other hand swears by IBM ViaVoice. Any chance we could get a comparison article/review at a later date?
JarredWalton - Tuesday, April 25, 2006
I will try to get in touch with IBM. I'm sure they wouldn't mind participating in a follow-up article.
Tabah - Tuesday, April 25, 2006
Oddly enough ViaVoice is licensed by Nuance so you might have a better chance talking to them. The main reason I'd like to see a comparison between VV and DNS isn't so much because they're made/released by the same company, but because of the cost difference between them. Like I said before I really like DNS but VV at the high end (VV Pro USB vs DNS Pro) is still a few hundred dollars cheaper.
Poser - Sunday, April 23, 2006
Listening to the dictation files, I was amazed that all the punctuation was spoken. I would have expected that they would (or could) be replaced by using a non-speech sound. Something along the lines of a click of the tongue for a comma -- there's a good number of distinct sounds you can make with your tongue that we don't have words for but that anyone could recognize and make. Think of "The Gods Must be Crazy" and the language used by the Kalahari bushmen for an extreme example.
Also, thanks for the article, it was really interesting and potentially very helpful! I'll hold off until Vista hits and I see some comparisons, but I'm certain now that I'll end up using one of the two.