Speech Recognition - Ready for Prime Time?
by Jarred Walton on April 21, 2006 9:00 AM EST
Closing Thoughts
At first glance, both of these speech recognition packages appear pretty reasonable. Dragon is more accurate in transcribe mode, but it also requires more processing time. Both also manage to offer better than 90% accuracy, but as stated earlier, that really isn't that great. Having used speech recognition for several months now, I would say that 95% accuracy is the bare minimum you want to achieve, and more is better. If you already have Microsoft Office 2003, the performance offered might be enough to keep you happy. I can't say that I would be happy with it, unfortunately.
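To put those percentages in perspective, here is a rough sketch of what they mean in terms of corrections. The 1,000-word document length and the function name are assumptions chosen for illustration, not figures from the testing.

```python
def expected_errors(word_count: int, accuracy: float) -> int:
    """Return the approximate number of misrecognized words.

    accuracy is a fraction between 0 and 1 (0.95 means 95%).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(word_count * (1.0 - accuracy))


# Hypothetical 1,000-word document at a few accuracy levels.
for accuracy in (0.90, 0.95, 0.99):
    errors = expected_errors(1000, accuracy)
    print(f"{accuracy:.0%} accuracy -> about {errors} corrections")
```

Going from 90% to 95% halves the number of corrections, which is why the last few percentage points matter so much in daily use.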
It may simply be that I started with Dragon NaturallySpeaking, but so far every time I've tried to use Microsoft's speech tool, I've been frustrated with the interface. Microsoft does appear to do better when you start speaking rapidly, but I generally only speak in a clause or at most a sentence at a time, and the Microsoft speech engine doesn't seem to do as well with that style of delivery.
Going back to my earlier analogy of the mouse wheel, I can't help but feel that this is much the same situation. Having experienced the way Dragon does things, the current Microsoft interface is a poor substitute. It almost feels as though the utility reached the point where it was "good enough" and has failed to progress from there. The accuracy is fine, but dealing with the errors and training the tool to properly recognize words for future use is unintuitive at best -- I didn't even have to crack the manual for DNS until I wanted to get into some specialized commands! That said, Windows Vista and the new Office Live are rumored to have better speech support, so I will definitely look into those in the future.
In case you were wondering, the vast majority of this article was written using Dragon NaturallySpeaking. There are still many errors being made by the software, but there are far more errors being made by the user. Basically, the software works best if you can think and speak in complete phrases/clauses, as that gives the software a better chance to recognize words from context. Any stuttering, slight pauses, slurring, etc. can dramatically impact the accuracy. It's also important to enunciate your words -- it will certainly help improve the accuracy if you can do so.
I find DNS works very well for my purposes, but I'm sure there are people out there that will find it less than ideal. There are also various language packs available, including several dialects of English, but I can't say how well any of them work from personal experience. I don't really feel that I can say a lot about getting the most out of speech recognition from Microsoft, so the remainder of my comments come from my use of Dragon NaturallySpeaking.
Thoughts on Dragon NaturallySpeaking
Note: I've added some additional commentary on the subject based on email conversations.
One of the questions that many people have is what's the best microphone setup to use? I use a Plantronics headset that I picked up online for about $30, and it hasn't given me any cause for concern. The soft padded earphones are definitely a plus if you're going to use your headset for long periods of time. I've also used a Logitech headset, and it worked fine, but it's not as comfortable for extended use. The location of the microphone -- somewhat closer to your ear than to your mouth -- also seems to make it less appropriate for noisy environments. There are nicer microphones out there, including models that feature active noise cancellation (ANC), and they could conceivably help -- especially in noisy environments. So far, I at least have not found them to be necessary for my needs.
As far as sound hardware goes, I've used integrated Realtek ALC655 audio, integrated Creative Live! 24-bit, and a discrete Creative Audigy 2 ZS card. The ALC655 definitely has some static and popping noises, but it didn't seem to affect speech recognition in any way that I could see. The quality of integrated audio varies by motherboard, of course, but I would suggest you try out whatever you currently have first. Whatever sound card/chipset you're using ought to be sufficient; if it's not, you could try a USB sound pod or upgrade to a nicer sound card that has a better signal-to-noise ratio. (My Audigy 2 ZS works great -- that's what was used for recording the sample audio files.)
There's also a recommendation to make sure you're in a quiet environment in order to get best results. I'm not exactly sure what qualifies as quiet, but my living room with an ambient noise level of 50 to 60 dB doesn't appear to present any difficulties. On the other hand, I did try using speech recognition in a data center that had an ambient noise level of over 75 dB. I received a warning that the noise level was too high during the microphone configuration process, but I did have the option to continue. I did so, but within minutes I had to admit defeat. I could either shout at my microphone, or else I could try to get by with less than 50% accuracy rates.
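For a sense of scale, decibels are logarithmic, so the gap between the living room and the data center is larger than the raw numbers suggest. A minimal sketch, assuming the readings are ordinary sound pressure levels and taking 55 dB as the midpoint of the living room range (both function names are made up for illustration):

```python
def intensity_ratio(db_loud: float, db_quiet: float) -> float:
    """How many times more acoustic power the louder level carries."""
    return 10 ** ((db_loud - db_quiet) / 10)


def pressure_ratio(db_loud: float, db_quiet: float) -> float:
    """How many times higher the sound pressure is at the louder level."""
    return 10 ** ((db_loud - db_quiet) / 20)


living_room_db = 55.0  # assumed midpoint of the 50 to 60 dB range
data_center_db = 75.0  # lower bound; the measured level was above this

print(f"intensity: {intensity_ratio(data_center_db, living_room_db):.0f}x")
print(f"pressure:  {pressure_ratio(data_center_db, living_room_db):.0f}x")
```

Under those assumptions, a 20 dB difference works out to roughly 100 times the acoustic power, which goes some way toward explaining why the microphone setup complained.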
I would say there's a reasonable chance that active noise cancellation microphones could handle such an environment, but I didn't have a need to actually make that investment. In fact, several SRS experts have emailed me and recommended ANC for exactly that sort of situation. It should allow you to use speech recognition even in noisy environments, and it can also help improve accuracy even in quieter environments. If you're serious about using such software, it's worth a look. However, few locations are as noisy as a data center -- trade shows certainly are close -- so most people won't have to worry about excessive noise. Most office environments should be fine, provided you don't mind people overhearing you.
The one area that continues to plague DNS in terms of recognition errors is acronyms. Some do fine, but there are many that continue to be recognized incorrectly, even after multiple attempts to train the software. For example, "Athlon 64 ex 240 800+" is what I get every time I want "Athlon 64 X2 4800+" -- that's when I say "Athlon 64 X2 forty-eight-hundred plus-sign". I can normally get the proper text if I say "Athlon 64 X2 four-thousand eight-hundred plus-sign", but I still frequently get "ask to" or "ax to" instead of "X2". My recent laptop review also generated a lot of "and/end 1710" instead of "M1710", despite my best attempts to train the software. There's a solution for this: creating custom "macros" for acronyms you use a lot. The only difficulty is that you have to spend the time initially to set up the macros, but for long-term use it's definitely recommended.
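The idea behind such macros can be pictured as a simple substitution table. This is not how DNS implements its custom commands; it is a hypothetical post-processing pass in Python, with the table entries taken from the misrecognitions above and every name invented for illustration.

```python
import re

# Hypothetical table: misrecognized phrase -> intended text.
CORRECTIONS = {
    "Athlon 64 ex 240 800+": "Athlon 64 X2 4800+",
    "ask to": "X2",
    "ax to": "X2",
    "and 1710": "M1710",
    "end 1710": "M1710",
}


def apply_corrections(text: str, table: dict = CORRECTIONS) -> str:
    """Replace each known misrecognition with its intended text.

    Longer phrases are tried first so a full product name wins over
    a shorter fragment inside it. Matching ignores case and only
    fires on whole phrases, not on parts of longer words.
    """
    for wrong in sorted(table, key=len, reverse=True):
        right = table[wrong]
        pattern = r"(?<!\w)" + re.escape(wrong) + r"(?!\w)"
        # A function replacement avoids any backslash interpretation.
        text = re.sub(pattern, lambda m, r=right: r, text, flags=re.IGNORECASE)
    return text


print(apply_corrections("The Athlon 64 ex 240 800+ beat the end 1710 laptop."))
```

A blind table like this has an obvious weakness: it would also rewrite a legitimate "ask to" in ordinary prose, so short entries need more care than long, distinctive ones.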
My final comment for now is that the functionality provided by Dragon NaturallySpeaking is far better in Word or DragonPad (the integrated rich text editor that comes with DNS) than in most other applications. If you're using Word or DragonPad, it's relatively simple to go back and correct errors without touching the keyboard. All you have to do is say "select XXX" where XXX is a word or phrase that occurs in the document. DNS will generally select the nearest occurrence of that phrase, and you're presented with a list of choices for correcting it, or you can just speak the new text you want to replace the original text. This is one of those intuitive things I was talking about that Microsoft currently lacks.
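The "nearest occurrence" behavior is easy to picture as a search outward from the cursor. The sketch below is a guess at the general idea, not DNS's actual algorithm; the function name and the tie-breaking rule (prefer the earlier match) are assumptions.

```python
def select_nearest(document: str, phrase: str, cursor: int):
    """Return (start, end) of the occurrence of phrase closest to cursor.

    Returns None if the phrase does not occur. Matching ignores case,
    and distance is measured from the cursor to the start of each match.
    Assumes lowercasing preserves string length (true for ASCII text).
    """
    haystack = document.lower()
    needle = phrase.lower()
    if not needle:
        return None

    best = None
    start = haystack.find(needle)
    while start != -1:
        # Strict comparison keeps the earlier match when distances tie.
        if best is None or abs(start - cursor) < abs(best - cursor):
            best = start
        start = haystack.find(needle, start + 1)

    if best is None:
        return None
    return best, best + len(needle)


text = "the cat sat; the dog ran; the bird flew"
print(select_nearest(text, "the", cursor=len(text)))
```

With the cursor at the end of the text, as it usually is while dictating, the match closest to the end is the one selected, which fits the common case of correcting something that was just spoken.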
There are a few problems with this system. The biggest is that outside of Word and DragonPad, the instant you touch the keyboard or switch to another application, DNS loses all knowledge about any of the previous text. This happens a lot with web browsers and instant messaging clients -- I surf almost exclusively with Firefox, so I can't say whether or not this holds true for Internet Explorer. Another problem is that sometimes the selection gets off by one character, so you end up deleting the last character of the previous word and getting an extra space on the other side. (This only happens outside of Word/DragonPad, as far as I can tell.) I've also had a few occasions where the system goes into "slow motion" when I try to make a correction: the text will start to be selected one character at a time, at a rate of about two characters per second, and then once all the text is selected it has to be unselected again one character at a time. If I'm trying to select a larger phrase, I'm best off just walking away from the computer for a few minutes. (Screaming and yelling at my system doesn't help, unfortunately.) Thankfully, both of those glitches only occur rarely.
Hopefully, some of you will have found this article to be an interesting look at a technology that has continued to improve through the years. It's still not perfect, but speech recognition software has become a regular part of my daily routine. There are certainly people out there who type more than I do, and I would definitely recommend that many of them take a look at speech recognition software. If you happen to be experiencing some RSI/carpal tunnel issues caused by typing, that recommendation increases a hundredfold. I'm certainly no doctor, but the expression "No pain, no gain" isn't always true; some types of pain are your body's way of telling you to knock it off.
If you have any further questions about my experience with speech recognition, send them my way. I don't think I'm a bona fide expert on the subject, but I'll be happy to offer some help if I can.
38 Comments
JarredWalton - Friday, April 21, 2006 - link
That's definitely true -- if you look at how accuracy scales with CPU usage, doubling and even tripling the processor time comes with only incremental increases in accuracy. I do have to say that I noticed it being a little sluggish on my single core system when I was multitasking, but obviously I push my computers a little harder than a lot of people. Depending on what you're willing to live with in terms of speed, I'm sure both Dragon and Microsoft speech recognition can work on a Pentium III level system.
LanceM - Friday, April 21, 2006 - link
So is that selection typical Asimov? If so, it has convinced me to never bother reading any of his works.
His ideas/plots/etc. may be interesting, but I don't think I could handle phrases like, "as if she were some dried-up, old-maid teacher." Give me Joseph Conrad or William Faulkner.
Dfere - Monday, April 24, 2006 - link
Asimov is classic Sci-Fi pulp, which usually had a gritty detective-novel appeal. His works are in large part murder mystery type novels. You have to understand the nature of the literature, the history and the author. I don't think a critique is deserved until then.
Most Sci Fi writers of any ability first master imaginative concepts and apply them, even Drake and Stirling.
I give kudos to the staff for including literary comments; the poster who said this should not be a book of the month club lives a very one dimensional life.
Shoal07 - Friday, April 21, 2006 - link
What makes Asimov special is many of his ideas in sci fiction are coming true today or are at least on the horizon. Asimov shaped the way many of us picture the future.
goinginstyle - Friday, April 21, 2006 - link
Why does the Anandtech staff revert to literary quotes in their reviews now? This is a computer website, not a book club.
JarredWalton - Friday, April 21, 2006 - link
I read Asimov's Foundation series as a teenager, and I loved it. He gave me lots of fanciful dreams about where technology might go in the future, and even though some of the writing styles have changed over the years, I still find a lot of these old sci-fi books to be entertaining. You should try reading War of the Worlds if you think that quote was bad. LOL
Sorry if some of you didn't like the quote. Everyone has their own dislikes and likes, but in the end it's just an introduction. I hope to one day be able to yell at my computer and have it properly understand what I say, as well as the context (i.e., yelling means something is going wrong, and maybe it can help me out). Will we ever get there? Probably some day, but whether it happens in our lifetimes or not is anyone's guess.
NegativeEntropy - Saturday, April 22, 2006 - link
I like the use of quotes -- though it does remind me a bit of being in English/writing class ("Always do something in the introduction to get your audience's attention...").
On the subject of "classic" Sci-fi writers, I also still enjoy old school Heinlein. Though his characters can get a bit repetitive across his pile of works, many of the science ideas are still valid (and I share much of his apparent personal philosophy).
On the actual article -- thanks for doing it. I have been curious where this technology was at in terms of every day usage and hardware requirements.
Regarding CPU usage, it's possible DNS attempts to use whatever resources are available based on preferences. i.e. on minimum, it attempts to impact the system minimally, regardless of the CPU resources available; say 25% on min, 50% on med and 95% on max with the percentage staying relatively consistent from a P3 1GHz to an A64 2.6GHz. This would explain its reported good scaling from system to system. If you want to test it, underclock your A64 system to half its frequency and compare utilization at the medium setting.
kristof007 - Friday, April 21, 2006 - link
Here at Anandtech you can always count on to find something else. Great article! I tried out speech recognition a few years back and I got frustrated with it over one thing or another so I just dropped it and went back to typing. I've been typing for about 8 years now. I never learned the "proper" way to type where every finger has a spot. Anyway I hope Vista will make speech recognition WAAY better so that it could be used around the OS AND for speech recognition.
Thanks for the article!