Speech Recognition - Ready for Prime Time?
by Jarred Walton on April 21, 2006 9:00 AM EST
Processor Utilization - Precise Dictation
Now that we have some idea of the accuracy these solutions offer, what sort of CPU requirements are we looking at? As with our accuracy charts, we've got a separate section looking at the processor utilization of Dragon NaturallySpeaking when transcribing a WAV file. Below are screenshots of Windows Task Manager showing CPU usage during dictation. In retrospect, finding a utility to track average CPU utilization over time would have been more useful, but these screenshots should suffice for our purposes.
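For readers who want the averaged numbers the screenshots can't provide, here is a minimal sketch of how such a tracking utility could compute them. It is not a tool used for this article. It assumes you can already log pairs of (wall-clock seconds, cumulative CPU seconds consumed by the speech engine); the function names and the sample readings are hypothetical. Utilization is expressed the way Task Manager's overall figure is, as a share of all cores, so one fully busy core on a dual-core system reads about 50%.

```python
from typing import List, Sequence, Tuple

# Each sample is (wall_clock_seconds, cumulative_cpu_seconds).
Sample = Tuple[float, float]


def utilization_series(samples: Sequence[Sample], n_cores: int = 1) -> List[float]:
    """Per-interval utilization in percent; 100 means every core was busy."""
    series = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order timestamps
        series.append(100.0 * (c1 - c0) / (elapsed * n_cores))
    return series


def average_utilization(samples: Sequence[Sample], n_cores: int = 1) -> float:
    """Time-weighted average utilization in percent over the whole run."""
    (t_first, c_first), (t_last, c_last) = samples[0], samples[-1]
    return 100.0 * (c_last - c_first) / ((t_last - t_first) * n_cores)


# Hypothetical readings, taken once per second on a dual-core machine.
readings = [(0.0, 0.0), (1.0, 0.9), (2.0, 1.9), (3.0, 2.4), (4.0, 3.4)]
per_second = utilization_series(readings, n_cores=2)
overall = average_utilization(readings, n_cores=2)
print([round(p, 1) for p in per_second], round(overall, 1))
```

Working from cumulative CPU time rather than averaging instantaneous percentages keeps the result correct even if the sampling interval is uneven.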
Dictation Processor Utilization
[Task Manager screenshots: DNS8 at Maximum, Medium, and Minimum Accuracy; MSWord at Maximum, Medium, and Minimum Accuracy]
One thing is immediately clear: Dragon NaturallySpeaking requires far more CPU time than the speech recognition in Microsoft Office. Even at its lowest accuracy setting, Dragon essentially matches the CPU usage of Microsoft's tool at its maximum accuracy setting. However, CPU usage and accuracy are only two aspects of these packages, and the much harder to describe "user experience" continues to be far better for me with Dragon NaturallySpeaking.
The second major point of interest is that having a second processor core does absolutely nothing for these speech recognition packages. (Judging by the CPU usage, MS might even run without difficulty on a Pentium 3.) Sure, if you're running multiple applications that are all trying to use the CPU, the second core can be useful. On the other hand, if the only thing you're doing is dictating speech, the current algorithms are clearly single-threaded in nature. Given that accurate speech recognition depends in large part on recognizing the context of sounds -- this is especially true for homophones like their, they're, and there -- there may be some difficulty associated with breaking the task into meaningful, discrete parts. However, difficult does not mean impossible, and with AMD, Intel, and all the other major CPU players moving towards multiple cores, further improvements in accuracy are likely going to require multithreaded algorithms.
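To make the "meaningful, discrete parts" idea concrete, here is one way a recognizer could in principle divide the work: cut the audio at pauses, so the words of a phrase (and their context) stay together, then hand each phrase to a separate worker. This is purely an illustrative sketch under my own assumptions, not a description of how Dragon or Microsoft's engine works. The names split_at_pauses, recognize_all, and fake_recognize are hypothetical, and the threshold values are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_at_pauses(
    samples: Sequence[float], threshold: float = 0.05, min_pause: int = 3
) -> List[List[float]]:
    """Cut the signal wherever at least `min_pause` consecutive samples stay
    below `threshold` in magnitude. The pauses themselves are dropped."""
    segments: List[List[float]] = []
    current: List[float] = []
    quiet: List[float] = []  # quiet samples seen since the last loud one
    for s in samples:
        if abs(s) < threshold:
            quiet.append(s)
            continue
        if len(quiet) >= min_pause:
            # Long pause: close the current phrase and start a new one.
            if current:
                segments.append(current)
            current = []
        elif current:
            # Short dip inside a phrase: keep it with the phrase.
            current.extend(quiet)
        quiet = []
        current.append(s)
    if current:
        segments.append(current)
    return segments


def recognize_all(
    segments: List[List[float]],
    recognize: Callable[[List[float]], str],
    workers: int = 2,
) -> List[str]:
    """Run `recognize` on each phrase concurrently, preserving phrase order.
    Note: Python threads do not speed up CPU-bound pure-Python code; a real
    engine would use native threads or separate processes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(recognize, segments))


def fake_recognize(segment: List[float]) -> str:
    """Stand-in for a real recognizer; only reports the phrase length."""
    return f"<{len(segment)} samples>"


signal = [0.4, 0.5, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.6, 0.7, 0.2]
segments = split_at_pauses(signal)
results = recognize_all(segments, fake_recognize)
print(results)
```

The catch is the one raised above: context that spans a pause is lost at each cut, so a scheme like this trades some accuracy for the ability to keep a second core busy.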
Transcription Processor Utilization
[Task Manager screenshots: DNS8 at Maximum, Medium, and Minimum Accuracy]
As with dictating, transcribing an audio file also fails to benefit from multiple CPU cores. The good news is that processing is much faster than live dictation, because a single CPU core can chew through the waveforms as fast as it is able rather than waiting on the speaker. While the maximum accuracy mode didn't seem to do all that well with dictating, it did seem to handle a few phrases better when transcribing. It also takes longer, but if you're in a situation where you can start the transcription process and walk away for a while, that shouldn't matter too much.
38 Comments
FrankyJunior - Sunday, April 30, 2006
For anyone that wants to try Dragon, I just noticed that the preferred version is in the CompUSA ad today for $99.
Never would have looked twice at it if I hadn't read this article yesterday.
NullSubroutine - Thursday, April 27, 2006
are we to the day when i say 'computer' and it does what i want, and when i time travel by going around the sun ill be confused when they hand me a mouse and keyboard when wanting to use a computer?
JarredWalton - Thursday, April 27, 2006
Almost. And if you go around the sun *backwards* you can travel through time in the other direction. :D
quanta - Tuesday, April 25, 2006
How about a review based on VoiceBox Technologies (http://www.voicebox.com) products? It was demonstrated on Discovery Channel, and it seems to work without extensive voice training, and it actually _understands_ human speech. The Discovery Channel segment can be found here: http://www.exn.ca/dailyplanet/view.asp?date=3/13/2...
rico - Tuesday, April 25, 2006
Where did you find Dragon Pro for $160? I thought it usually cost about $800. Thanks.
JarredWalton - Tuesday, April 25, 2006
Heh, sorry - got "Preferred" and "Professional" mixed up. I'm not entirely sure what Pro includes, i.e. "Comes with a full set of network deployment tools."
Trying to surf through Nuance's site is a bit tricky, and finding prices takes some effort as well. I think the only difference between Standard and Preferred is the ability to transcribe recordings in Preferred - can anyone confirm for sure? I asked Nuance and didn't get a reply.
Tabah - Sunday, April 23, 2006
Excellent article/review. Here's the question I've been wondering. Personally I use DNS for blogging and generally anything that requires excessive typing. A friend of mine on the other hand swears by IBM ViaVoice. Any chance we could get a comparison article/review at a later date?
JarredWalton - Tuesday, April 25, 2006
I will try to get in touch with IBM. I'm sure they wouldn't mind participating in a follow-up article.
Tabah - Tuesday, April 25, 2006
Oddly enough ViaVoice is licensed by Nuance so you might have a better chance talking to them. The main reason I'd like to see a comparison between VV and DNS isn't so much because they're made/released by the same company, but because of the cost difference between them. Like I said before I really like DNS but VV at the high end (VV Pro USB vs DNS Pro) is still a few hundred dollars cheaper.
Poser - Sunday, April 23, 2006
Listening to the dictation files, I was amazed that all the punctuation was spoken. I would have expected that they would (or could) be replaced by using a non-speech sound. Something along the lines of a click of the tongue for a comma -- there's a good number of distinct sounds you can make with your tongue that we don't have words for but that anyone could recognize and make. Think of "The Gods Must be Crazy" and the language used by the Kalahari bushmen for an extreme example.
Also, thanks for the article, it was really interesting and potentially very helpful! I'll hold off until Vista hits and I see some comparisons, but I'm certain now that I'll end up using one of the two.