In the beginning, humans communicated with their computers using
soldering irons and voltmeters. Needless to say, this grew tiresome
quickly. So someone had the bright idea of using toggle switches and
light bulbs. But that wasn't so hot either, so soon scientists figured
out a way to feed their computers instructions on little cards with
holes in them; the computers spat their own cards out the other end.
Still pretty awkward. Things started really cooking when keyboards and
monitors came along. Now people were communicating in a strange dialect
with words like mv and grep. The enter key meant "do it." And if you
wanted to read the results in the bathroom, you could print them on a
big old clunky lineprinter.
These days, the march to make computers communicate in ways that come
naturally to humans continues. In the quest for a perfectly transparent
user interface, speech is perhaps the final frontier-short of direct
brain-link.
Admit it, since you were a kid you wanted to talk to a computer the way Mr. Spock talks to his computer aboard the Enterprise.
"Computer, what time is it?"
"Ten fifteen."
"Shoot, I'm late for my pon farr. Hey, print off the latest science officer data for me, wouldja?"
"Sure thing, Spock. You want that by animal or mineral?"
If that sounds far-fetched, keep reading. In this article, I'll bring
you up to speed on what's happening with computer speech, and I'll show
you how to write a simple talking clock program that speaks the current
time of day whenever you ask, "What time is it?" Really!
Wherefore Speech?
I know what you're
thinking. "Why would I want to talk to my computer?" "Why would I want
my computer talking to me?" You imagine a cacophony of computers and
people gabbing away in their cubicles. You think how silly you'd feel
sitting in your home office, talking to a beige box on your desk.
Well, it's true that keyboards and mice are in little danger of
becoming obsolete any time soon, but there are nevertheless many
situations where speech is useful. Have you ever played a computer game
where a character asks you a question? A cartoon-style text balloon pops
out of the character's mouth and you answer by clicking a button.
Wouldn't it be more natural if the character really spoke? And for you
to answer back in English? Or French, if that's your preference.
Or how about this. Your screen is littered with toolbars, and you
can't remember whether it's Ctrl-Alt-F8 to double-underline, or
Alt-Shift-F4 or Alt-Control-Shift-Mumble-Whatever. Why not just select
some text and say, "Double underline this"? You wouldn't have to shout;
you could say it softly.
Or maybe you'd like to call your bank and transfer some money from
savings to checking. Instead of playing twenty questions with the
synthetic phone operator as you maneuver through seven levels of
prerecorded menus, why not just say, "Transfer one hundred dollars from
savings to checking?" Of course, in that case, you'd be talking to the
bank's computer, not yours-but you might want to call your own PC to
ask, "Do I have any email?" or "Look up Mary Smith's number in my
address book." No fussing with buttons while you're driving in traffic;
no need for a laptop or modem. Just dial up and talk!
Or, if you're one of the millions of people who suffer from
repetitive motion injuries like carpal tunnel syndrome, why not give
your fingers a break once in a while? Don't type, dictate. There are
other "hands off" situations where people need to use their computers
while doing something else like operating a piece of machinery. Or maybe
you just want your computer to read words or numbers back to you as you
type them, to help catch typing errors. These are just a few areas
where computer speech is really useful.
How Do They Do That?
You don't need to
understand the intricacies of speech technology to use it in your apps,
but I suspect many of you are curious, so I figured I'd give you a very
short overview of how it works.
There are two basic technologies: speech recognition (SR) and speech
synthesis, depending on who is doing the talking-you or the computer.
Speech synthesis is commonly called "text-to-speech" or TTS, since the
speech is usually synthesized from text data.
Figure 1 shows the architecture of a typical text-to-speech engine.
Figure 1 Text-to-Speech Engine
The process begins when the application hands the engine a string of
text such as, "The man walked down 56th St." The text analysis module
converts numbers into words, identifies punctuation such as commas,
periods, and semicolons, converts abbreviations to words, and even
figures out how to pronounce acronyms. Some acronyms are spelled out
(MSJ) whereas others are pronounced as a word (FEMA). The sample
sentence would get converted to something like
<beginStatement>
The man walked down fifty sixth street
<endStatement>
Text analysis is quite complex because written language can be so
ambiguous. A human has no trouble pronouncing "St. John St." as "Saint
John Street," but a computer, in typically mechanical fashion, might
come up with "Street John Street" unless a clever programmer gives it
some help.
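To get a feel for the kind of special-casing involved, here's a toy fragment in C (mine, not from any real engine) that picks an expansion for "St." the way a text analysis module might:
#include <ctype.h>

// Toy text-analysis rule: decide how to expand "St." from context, the way
// a real TTS text-analysis module does on a much larger scale. A capitalized
// word following "St." suggests "Saint"; otherwise we read it as "Street".
const char *ExpandSt(const char *pszNextWord)
{
    if (pszNextWord && isupper((unsigned char)pszNextWord[0]))
        return "Saint";      // "St. John"  -> "Saint John"
    return "Street";         // "John St."  -> "John Street"
}
A real engine uses pronunciation dictionaries and many more rules than this, of course, but the flavor is the same: look at the neighbors before committing to a reading.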
Once the text is converted to words, the engine figures out what
words should be emphasized by making them louder or longer, or giving
them a higher pitch. Other words may be deemphasized. Without word
emphasis, or "prosody," the result is a monotone voice that sounds
robotic, like something out of a '50s sci-fi flick. After adding
prosody, the sample sentence might end up like this:
<beginStatement>
<de-emphasize>the <emphasize>man walked
<emphasize>down fifty <emphasize>sixth street<pause>.
<endStatement>
Next, the text-to-speech engine determines how the words are
pronounced, either by looking them up in a pronunciation dictionary, or
by running an algorithm that guesses the pronunciation. Some text
strings have ambiguous pronunciations, such as "read." The engine must
use context to disambiguate the pronunciations. The result of this
analysis is the original sentence expressed as phonemes: "Th-uh M-A-N
w-au-l-k-t D-OU-N f-ih-f-t-ee S-IH-K-S-TH s-t-r-ee-t".
Next, the phonemes are parsed and their pronunciations retrieved from
a phoneme-to-sound database that numerically describes what the
individual phonemes sound like. If speech were simple, this table would
have only forty-four entries, one for each of the forty-four English
phonemes (or whatever language is used). In practice, each phoneme is
modified slightly by its neighbors, so the table often has as many as
1600 or more entries. Depending on the implementation, the table might
store either a short wave recording or parameters that describe the
mouth and tongue shape. Either way the sound database values are finally
smoothed together using signal processing techniques, and the digital
audio signal is sent to an output device such as a PC sound card and out
the speakers to your ears.
That's text-to-speech. Speech recognition is the flip side.
Figure 2
shows a generic speech recognition engine. When the user speaks, the
sound waves are converted into digital audio by the computer's sound
card. Typically, the audio is sampled at 11KHz and 16 bits. The raw
audio is first converted by the frequency analysis module to a more
useful format. This involves a lot of digital signal processing that's
too complicated to describe here. The basic challenge is to extract the
meaningful sound information from the raw audio data. If you were to say
the word "foo," and then say "foo" again, and look at the waveforms
generated, they would look kind of similar, but there's no way to
compare them that consistently recognizes them as the same sound without
applying some pretty hairy mathematical techniques using Fourier
transforms. Fortunately, people have already figured this stuff out.
Figure 2 Speech Recognition Engine
The converted audio is next broken into phonemes by a phoneme
recognition module. This module searches a sound-to-phoneme database for
the phoneme that most closely matches the sound it heard. Each database
entry contains a template that describes what a particular phoneme
sounds like. As with text-to-speech, the table typically has several
thousand entries. While the phoneme table could in theory be the same as
that used for TTS, in practice they are different because the SR and
TTS engines usually come from different vendors.
Because comparing the audio data against several thousand phonemes
takes a long time, the speech recognition engine contains a phoneme
prediction module that reduces the number of candidates by predicting
which phonemes are likely to occur in a particular context. For example,
some phonemes rarely occur at the beginning of a word-like the "ft"
sound you hear at the end of "raft." Other phonemes never occur in
pairs. In English, an "f" sound never occurs before an "s" sound. But
even with these optimizations, speech recognition still takes too long.
A word prediction database is used to further reduce the phoneme
candidate list by eliminating phonemes that don't produce valid words.
After hearing, "y eh," the recognizer will listen for "s" and "n" since
"yes" and "yen" are valid words. It will also listen for "m" in case you
say "Yemen." It will not listen for "k" since "yek" is not a valid
word. (Except in baby-talk, which is not currently supported.) The
candidate list can be reduced even further if the application stipulates
that it only expects certain words. If the app only wants to know if
the user said "yes" or "no," the phoneme recognizer needn't listen for
"n" following "y eh," even though "yen" is a word. This final stage
reduces computation immensely and makes speech recognition feasible on a
33MHz 486 or equivalent PC. Once the phonemes are recognized, they are
parsed into words, converted to text strings, and passed to the
application.
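To make the pruning idea concrete, here's a toy sketch in C (mine, not engine code): given the phonemes heard so far and the short list of words the app expects, a candidate phoneme survives only if it can still lead to one of those words.
#include <stdio.h>
#include <string.h>

// The app said it only expects "yes" or "no", expressed as phoneme strings.
static const char *g_apszExpected[] = { "y eh s", "n ow" };

// Returns 1 if pszNext can extend pszHeard toward an expected word.
int IsViableNextPhoneme(const char *pszHeard, const char *pszNext)
{
    char szTry[64];
    int  i, len;

    if (*pszHeard)
        sprintf(szTry, "%s %s", pszHeard, pszNext);   // e.g. "y eh" + "s"
    else
        strcpy(szTry, pszNext);
    len = strlen(szTry);

    for (i = 0; i < sizeof(g_apszExpected)/sizeof(g_apszExpected[0]); i++)
        if (strncmp(g_apszExpected[i], szTry, len) == 0 &&
            (g_apszExpected[i][len] == ' ' || g_apszExpected[i][len] == '\0'))
            return 1;    // still a prefix of a word we're listening for
    return 0;
}
With this vocabulary, IsViableNextPhoneme("y eh", "s") succeeds and IsViableNextPhoneme("y eh", "n") fails, which is exactly why the recognizer needn't listen for "n" after "y eh" when only "yes" and "no" are expected.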
As you might imagine, both text-to-speech and speech recognition
involve quite a bit of processing, but speech recognition is harder
because it usually requires more processing for equivalent user
satisfaction. A few years ago, you needed a high-end workstation to do
speech recognition. Today, just about every new PC and even many older
PCs can handle speech. While the exact requirements vary from one speech
engine to another,
Figure 3
gives you a rough idea of the hardware needed to run various kinds of
speech applications under Windows. The faster the CPU and the more
memory available, the higher the accuracy for speech recognition and the
better the text-to-speech sounds.
Of course, you also need a sound card, microphone, and speakers. Most
speech engines will work with any sound card. Some systems offload
processing onto a DSP (digital signal processor) chip that comes on some
high-end sound cards, which cuts the CPU speed requirement in half.
Better microphones and speakers will also improve things.
As speech has become more feasible on average PCs, vendors have been
busy developing and promoting their speech engines. Many multimedia PCs
and sound cards come bundled with speech software. Other vendors sell
their engines as standalone products. Some apps even come bundled with
speech engines.
Unfortunately, as with any budding technology, the situation is a bit
chaotic. Even though they all support similar functionality, each
speech engine has its own specific features and proprietary API. If you
want to use speech in your app, you've first got to pick which engine to
use, and write your program for that engine. If a better engine comes
along, you're out of luck. You'll probably have to rewrite your program
substantially to use the other API. Proprietary APIs have stifled the
widespread adoption of speech. When faced with an irrevocable decision
about which engine to use, many developers choose not to implement
speech at all.
The Microsoft Speech API
The Microsoft®
Speech API is an attempt to correct this problem. By promoting an
industry-standard programming interface for speech, Microsoft hopes to
encourage developers to write speech-enabled apps. But I'm not here to
spout business strategies, I'm here to tell you about the API!
The Speech API lets you write Win32®-based apps (for Windows® 95 or
Windows NT™) that use speech recognition and text-to-speech. The API is
specified as a collection of OLE Component Object Model (COM) objects.
Using OLE makes speech readily available to developers writing in Visual
Basic®, C/C++, or any other programming language that can access OLE
objects directly or through automation. The Speech API requires Windows
95 or Windows NT 3.51, and since the API itself doesn't do any speech processing,
you still need a third-party speech engine, one for SR and one for TTS.
As with other Windows Open Services Architecture (WOSA) services, the
Speech API is intended as a standard interface that application
developers and engine vendors alike can code to. Programmers can write
apps without worrying about which engine to use, engine vendors can get
instant compatibility with all speech apps, and users gain the freedom
to choose whichever speech engine meets their budget and performance
requirements. The situation is analogous to GDI, which lets programs
draw graphics without worrying about what kind of display card or
monitor the user has. Just like GDI, the Speech API provides escape
hooks to access proprietary engine features when you need to do
something special.
The Speech API offers two levels of access: high-level objects
designed to make implementation easy, and low-level objects that offer
total control but make you do a little more work. If all your program
does is listen for a few voice commands and utter some simple phrases,
you can use the high-level objects. To do more sophisticated stuff, you
need the low-level.
The high-level objects, provided by Microsoft, don't do any SR or TTS
themselves; they just call the low-level objects to do the work. The
low-level objects are provided by the speech engine vendor, just like
the video and sound card drivers that come with your display or sound
card. When your app uses the low-level API, it's talking directly to the
third-party code, bypassing Microsoft code completely (see
Figure 4).
The low-level API is too complex to describe here, so I'll focus on the
high-level objects and just give you a quick overview of the low-level
stuff.
Figure 4 Using the Low-level Speech API
Whichever you use, you'll be dealing with OLE objects.
Figures 5 and
6
show the main OLE objects and interfaces that constitute the Speech
API. Don't worry, you'll probably never need to use most of the objects
shown in
Figure 6.
The objects you're most likely to use are voice commands for speech
recognition and voice text for text-to-speech. Microsoft also provides a
speech recognition sharing object that lets several apps share engines.
Voice Commands and the Talking Clock
To show just how easy it is to write apps that talk and listen, I wrote a talking clock program (see
Figure 7)
that speaks the time and/or date whenever you ask "What time is it?" or
"What day is it?" Clock will probably seem like a ghost of an app to
you: it has no menu, and in fact it doesn't even have a window! There's
no need for either, since all it does is talk in response to verbal
commands. Of course, most speech apps will still have menus and windows
and generally look like normal apps. Clock merely demonstrates that they
don't have to.
Figure 7 Voice Commands and Menus
Clock uses the high-level SR Voice Commands object to listen for
commands from the user. The main interface, IVoiceCmd, provides
functions to do simple "command and control" speech recognition. Users
can issue simple commands like "Open the file" and answer simple yes/no
questions. For more sophisticated kinds of speech recognition such as
dictation, you'd have to use the low-level API.
Voice Commands work a lot like traditional Windows menus. You first
create a voice menu of commands you want to listen for, then you listen
for them. Pretty simple. Most programs will have one voice menu for the
main window, and one for every dialog box. When the SR engine hears a
command, it notifies the appropriate (active) app.
The Voice Commands
module actually includes a few different objects. The main one is the
Voice Commands object, which provides basic functions to turn speech
recognition on or off and create voice menu objects.
The first thing Clock does is initialize OLE by calling CoInitialize.
(If you're using MFC, all you have to do is check the "Container" or
"Both container and server" check boxes when AppWizard asks what kind of
OLE support you want; AppWizard generates a call to AfxOleInit in your
app's InitInstance function.) Once OLE is initialized, Clock creates a
Voice Commands object.
CoCreateInstance(CLSID_VCmd, NULL,
CLSCTX_LOCAL_SERVER,
IID_IVoiceCmd,
(LPVOID *)&gpIVoiceCommand);
CoCreateInstance creates a local instance of the Voice Commands
object. CLSID_VCmd is the class ID. CLSCTX_LOCAL_SERVER indicates that
the object should be created on the local machine, but in a different
process from the app. The active application, such as a word processor,
can have a voice menu listening while Clock's menu is listening too. If
the user says, "Print the document," the command goes to the word
processor; if the user asks, "What time is it?" Clock gets the command.
IID_IVoiceCmd is the interface ID for the Voice Commands interface and
gpIVoiceCommand is a pointer to this interface that's filled in by
CoCreateInstance. All the symbols you need are defined in SPEECH.H.
To actually create the object, CoCreateInstance fires up
WINDOWS\SPEECH\VCMD.EXE (if it's not already running). Some other DLLs
are used too: VCMSHL.DLL contains marshaling code, and SPEECH.DLL
contains some objects for the low-level API. Each engine also has its
own DLLs. But as far as the app and you are concerned, everything is
handled by OLE. You don't have to worry about what files are loaded, it
all happens automagically.
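One note: CoCreateInstance can fail-if the speech run times or an SR engine aren't installed, for instance-so a slightly more defensive version of the call (same globals as above; the error message text is mine) would check the return value:
HRESULT hr = CoCreateInstance(CLSID_VCmd, NULL, CLSCTX_LOCAL_SERVER,
                              IID_IVoiceCmd, (LPVOID *)&gpIVoiceCommand);
if (FAILED(hr)) {
    // No speech run time or SR engine installed, or VCMD.EXE won't start.
    MessageBox(NULL, "Speech recognition is not available.",
               "Talking Clock", MB_OK | MB_ICONEXCLAMATION);
    gpIVoiceCommand = NULL;   // run without voice commands
}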
Before you can create a menu, you must register a notification sink.
gpIVoiceCommand->Register("",
&gVCmdNotifySink, // notification sink
IID_IVCmdNotifySink, // interface ID
0, // high priority notifications
NULL); // VCSITEMINFO
The empty string tells the Voice Commands object to listen to the
default wave-in device, normally the microphone. Alternatively, you can
pass a string like "Line1" to listen for commands over phone line number
one. The string refers to a system registry entry that identifies the
SR engine and wave device to use. gVCmdNotifySink is the notification
sink-which I'll describe shortly-and IID_IVCmdNotifySink is the
interface ID, which identifies what kind of sink it is. Currently,
IVCmdNotifySink is the only one, but in the future others may be
supported. The 0 tells Voice Commands to send Clock only the most
important notifications. Voice Commands can notify apps when the user is
talking too loud, but Clock doesn't care about that.
Once you've registered a notification sink, you can create a voice
menu. The system supports multiple voice menus that can be independently
activated (listening) or deactivated (not listening). A CAD program
might have one voice menu with commands such as "Save the file" that are
always active, and another voice menu with commands like "Rotate 90
degrees" that are only active when something is selected. Unlike normal
Windows menus, several voice menus can be active at the same time. You
can even make a menu global, so it's still listening when your app
doesn't have the focus. Clock does this so the user can ask "What time
is it?" while working in any app.
To create a menu, you set up a couple of structures that give the
menu a name and select a language. The API supports all languages, but
the user can obviously only use the languages actually installed on the
machine. Most speech engines support English, German, French, Japanese,
Spanish, Italian, and a few others. Because it's so expensive to produce
a language, many less common languages are not yet supported by any
engine-though I'm sure that somewhere, someone with nothing better to do
is at this very moment working on one for Klingon. To create the menu,
you call IVoiceCmd's MenuCreate function.
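Filling in the two structures might look something like this (a sketch; the field names are my reading of SPEECH.H and may differ in your copy):
VCMDNAME VCmdName;      // names the app and this particular menu
LANGUAGE Language;      // selects the spoken language

strcpy(VCmdName.szApplication, "Talking Clock");   // assumed field name
strcpy(VCmdName.szState, "Main commands");         // assumed field name
Language.LanguageID = MAKELANGID(LANG_ENGLISH, SUBLANG_ENGLISH_US);
Language.szDialect[0] = '\0';                      // no particular dialect
Then comes the call itself.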
gpIVoiceCommand->MenuCreate(
&VCmdName, // menu name
&Language, // language
VCMDMC_CREATE_TEMP, // don't archive
&gpIVCmdMenu // ptr to menu (returned)
);
VCmdName and Language identify the menu name and language;
VCMDMC_CREATE_TEMP tells the API to create a temporary menu, which will
not have its contents archived to disk. You can create permanent menus
that are saved in a database so that load times are faster, but Clock
doesn't. gpIVCmdMenu is filled with a pointer to the IVCmdMenu interface
for the new menu object. The menu starts out empty. IVCmdMenu has
methods that add, remove, and modify voice commands. For Clock, I wrote a
wrapper function, AddCommand, that bundles its arguments into a
structure and passes it to IVCmdMenu::Add.
AddCommand(gpIVCmdMenu,
"What time is it?",
IDC_WHATTIMEISIT);
AddCommand(gpIVCmdMenu,
"What day is it?",
IDC_WHATDAYISIT);
AddCommand(gpIVCmdMenu,
"Stop running Talking Clock.",
IDC_STOPRUNNING);
I added the commands one at a time, but you can add hundreds of
commands in a block if you want. Note how the commands are given as
ordinary ASCII strings-you don't have to mess with phonetic
representations or anything like that. The IDC_XXX constants identify
the commands, similar to normal menu IDs. The API imposes no limit on
the number or size of commands, but accuracy and performance will
degrade if you add more than a few hundred. To actually start listening,
Clock activates the menu:
gpIVCmdMenu->Activate(NULL, 0);
I pass NULL for the window handle to make the menu global, so Clock
listens all the time, even when another app has focus. The chances are
pretty good that no other app will be listening for any of the three
commands in Clock-but if one does, the system is smart enough to notify
it, rather than Clock. Assuming no such conflict, when the SR engine
hears "What time is it?" (or either of the other two commands), it
notifies Clock through Clock's notification sink.
In OLE, a notification sink is just a callback object that some object uses to notify your app when something happens (see
Figure 7).
Each sink has its own interface of notification functions. Clock
implements an object, CIVCmdNotifySink, that has the IVCmdNotifySink
interface (see
Figure 8).
The only notification that Clock cares about is CommandRecognize; all
the other functions have empty implementations. When the SR engine hears
"what time is it?" it calls CIVCmdNotifySink::CommandRecognize.
STDMETHODIMP
CIVCmdNotifySink::CommandRecognize(DWORD dwID,...)
{
.
.
.
switch (dwID) {
case IDC_WHATTIMEISIT:
// Speak the time (described later)...
break;
case IDC_WHATDAYISIT:
// Speak the date (described later)...
break;
case IDC_STOPRUNNING:
DestroyWindow (ghWndMain);
break;
};
return NOERROR;
}
CommandRecognize has a lot of arguments, most of which Clock doesn't
use. The only important one is the command ID, dwID. As with a
WM_COMMAND message, you do a switch on the ID. If your app has a normal
Windows menu with the same actions, you should use the same IDs. In
fact, you could even pass the notifications to your main window as a
WM_COMMAND message.
STDMETHODIMP
CIVCmdNotifySink::CommandRecognize(DWORD dwID,...)
{
SendMessage(ghWndMain, WM_COMMAND, dwID, 0);
return NOERROR;
}
If you're using MFC, you'd send the message to
AfxGetApp()->m_pMainWnd instead of ghWndMain-or perhaps you'd store a
pointer to the main window in your CIVCmdNotifySink. Of course, as with
all OLE objects, you've got to release them when you're finished.
// Release menu
if (gpIVCmdMenu)
gpIVCmdMenu->Release();
gpIVCmdMenu = NULL;
// Release Voice Commands object
if (gpIVoiceCommand)
gpIVoiceCommand->Release();
gpIVoiceCommand = NULL;
// Terminate OLE
CoUninitialize ();
This sequence appears in Clock's ShutDown function, called at the end
of WinMain as Clock is terminating. In MFC, you could release the
objects in your main window's OnDestroy handler or in your app's
ExitInstance function. With MFC, you don't have to terminate OLE; it
takes care of that for you.
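In MFC terms, the cleanup might look something like this (a sketch; CClockApp and the m_p member names are hypothetical):
// Sketch of an MFC-style cleanup; the class and member names are made up.
int CClockApp::ExitInstance()
{
    if (m_pIVCmdMenu)     { m_pIVCmdMenu->Release();     m_pIVCmdMenu = NULL; }
    if (m_pIVoiceCommand) { m_pIVoiceCommand->Release(); m_pIVoiceCommand = NULL; }
    // No CoUninitialize here; MFC shuts OLE down itself after AfxOleInit.
    return CWinApp::ExitInstance();
}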
That's it! Clock now recognizes your voice! Of course, it doesn't
actually do anything since I haven't added the text-to-speech part yet.
Voice Text
To make Clock talk, I need voice text, the high-level object for text-to-speech (see
Figure 9). The voice text module has only one object, the voice text object. Using it is pretty straightforward.
CoInitialize(NULL); // if you haven't done it already
CoCreateInstance(CLSID_VTxt,
NULL,
CLSCTX_LOCAL_SERVER,
IID_IVoiceText,
(LPVOID *)&gpIVTxt);
Figure 9 Voice Text
It's pretty much the same as creating a Voice Commands object; only
the IDs are different. As with Voice Commands, you must register a sink
to receive notifications:
gpIVTxt->Register("", // default wave device
gszAppName,           // app name
NULL,                 // notify sink
IID_IVTxtNotifySink,  // notify sink IID
NULL,                 // flags
NULL );               // VTSITEINFO*
The empty string selects the default wave out device, normally the
sound card. You could use Line1 or some other audio output device
defined in the registry. Voice text calls IVTxtNotifySink whenever
something happens; for example, when the TTS engine starts or stops
talking, or when someone (the user or another app) changes global
attributes such as the voice's volume or pitch. Clock doesn't care about
any of that and doesn't implement an IVTxtNotifySink, so it passes NULL
for the notification sink. But even if your sink is NULL, you still have
to register because voice text needs your app's name. That's all the
setup you need; when it's time to talk, just get the time and call
IVoiceText::Speak.
SYSTEMTIME st;
TCHAR szTemp[128];
strcpy (szTemp, "The time is ");
GetLocalTime (&st);
GetTimeFormat (0, TIME_NOSECONDS, &st, NULL,
szTemp+strlen(szTemp), sizeof(szTemp)-strlen(szTemp));
gpIVTxt->Speak( szTemp, VTXTSP_NORMAL, NULL );
The call to Speak happens asynchronously. That is, control returns
immediately; your app doesn't wait for the computer to finish speaking.
(But when it does, it can notify you through IVTxtNotifySink.) Like other
Win32 API functions, IVoiceText::Speak accepts ANSI or Unicode, as
determined by the compile-time #define symbol UNICODE.
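For the record, the "What day is it?" branch is nearly identical; here's a sketch using the same globals and the same ANSI-style buffer handling as the time example above (GetDateFormat is the date-formatting cousin of GetTimeFormat):
// Sketch of the IDC_WHATDAYISIT handler: build a sentence with the
// long date format and hand it to the voice text object.
SYSTEMTIME st;
TCHAR szTemp[128];

strcpy (szTemp, "Today is ");
GetLocalTime (&st);
GetDateFormat (LOCALE_USER_DEFAULT, DATE_LONGDATE, &st, NULL,
    szTemp+strlen(szTemp), sizeof(szTemp)-strlen(szTemp));
gpIVTxt->Speak( szTemp, VTXTSP_NORMAL, NULL );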
That's it. Clock now talks! If you don't believe me, grab the code
(from the usual MSJ sources) and run it yourself! Of course, you need a
sound card, speakers, a microphone, and the Speech API. I'll tell you
how to get the API at the end of the article.
Clock doesn't provide any way for the user to select or change the
sound of the computer voice. That's because it doesn't need to. Voice
text uses whatever the user has selected as the system default. Most
people want their computers to always speak with the same voice. The
voice quality (male/female, the pitch, and so on) is specified through a
Control Panel applet called Microsoft Voice, installed as part of the
Microsoft Voice setup (for more, see the sidebar). You can change the
voice programmatically if you like-games may even need several
voices-but you need the low-level API for that. My advice is to avoid it
for most apps. You don't want to annoy users who have taken the trouble
to select their ideal cybervoice. They might think their computer is
possessed.
Low-Level Grunge
Voice Commands and
voice text objects expose enough functionality to implement moderately
sophisticated speech apps. Clock uses only a few of the many functions
and features available through the high-level API. Still, there are
times when you need to do something more sophisticated, like take
dictation or use multiple voices. For that, you need the low-level API,
which lets you talk directly to the speech engine. There's not enough
room here to describe it in detail, but I can give you some idea of the
sorts of things you can do with the low-level objects.
Imagine that you're writing a transcription program that translates
an audio recording of a meeting or telephone call into text. Such a
program would need to use the low-level objects to perform dictation and
to "listen" to a wave file instead of the microphone. Here's a quick
walkthrough that explains how it might work (see
Figure 10).
Figure 10 Low-level SR Objects with Custom Audio Source
The app first determines where the audio should come from and creates
an audio source object through which to acquire digital audio data.
Microsoft supplies an audio source object that gets its audio from the
multimedia wave-in device (usually the microphone), but you can write
your own so that your app can get audio from wave files or specialized
hardware devices. The transcription app implements a custom audio source
to get audio from a wave file. This audio source object would probably
have a custom interface with functions like Open and Close that let the
app select which file to use.
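Such a custom interface might be declared like this (purely hypothetical names; the object would additionally implement the audio-source interfaces defined by the Speech SDK, which are what the engine itself talks to):
// Hypothetical app-side interface for a wave-file audio source object.
interface IWaveFileSource : public IUnknown
{
    STDMETHOD(Open)  (LPCSTR pszWaveFile) PURE;  // pick the file to transcribe
    STDMETHOD(Close) (void)               PURE;  // stop supplying audio
};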
The app would create an SR engine enumerator object (not shown in
Figure 6,
but provided by Microsoft), and use it to find the SR engine it wants
to use. You can search for engines that support specific languages or
features, the same way you might look for a font with serifs. For
example, the transcription program might look for an SR engine that
supports context-free grammars. (I'll explain what that is in a moment.)
Once it finds the right engine, it creates an instance of it and passes
it the audio source object.
The SR engine object has a dialogue with the audio source object to
find a common audio format. For example, 16-bit 11KHz pulse code
modulation (PCM). Your custom audio source would read the format from
the wave file to check that the file is in the right format. Assuming it
is, the engine registers an audio source notification sink with the
audio source object. Now the audio source object submits digital audio
data to the engine through the notification sink. All this happens
invisibly to the app, which only has to set things up.
The app next registers a main notification sink that receives
grammar-independent notifications such as whether or not the user is
speaking, or is speaking too loudly. You could use this information to
tell the user to speak softly. The transcription program would use it to
figure out when the user starts or stops speaking.
Next, the transcription program creates a grammar object. This plays
the same role as the voice menu object, except a grammar object
recognizes much more complex speech patterns. When you create a voice
menu, you provide a list of phrases to listen for; when you create a
grammar object, you provide a set of rules called a context-free grammar
that specifies which words can grammatically follow one another. A
typical rule might look something like this:
<Start> = [please] send mail to (Mike | Fred | Bob)
You can probably decipher the notation yourself. "Please" is
optional, while the parentheses and | (logical OR) symbols indicate that
either Mike or Fred or Bob is expected. A user could say, "Please send
mail to Mike," or "Please send mail to Bob," or "Send mail to Bob," and
so on.
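The notation composes, too. A slightly richer rule (purely illustrative) such as
<Start> = [please] (send | forward) (mail | a message) to (Mike | Fred | Bob)
covers twenty-four different sentences-two verbs times two objects times three names, with or without "please"-which is what makes a grammar far more compact than a flat list of voice commands.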
If the transcription program can't predict in advance what it's
listening for, it would forgo the context-free grammar approach and opt
for dictation. While context-free grammars are quite rich, they are not
very efficient. It would take more memory than your computer has to
store a context-free grammar for English. A dictation grammar is a
different kind of grammar with special tricks for reducing the number of
rules. A dictation grammar lets you express rules like "verb and noun
must agree in number."
Whichever grammar you use, the grammar object notifies your app when
something happens through yet another sink, the grammar notification
sink. When the grammar recognizes a word or phrase, or has other
grammar-specific information to report, it calls functions in the
grammar notification sinks. Your app implements a sink that responds by
taking whatever action it wants. The most important notification is
PhraseRecognize. The grammar provides a text string of the spoken words.
The transcription program would write them into a text file, perhaps
along with timing information.
Typically, the engine knows a lot more than just what was spoken. It
may have a list of alternative phrases (was it "Swing the cat" or "Swing
the bat"?), timing information, or information about who is speaking.
You can request a results object and interrogate it to find out more.
This is how you'd get timing information.
The low-level speech objects are designed to support just about any
feature a contemporary speech engine might offer. Because the API is so
broad, not every engine supports every interface. This is especially
true with results objects. For example, every engine can return the
spoken words and timing information, but very few can identify the
speaker. As a way of dealing with this, the formal API specification
identifies a core set of mandatory features, and provides a mechanism to
query which of the optional ones a given engine supports.
So much for speech recognition. The low-level text-to-speech objects
are similar, but not as complex. An example of a program that might use
low-level TTS functions is a mixer program that merges TTS with an audio
file. You might write some poetry as text, then mix it with some MIDI
music to create your own multimedia art. (If you do, please do not send
it to MSJ.)
To implement the TTS mixer, you'd have to implement a custom audio
destination object to receive the spoken words. Microsoft supplies an
audio destination object for the default multimedia wave-out device
(sound card), but you can implement your own. The mixer would need an
audio destination that mixes the TTS signal with background music. Your
custom audio destination object would accept digital audio from the TTS
engine and, for every sample received, would read a sample of equal
duration from the wave file, add the amplitudes, and send the combined
audio to the multimedia wave-out device-or perhaps write it to another
wave file.
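The heart of such an audio destination is just a mixing loop. Here's a sketch, assuming both streams are 16-bit PCM at the same sampling rate:
// Mix one buffer of 16-bit TTS audio with an equal number of samples of
// background music, clamping the sum so loud passages don't wrap around.
void MixBuffers(const short *pTTS, const short *pMusic,
                short *pOut, int nSamples)
{
    int i;
    for (i = 0; i < nSamples; i++) {
        long lSum = (long)pTTS[i] + (long)pMusic[i];   // add the amplitudes
        if (lSum >  32767) lSum =  32767;              // clamp to 16-bit range
        if (lSum < -32768) lSum = -32768;
        pOut[i] = (short)lSum;
    }
}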
The same sort of handshaking goes on as with SR. You'd use a TTS
engine enumerator object to find a TTS engine with the desired features,
then hook the engine up to your custom audio destination. As part of
the "hooking up," the TTS engine would have a dialogue with the audio
destination object to find a common audio format, then set up an audio
destination notification sink that your audio destination object would
use to inform the engine when it starts or stops playing, or when your
buffers are overflowing, and so on (see
Figure 11). As with speech recognition, the handshaking happens invisibly to the app.
Figure 11 Low-level TTS Objects with Custom Audio Destination
The app registers a main notification sink that receives
buffer-independent notifications, such as whether or not the engine is
speaking, and what the lip-positions are. Lip positions are typically
used to synchronize speech with animation or other real-time events.
When the mixing program is ready to mix, it passes the engine one or
more text strings, which are "spoken" to the audio destination. While
the voice text object accepts only text, the low-level API lets you send
phonetic descriptions or tagged text as well. You might use phonetic
information to ensure that foreign names such as Grbac or Tchlzinski are
pronounced correctly; while tagged text can contain bookmarks or other
special embedded codes that tell the TTS engine which words to
emphasize, when to change its voice, how quickly to speak, how long to
pause between words and so on. When the engine "speaks" it's really just
sending digital audio to the audio destination object, which decides
what to actually do with it. Instead of sending the audio to the sound
card, your custom audio destination would mix it with background music.
If you want to know when particular words are being spoken, in order
to synchronize the music to the words, you can insert bookmarks into
your text and register a buffer notification sink for each text buffer
you mix. When the TTS engine reaches a bookmark within the text, it
calls functions in the buffer notification sinks. A bookmark is just a
special tag embedded in the text (for example: "...\mrk=3453\...") that
sends a notification to the app rather than being verbalized.
Reality and Some Words of Advice
Speech
recognition and text-to-speech let you create programs that listen and
talk. They add a whole new dimension to a user interface. Unfortunately,
the technology is still a long way from Star Trek. Full dictation still
requires very fast hardware such as a dual Pentium or P6. And simple
speech recognition isn't good enough for some purposes, like dialing
individual digits over the phone. Even if an engine gets 99 percent
accuracy per digit, after the user speaks ten digits in a row, there's
only a 90 percent chance they're all correct-and that doesn't include
your calling-card access number! It might be a great way to meet new
friends, but most people won't accept the error rate.
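(The arithmetic: each digit is recognized independently, so all ten come out right with probability 0.99 multiplied by itself ten times, which is about 0.90.)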
Occasionally apps like Clock will get a CommandRecognize for the
wrong command. If another program is listening for "What pay is it?"
there's a chance the engine might mess up. The percentage of time a
recognizer gets the correct answer is called "accuracy." Accuracy
depends a lot on what's being listened for. Like humans, computers tend
to confuse similar-sounding words. If all the commands are relatively
dissimilar, you can get pretty good accuracy, up to 98 percent. If
that's not good enough, you can always try changing your commands to
something else, like "What's the time?" or "Computer, what time is it?"
Another common problem is that SR engines like to hear. If a user
says, "What mime is it?" or "What mines it?" there's a good chance the
recognizer will hear "What time is it?" Occasionally, a user will say
something completely different like "Go away, you slime," but the engine
will again recognize "What time is it?" The ability for a recognizer to
reject what the user said is called, not surprisingly, rejection.
Unfortunately, SR engines aren't as good as humans at rejection. They're
not as picky. You should take this into account when you design your
speech app.
And of course, don't forget that while sound is becoming more and
more prevalent on PCs, not everyone has a sound card, speakers, and a
microphone. Even those who do may not want their computers jabbering
away at them, or have to listen to themselves jabbering.
So while speech can be extremely useful in many places, it's best to
use it sparingly. For all but very specialized applications, speech
should be optional. And even if speech recognition advances to the point
of Star Trek, there will still be places where it's inappropriate. In
my opinion, you wouldn't want to write an action game that made the
player say "fire" to shoot his weapon, because it would always be faster
to press the trigger.
In the future, Microsoft will extend the Speech API to add
intelligence to dictation systems so they don't just transcribe
word-for-word, but act more like a real person. For example, rather than
returning "October first nineteen ninety five" when these words are
spoken, they'll come back as "October 1, 1995." Microsoft will also
enhance the Voice Commands module to accept more natural speech, so it
can recognize "Tell me the time" or "Give me the time" as equivalent
forms of "What time is it?" without any extra work from the application.
These improvements take advantage of advances in speech technology from
independent engine vendors.
Where To Get It
I've only touched on
some of the capabilities of the Microsoft Speech API. Complete details
can be found in the Microsoft Speech SDK, which at the time of this
writing was in final beta and should be released to manufacturing by the
time this article appears in print. For now, Microsoft Voice is being
distributed along with the SDK which should be available on the March
'96 MSDN CD-ROM. This includes executables, which you may distribute
royalty-free with applications or speech engines, as well as
documentation, tools, and sample programs for you to use. Microsoft
Voice will be distributed by OEMs with different machines and/or sound
cards and may also be bundled with future Microsoft products. In the
interim, if you want more information, or wish to obtain a copy of the
SDK and Speech run times, send email to MSSpeech@microsoft.com.
Microsoft Voice
Paul DiLascia
When
MSJ asked me to check out the latest in speech technology
from Microsoft, I popped the Microsoft Voice floppy (actually, there
are two) into my multimedia machine-a 486/66 with 28MB of RAM, a
SoundBlaster 16 with cheapo speakers and a $10 microphone from Radio
Shack-and typed SETUP.
After the usual installation wizard stuff, I got yet another icon added to my task bar (see
Figure A). Clicking it gave me the menu in Figure B.
I selected Properties and got a tabbed dialog that let me control
various options, the most interesting of which is what voice I wanted my
computer to have (see
Figure C). There are several characters to
choose from, with names like Deep Douglas, Eager Eddie and Grandpa
Amos, all of whom sound like they're a few days shy of full recovery
from a laryngectomy. Peter is the default and least grating among
them-but just for fun, I selected Wanda, who sounds like a witch with
her broom in the wrong place.
Figure A
Figure B
Figure C
When you first install Voice, you get a brief tutorial that asks you
to say, "What can I say?" When you do, a list of voice commands pops up.
You are then instructed to say "Close window." No matter how many times
I did, that darned window just wouldn't go away! I kept getting the
same sequence of ToolTip messages: "Heard. Not recognized. Please speak
louder." When I yelled into the mike, I got the same sequence, sans
"Please speak louder." I pressed Alt-F4 to close the window.
I fiddled around a bit-adjusted the input volume and gain, turned off
my radio, held the mike close to my lips, and even "trained" Wanda to
recognize my voice by repeating, at her request, the digits zero through
nine plus nineteen short phrases including "Who am I?" which felt very
existential. Eventually, I got it to work.
In fact, it worked pretty darn well! I was impressed. I said, "Start
running Microsoft Word" in a normal voice and, sure thing, Voice
launched Word! (When you install Voice, it scans your entire disk for
programs and adds a "Start running X" command for every app it finds.) I
said, "File New" and it created a new document. I said, "Switch to
NDOS" and it switched to WinCIM. Well, that's OK, I can forgive Wanda
for not knowing how to pronounce NDOS. I said "Next window" several
times to cycle the windows until I got to my NDOS window. Just like
pressing Alt-Tab. Wanda was able to consistently recognize other generic
commands like "Close window," "Minimize window," "Press cancel," "Press
enter," and "Show help."
Any time you run a program, Wanda automatically adds its menu to her
repertoire, in effect turning any out-of-the-box Windows-based app into a
speech app. I tried it on my TRACEWIN program from October's C/C++
column, and I was amazed that Wanda was able to recognize "Trace output
off," "Trace output to window" and other TRACEWIN commands with no
trouble. She's a pretty good listener, actually. Even if she can't talk
too well. She had no problem recognizing my wife's voice, either-though I
thought I detected a slight hint of jealousy in her responses,
laryngectomy aside.
If you ever find yourself speechless, all you have to do is ask, "What can I say?" to get the window in
Figure D,
which lists everything you can say. I got global commands like "Show
help" and "What can I say?" as well as TRACEWIN commands like "Trace
output off."
Figure D
To check out text-to-speech, I opened my draft of this text, selected
the first paragraph, and said, "Read selection." Wanda read it
flawlessly in her raspy monotone, which by now seemed almost tolerable.
She pronounced MSJ correctly as initials, 486/66 as "four-eighty-six
slash sixty-six", lowered her voice when speaking parenthetically, and
even converted $10 to "ten dollars." I did not fail to notice, however,
that she pronounced the word "Microsoft" with suspicious clarity,
leading me to suspect a few extra "if" statements in the code; whereas
"SoundBlaster" came out like "SoudBlaster"-but then it turned out I had
in fact misspelled it exactly that way! Now I started to feel downright
uneasy-Wanda was already finding my flaws.
When I turned on keyboard commands, which let you enter text by
spelling, things started turning surreal. I said, "Pee ay you
el,"expecting to see my name, but it came out "88d." I figured that
Microsoft needed to go back to the drawing board on that one. But no, it
was my fault again; you have to use international alphabet mnemonics
like Alpha, Bravo, Charlie, and so on to Zebra. Fortunately, I have my
pilot's license, so I know that stuff by heart. I said, "Capital-papa
alpha uniform lima," pausing several seconds between each word, and,
sure enough, "Paul" typed itself magically into my doc! But when I
spelled "DiLascia," the Find dialog popped up because Wanda thought I
said "F3." Oh well, no one ever spells my name right anyway. Wanda got
it the second time, but when I reached the "s" in "DiLascia", no matter
how precisely I tried to enunciate "sierra," Wanda insisted on hearing
it as "zero." At first, I took it as an insult, but then I realized she
was just being her typical computer self, preferring digits to letters.
So I forgave her. (I think I hurt her feelings, though, because after
that she would every now and then for no apparent reason ask, via her
ToolTip window, "Is your microphone plugged in?" There was nothing wrong
with the microphone. I like to think she was just hinting that she
wanted me to say something. As Mr. Rozak says in the article, speech
engines like to hear.)
If you're wondering how well Wanda performs, well, I have to say
she's in no danger of winning any speed dictation trophies. At best, she
can handle about one command every five or ten seconds on my 486. Also,
when Wanda listens, she gobbles CPU cycles the way Arnold
Schwarzenegger gobbles roast beef sandwiches. Everything turns to
molasses. To avoid processor gridlock, you can set things up so you have
to press a key or move the mouse to the upper-left corner of your
screen to make Wanda listen.
So, what's the bottom line? Well, I definitely wouldn't use Wanda to
get any real work done unless I broke both my hands-and even then I'm
not sure it wouldn't be faster to type with my elbows. But there's
definitely some very real and impressive technology at work here.
Text-to-speech is, not surprisingly, better than speech recognition.
Maybe in another couple of years. But no matter how flawless the
technology becomes, you won't ever catch
me talking to my
computer. It seems silly. TTS seems more useful. I can see having my
computer read an article back to me, and I really like the way, even
today, dictionary and encyclopedia programs can pronounce words and
foreign place-names. And if they could just make Wanda sound a little
more like Stevie Nicks, I might not mind her occasionally asking if my
microphone is plugged in.
It sure makes for great demos, though. Just be careful whom you show
it to. Now whenever I ask my wife when dinner'll be ready she says:
"Heard. Not recognized."