Java Explorer - VoiceXML

Talk XML

A VoiceXML overview

VoiceXML is the new beast in the XML jungle. It has all the makings of a great enabling technology. It has the blessing of the W3C, the World Wide Web Consortium, and has captured the imagination of the industry titans in a big way. What remains to be seen is what the market captains will make of it.

For developers, nothing could be more exciting than voice-enabling the applications they built for the desktop. Now that voice technology has matured, developers need no longer wait for the market to develop. No fewer than 500 companies have invested time, money and effort in giving voice technology the boost it has sorely needed ever since the mobile phone topped the popularity charts. Currently IBM, along with a few others, is leading this pack, drafting and submitting the VoiceXML specification for W3C approval.

A VoiceXML document is like any other XML document, except that it has a defined set of mark-up tags designed for delivering voice content, much as HTML is used to deliver visual content. A voice server can parse the document and process its contents, while a voice browser (usually bundled with the server) is used to simulate user interaction.

VoiceXML is therefore an XML application conforming to the VoiceXML specification, currently at version 2.0. It is a dialog mark-up language that closely resembles HTML, the visual mark-up language of the Web. A VoiceXML document consists of one or more of the following components:

Dialog Constructs
User Input
System Output
Control flow and scripting
Voice Grammar

This is the order in which this article deals with the subject. You may like to see the official VoiceXML specification, a working draft at W3C since 24 April 2002, at: http://www.w3.org/TR/voicexml20/.

VoiceXML allows us to mark-up speech content using tags. The tags provide interaction between the human being and the computer. A voice processor, interpreting these tags, performs the following tasks:

Dialog Constructs
- Menus
- Forms
- Fields and Prompts
User Input
- Speech recognition (Automatic Speech Recognition - ASR)
- DTMF tone detection
System Output
- Text-To-Speech conversion (Speech Synthesis - TTS)
- Playing pre-recorded audio content
Control flow and scripting
- Human-computer dialog flow control using subdialogs, variables and events, together with ECMAScript (the standardized version of JavaScript).
- Telephony interface for call transfer and disconnect
Voice Grammar
- Built-in grammar
- JSGF, GSL and others

Voice applications aim to provide a conversational experience to the user. The conversation is typically of the question-and-answer kind that you normally encounter at a ticketing counter or a help desk.

Here is a typical VoiceXML document, a text file with the file extension vxml. Note the use of menu and form. In an application that relies solely on user input through voice dialogs, it is essential to provide for silences and repetition of menu prompts. The plain text is what the user hears; the rest is markup that instructs the processor. Comments delineate the different parts of the application.

Example 1.0

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE vxml PUBLIC "vxml" "">
<vxml version="1.0">

    <!-- Start menu: the caller picks an option with a key press -->
    <menu id="startMenu">
        <prompt bargein="true">Please select an option
            <break msecs="500"/>
            <enumerate/>
        </prompt>
        <choice dtmf="1" next="#checkMail"> 1 Check Mail</choice>
        <choice dtmf="0" next="#exit">Press 0 to exit</choice>

        <!-- Silence handler: repeat the prompt when there is no input -->
        <noinput>
            <prompt bargein="true">I repeat, please select an option
                <break msecs="500"/>
                <enumerate/>
            </prompt>
        </noinput>
    </menu>

    <!-- Mail check: report the mail count, then move on to the exit form -->
    <form id="checkMail">
        <block>
            You have 2 unread mails.
            <goto next="#exit"/>
        </block>
    </form>

    <!-- Exit: say good-bye and hang up -->
    <form id="exit">
        <block>Thank you for calling us. Good Bye! <disconnect/></block>
    </form>
</vxml>

Before we go into the document constituent parts, let us see how this translates into a human-computer interaction.

Computer: Please select an option [pauses for 0.5 sec, then enumerates] 1 Check Mail Press 0 to exit
Human: [Assume no input from the user]
Computer: [repeats the prompt] I repeat, please select an option [pauses for 0.5 sec, then enumerates] 1 Check Mail Press 0 to exit
Human: [Assume the user presses 1 on the telephone keypad]
Computer: You have 2 unread mails [then exits with a message] Thank you for calling us. Good Bye!

For anyone who is used to seeing XML tags, or even HTML tags for that matter, the VoiceXML document is not terribly difficult to follow. You will notice that the tag names are carefully chosen to reflect the functionality they are designed to deliver.

The document starts like any other XML file, with the XML declaration followed by a Document Type Declaration that identifies it as a VoiceXML document (by convention saved with the file extension vxml). The root element is vxml. The comment that follows marks the start of a menu. In a real application, you would of course play a welcome tune and have a pleasant voice greet the user and ask for an identification number.

A menu has an id attribute that is used to refer to it elsewhere in the document. It is also possible to launch a menu in another file by referring to it via a URL. A menu must contain options to select from, or it is not worth the name. The options are contained in choice tags, which have a next attribute. The value of the next attribute tells the processor to jump to a new location, either within the document or elsewhere on the network as specified by the URL. A choice element can also have a dtmf attribute to accept user input from a telephone keypad.

A prompt is what the word means in plain English: an aid that tells the user to do something (like pressing a digit on the telephone keypad) or to say something into the handset. The bargein attribute of the prompt tag allows the user to interrupt a prompt or a menu, and issue spoken commands or press digits on the keypad. When the user provides no input for a while (processors provide a default wait time), you may repeat your request via the noinput tag. This is how silences are processed, and it is important to handle them. Silences are a part of human speech, and must not be ignored.
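
As a minimal sketch of handling repeated silence, assuming a platform that honours the count attribute on noinput handlers (the menu id, form ids and wording are made up for illustration), the reminder can become more explicit on each successive timeout before the call is ended:

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
    <menu id="mainMenu">
        <prompt bargein="true">Say mail or news.</prompt>
        <choice next="#mail">mail</choice>
        <choice next="#news">news</choice>

        <!-- First silence: a gentle nudge, then replay the menu prompt -->
        <noinput count="1">
            Sorry, I did not hear you.
            <reprompt/>
        </noinput>

        <!-- Second silence: spell out exactly what is expected -->
        <noinput count="2">
            Please say the word mail, or the word news.
        </noinput>

        <!-- Third silence: stop asking and end the call politely -->
        <noinput count="3">
            We seem to have lost you. Good bye!
            <disconnect/>
        </noinput>
    </menu>

    <form id="mail"><block>You have no new mail.</block></form>
    <form id="news"><block>There is no news today.</block></form>
</vxml>
```

The interpreter picks the handler with the highest count that does not exceed the number of times the event has occurred, so each silence gets a different response.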

A form, like a menu, has an id attribute, and contains one or more form items such as block and field elements. A block can be given a name and marks off a unit of the user interface within a form; control can move back and forth between blocks. In addition to the name attribute, a block can have a cond attribute that enables dialog flow control.

To complete this quick and brief explanation of the sample document, we take a look at the other elements of the document:

break: You use it to provide a brief pause in the dialog. Use it judiciously - if the pause is too long, the user begins to get anxious, and if it is too short, it serves no purpose at all and only adds to processing time.

enumerate: This tag makes the processor read out the choice options in the order in which they are defined. It refers to the choice items declared in the menu in which it is embedded.

disconnect: This is a telephony tag that hangs up the call.

goto: Use it to jump to another dialog, in the same document or in another one.
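
The enumerate tag can also carry a template of its own. The following is a sketch only: it assumes the interpreter supports the menu's dtmf attribute and the _prompt and _dtmf variables inside enumerate, and the URL is hypothetical.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
    <!-- dtmf="true" asks the interpreter to assign keys 1, 2, ... to the choices in order -->
    <menu id="services" dtmf="true">
        <prompt>
            Welcome.
            <break msecs="300"/>
            <!-- _prompt is the text of a choice, _dtmf the key assigned to it -->
            <enumerate>
                For <value expr="_prompt"/>, press <value expr="_dtmf"/>.
            </enumerate>
        </prompt>
        <choice next="#mail">mail</choice>
        <!-- next may also point at a dialog in another document (hypothetical URL) -->
        <choice next="http://www.example.com/news.vxml#headlines">news</choice>
    </menu>

    <form id="mail">
        <block>You have no new mail. <disconnect/></block>
    </form>
</vxml>
```

With this template the caller would hear something along the lines of "Welcome. For mail, press 1. For news, press 2."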

Dialog Constructs
User Interface elements form the building blocks for dialog constructs. It is useful to know what elements are available and for what purpose. Each element has one or more attributes which define how the element's content is to be executed. A VoiceXML DTD specifies which elements are valid in a document, and which are their legal parents and children, since the elements belong to a containment hierarchy like any valid XML document.

Dialogs are the only means of interacting with the user in a voice-only scenario, and must be constructed with a great deal of thought and experimentation. Play out the scenarios in your mind and put them down in the form of a dialog script, then test the script extensively with different users (other than developers). The dialog script can be much like a play written for the theater, only it must be very polite, and geared completely to the needs of the user. At no point should the user be left guessing, or made anxious by prolonged silence. Silence is an inherent part of all dialogs, and must be treated with respect and handled efficiently.

Dialogs can be enhanced by the use of voice grammars. A grammar rule may restrict input to a set of allowed words so as to eliminate possible ambiguity. That is, you may provide the user with a vocabulary that is considered legal in a particular context. The VoiceXML specification provides for built-in grammars, but you may also use grammars provided by a platform-specific implementation. If you want your voice application to be compliant with the VoiceXML standard, check the documentation of the voice server you are going to use to see how much of the specification it supports.

The tags specific to dialog constructs are given in the table below, along with their attributes and functionality, and where each element belongs in the element hierarchy.

Table 1.0

Tag/Element	Attributes	May Contain	Contained By	Functionality
prompt	bargein value = true/false for interrupting a prompt. Default is true	break, enumerate, audio, div, emp, pros, sayas, value	block, catch, error, field, filled, help, if, initial, menu, noinput, nomatch, object, record, subdialog, transfer	Requests the user for input via pre-recorded audio or from text via TTS
	cond conditional expression
	count value = n activates prompt after playing n-1 prompts
	timeout time in msecs before activating noinput. Defaults to platform-specific value, if not given
menu	id unique identifier for the menu	choice, prompt, audio, catch, enumerate, error, help, noinput, nomatch, property, value	vxml	A dialog to prompt user to make a selection from a list of choices
	scope specifies extent of grammar validity. value = dialog is default, limited to the menu. value = document extends scope to all dialogs in the document & sub documents, if defined
	dtmf value = true/false associates DTMF grammar with choice elements. Default is false
form	id unique identifier for the form	block, catch, dtmf, error, field, filled, grammar, help, initial, link, noinput, nomatch, object, property, record, subdialog, transfer, var	vxml	Container for elements that enable a single dialog. Dialog can be one-way or two-way.
	scope specifies extent of grammar validity. value = dialog is default, limited to the form. value = document extends scope to all dialogs in the document & sub documents, if defined
block	name Default is undefined	assign, audio, clear, disconnect, enumerate, exit, goto, if, prompt, reprompt, return, script, submit, throw, value, var	form	Container for other elements. Used to prompt the user or process data.
	expr initial value of the block's variable; if it is defined, the block is skipped until the variable is reset with a clear element in the same form
	cond conditional execution

We have seen the usage syntax of prompt, menu, form and block in the above example. The form element deserves special treatment because it is a bit involved, and form items are processed in a particular way. A menu can be seen as a form with a single field, with its choice elements corresponding to a filled element in a form.

When processing form items, the Form Interpretation Algorithm (FIA) swings into action. The FIA has four discernible phases: an initialization phase that runs once, followed by a loop of selection, collection and processing that repeats until the form is exited.

Initialization Phase - The form is initialized when it is entered. Variables are initialized in document order, and the prompt counter of each form item is set to 1.
Selection Phase - The next form item is selected. This is the item named by a preceding goto element, if any; otherwise it is the first item in document order whose variable is still undefined and whose cond, if given, is true. If no item qualifies, the form exits.
Collection Phase - The selected item is visited: its prompts are played, its grammars are activated, and user input is collected. The input may be a spoken command or phrase, or a touch tone key press (DTMF). An event may occur instead, such as a timeout when there is no response from the user.
Process Phase - User input is stored in the field variables and any matching filled elements are executed, which is where validation can be performed. Events are processed by an event handler, such as the one for a help request.
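
The phases can be traced in a small form. This is an illustrative sketch, not taken from any specification example: the form id, field names and the limit of 6 are assumptions, and the comments describe the order in which an interpreter following the FIA would be expected to visit the items.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
    <form id="order">
        <!-- Initialization: the variables intro, tickets, confirm and bye start out undefined -->

        <!-- Selection picks this block first; executing it marks intro as done -->
        <block name="intro">Welcome to the ticket line.</block>

        <!-- Selected next because tickets is still undefined -->
        <field name="tickets" type="number">
            <!-- Collection: play the prompt and wait for speech or key presses -->
            <prompt>How many tickets would you like?</prompt>
            <noinput>Please say or key in a number. <reprompt/></noinput>
            <!-- Process: validate the answer -->
            <filled>
                <if cond="tickets > 6">
                    <prompt>Sorry, the limit is 6 tickets.</prompt>
                    <!-- Clearing the variable makes selection pick this field again -->
                    <clear namelist="tickets"/>
                </if>
            </filled>
        </field>

        <field name="confirm" type="boolean">
            <prompt>You asked for <value expr="tickets"/> tickets. Is that right?</prompt>
        </field>

        <!-- cond is checked during selection: skipped unless the caller said yes -->
        <block name="bye" cond="confirm">
            Thank you. Your order has been placed.
        </block>
    </form>
</vxml>
```

When no form item remains to be selected, the form exits.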

User Input
User input is of two kinds - spoken input and DTMF input. Sometimes both are used in a single application. DTMF stands for Dual Tone Multi-Frequency; it is the standard set of tones generated by key presses on a telephone handset.

Spoken input is handled by the Automatic Speech Recognition (ASR) engine, a part of the VoiceXML document processor. Though the technology is developing rapidly, it is far from what we desire it to be. Speech systems run complex speech recognition algorithms, and are sensitive to accents, pitch, tone and background noise. The best user interface is one that rarely needs speech, and limits it to single commands or at most to short, crisp sentences.

DTMF input is the most reliable and least intrusive user interface mechanism, and the one I personally prefer, especially when I am interacting with an application dealing with sensitive data over an open line. When using DTMF, it is advisable to keep a consistent interface model: for instance, pressing 0 always exits the application, * means next, # means previous, and so on. A consistent user interface helps the user adapt quickly to the application, and enhances the user experience.

Table 2.0

Tag/Element	Attributes	May Contain	Contained By	Functionality
choice	next jump to next dialog or document via URL	audio, break, div, emp, enumerate, grammar, pros, sayas, value	menu	A menu selection item. Choice items are played out by TTS with the enumerate element. Exactly one of next, expr, or event must be specified.
	expr expression for evaluation
	event event to throw
	dtmf alternative to spoken input, a touch tone sequence
	caching value = safe indicates content is always refreshed; value = fast implies a cached copy to use
	fetchaudio specify an audio file to play while waiting for a process to complete
	fetchhint legal values are: prefetch - load in advance; safe - load only when needed; stream - stream media from a URL
	fetchtimeout wait time before throwing an error
dtmf
record
field	name variable to hold input value	audio, catch, dtmf, enumerate, error, filled, grammar, help, link, noinput, nomatch, option, prompt, property, value	form	Collects user input via prompt and filled tags.
	expr name may be initialized with this, default is undefined
	cond conditional entry
	type holds the name of a built-in grammar - phone, number and time are examples.			The name attribute has 3 shadow variables: confidence - certainty level 0.0 to 1.0, utterance - text string transcribed from speech, inputmode - voice or dtmf mode of response
	slot for use with grammars that return slot/value pairs. variable holds the name of the slot
	modal value = true/false. false (the default) keeps all active grammars enabled during field execution; true limits recognition to the field's own grammars.
filled	namelist space de-limited list of form items under this filled element	assign, audio, clear, disconnect, enumerate, exit, goto, if, prompt, reprompt, return, script, submit, throw, value, var	field, form, object, record, subdialog, transfer	When the ASR receives a response, the execution moves to the filled element.
	mode specified when filled is at the form level; legal values: any, all. With any, the processor executes the element when any one item in namelist is filled; with all, it executes only when all items in namelist are filled

The usage syntax of the choice element with its dtmf attribute is shown in Example 1.0. We will now see how the field element is used.

Example 2.0

<form id="passForm">
    <block>Please enter your identification. </block>
    <!-- type="number" restricts input to a number, stored in the variable password -->
    <field name="password" type="number">
        <!-- Runs once the caller has answered -->
        <filled>
            <if cond="password == 123">
                <prompt>Please wait while we process your request.</prompt>
                <goto next="#startMenu"/>
            <else/>
                <prompt>
                    You pressed <value expr="password"/>
                    <break msecs="250"/>
                    Incorrect user ID.
                </prompt>
                <goto next="#exit"/>
            </if>
        </filled>
    </field>
</form>

The above snippet is a simple authentication mechanism for a voice application. The field element is embedded in a form element whose id attribute is set to passForm. The form starts with a block that prompts the user to enter the identification number. The field element then collects the user input, and its filled element branches to the next menu or to the exit depending on the value entered. You will notice that the user input is stored in the variable named by the field's name attribute; password is thus a variable that holds the user input, and is later (in the else branch) retrieved using the value element.

The field element also uses a type attribute set to number, so the user is restricted to providing a number - the equivalent of an HTML text field that accepts only numeric input. The type attribute is limited to a few built-in values. A field that does not use the type attribute must specify a grammar element. Herein lies the rich functionality associated with the external grammars that power speech recognition engines. The use of this approach is covered in another article: The Speaking Web.

The filled element executes once the user has responded, either through speech or through a DTMF touch tone sequence as in the example above. A conditional expression evaluates the user response and executes the appropriate prompt element. If the input is spoken, then the confidence level of the speech recognizer can be checked as shown below:
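
As a rough sketch of a field that supplies its own grammar instead of a built-in type, the following assumes a platform that accepts inline JSGF under the media type application/x-jsgf; both the media type and the inline syntax vary between voice servers, and the form and field names are made up.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
    <form id="pickFolder">
        <field name="folder">
            <prompt>Which folder: inbox, drafts or sent?</prompt>
            <!-- Only these three words are legal answers (inline JSGF, platform-dependent) -->
            <grammar type="application/x-jsgf">
                inbox | drafts | sent
            </grammar>
            <!-- Anything outside the grammar raises nomatch -->
            <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
            <filled>
                <prompt>Opening <value expr="folder"/>.</prompt>
            </filled>
        </field>
    </form>
</vxml>
```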

Example 2.1
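
A minimal sketch of such a check, written as a fragment that reuses the password field of Example 2.0 and the startMenu dialog of Example 1.0; the threshold of 0.7 is an arbitrary assumption, not a recommended value.

```xml
<form id="passForm">
    <field name="password" type="number">
        <prompt>Please say or key in your identification number.</prompt>
        <filled>
            <!-- password$.confidence is the recognizer's certainty, from 0.0 to 1.0 -->
            <if cond="password$.confidence >= 0.7">
                <prompt>Thank you.</prompt>
                <goto next="#startMenu"/>
            <else/>
                <prompt>I am not sure I heard you correctly.</prompt>
                <!-- Reset the field so that the question is asked again -->
                <clear namelist="password"/>
            </if>
        </filled>
    </field>
</form>
```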

The password variable of the field element in Example 2.0 has three shadow variables: confidence, utterance and inputmode (see the field element in Table 2.0). In Example 2.1, a shadow variable is read by appending the $ sign and the shadow variable's name to the field's name, as in password$.confidence.

System Output
The output of a voice system is obviously voice. It may be synthesized speech or an audio file. TTS engines deliver synthesized speech from text, and are the stuff behind voice portals and the speaking web. With VoiceXML, it is possible to mark up text so you can deliver it in the most natural way you can think of - adding nuances and inflexions, manipulating pitch, tone and cadence, and much more. TTS has evolved admirably, and is currently built into many systems on the web. Incidentally, it is showing a lot of promise as an email reader. There are a few companies out there who are doing it, including the company where this writer is working.
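
As a small sketch of both kinds of output in one prompt: the audio file name welcome.wav is hypothetical, and the example assumes the processor falls back to the enclosed text when the recording cannot be played.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
    <form id="greeting">
        <block>
            <prompt>
                <!-- Pre-recorded audio; the enclosed text is spoken by TTS as a fallback -->
                <audio src="welcome.wav">Welcome to the mail service.</audio>
                <break msecs="300"/>
                <!-- emp asks the TTS engine to stress the enclosed word -->
                You have <emp>two</emp> unread messages.
            </prompt>
        </block>
    </form>
</vxml>
```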

Tag/Element	Attributes	May Contain	Functionality
audio	src

Control flow and scripting
Dialog flow control is achieved in two ways - using flow elements like the goto tag, and through embedded ECMAScript code - and usually with a mix of both. Event handling is also part of it, and we will see examples of simple event handlers.
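
The sketch below mixes the two approaches. It is illustrative only: the variable silences, the function tooMany and the dialog ids are made-up names, and the limit of three silences is an assumption.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
    <!-- Document-level variable and an ECMAScript helper function -->
    <var name="silences" expr="0"/>
    <script>
    <![CDATA[
        function tooMany(n) { return n >= 3; }
    ]]>
    </script>

    <form id="askPin">
        <field name="pin" type="digits">
            <prompt>Please key in your PIN.</prompt>

            <!-- Event handler for a help request -->
            <help>Use the telephone keypad to enter your PIN. <reprompt/></help>

            <!-- Event handler for silence: count it, and give up after the third time -->
            <noinput>
                <assign name="silences" expr="silences + 1"/>
                <if cond="tooMany(silences)">
                    <goto next="#giveUp"/>
                </if>
                I did not hear anything.
                <reprompt/>
            </noinput>

            <filled>
                <goto next="#done"/>
            </filled>
        </field>
    </form>

    <form id="done"><block>Thank you.</block></form>
    <form id="giveUp"><block>Good bye! <disconnect/></block></form>
</vxml>
```

The goto elements move control between dialogs, while the script function keeps the decision logic in one place where it can be reused by other handlers.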

Tag/Element	Attributes	May Contain	Functionality
goto	next
subdialog
script
submit

Environment & Resources
Finally, we take a look at the development environment for VoiceXML applications. We will also show links to the web's repositories of information related to VoiceXML applications and tutorials.

... to be continued ...