Java Explorer - The Voice Web

The Speaking Web

With the advent of voice technology that promises to bring the World Wide Web to your telephone and mobile phone, we can say that the web can now speak.

Today one hears a lot about voice portals, which together with voice browsers let the user talk to the computer, that is, issue spoken commands and receive audio content, in addition to the regular visual fare available on the desktop computer. The same audio content is accessible from mobile devices such as PDAs and mobile phones, and even from land lines, with no extra effort, infrastructure or cost.

In this article we will focus on voice applications, with particular emphasis on the speech aspects of VoiceXML, the XML mark-up standard created especially for the vocalization of the web. [ I intend to introduce VoiceXML soon, so keep looking in the <?XML?> category. ]

While technology innovators like IBM tout a great many applications in the voice area, one application that web developers sorely need is access to mail over regular phone lines. I don't mean voice mail, but a voice interface to regular text mail. This could be a killer app on the web, if handled right.

Imagine a scene like this one. You are on the move and you don't have access to a computer; you may have a laptop but not an Internet connection. You of course have your mobile phone alerting you about new mail from providers like Yahoo! But you cannot open your mailbox unless you have your desktop or laptop. Never mind the computer; you can still connect to your mailbox.

From your mobile phone you dial a special number given to you by your local Internet Service Provider (ISP). You hear a welcome song, and presently a melodious voice asks you to identify yourself. Once you are authenticated, you hear a voice menu that plays the following options to choose from.

Check mail
Hear mail subject line
Read mail
Reply to mail
Delete mail
Move to a folder on the host computer
Exit

You will of course do as prompted and, when finished, quit gracefully without once missing your laptop or desktop. All this at the price of an ordinary local telephone call, and without the hassle of a computer!

This is possible because of two vital technologies: TTS and SRS. The voice application has at its core a Text-To-Speech (TTS) converter and a Speech Recognition System (SRS). While the TTS engines now available are fairly advanced, SRS is not far behind. These technologies are bundled as voice servers, and form the backbone of the voice portals that communicate with voice browsers and hand-held devices.

The content providers can now provide a voice interface to the web applications, and can therefore reach a wider client base by including those who do not have a computer but have access to a telephone. This is the stuff of voice portals, and can be easily extended to include services such as web call centers, in addition to the scenario outlined above.

From the programmer's point of view, nothing could be simpler, since you don't have to know a thing about speech technologies. All you need to know is a mark-up language called VoiceXML and an accompanying speech grammar format such as the Java Speech Grammar Format (JSGF) or the Nuance Grammar Specification Language (GSL).

We will now look at these voice application builder's tools: VoiceXML and JSGF.

This article does not explain VoiceXML in detail, nor does it cover JSGF sufficiently for a developer to start off on his or her own. It attempts, however, to bring these technologies to the notice of the developer, and can therefore be regarded as a quick-starter and nothing more.

VoiceXML, like XHTML, is an XML application. It contains tags for rendering text as speech and recognizing speech as text, and provides built-in functions for some routine tasks. VoiceXML is an industry standard, endorsed by the World Wide Web Consortium (W3C), and, according to an IBM white paper, it is defined and promoted by an industry forum, the VoiceXML Forum, founded by AT&T, IBM, Lucent and Motorola and supported by around 500 member companies.

What are the features of VoiceXML?
A VoiceXML document is an XML document that has tags to describe:

telephone controls like dial-up, disconnect, tone detection, key presses
spoken commands
dialog flow control

A VoiceXML document consists of forms and menus that the developer uses to retrieve information from the user or to offer prompts as options. These form the dialog controls in a voice application. Together with a grammar and scripting elements, a voice application provides everything a developer works with in a regular desktop user interface. The grammar describes the words and phrases the application will recognize, while the scripting provides the processing capability. Naturally, we have events, variables, and flow controls.

Here is a VoiceXML document that captures some of the features mentioned above.

<?xml version="1.0"?>
<vxml version="1.0">
   <menu id="quiz">
      <prompt>Name the first planet of the solar system.<enumerate/>
      </prompt>
      <choice next="#wrong">Venus</choice>
      <choice next="#wrong">Earth</choice>
      <choice next="#right">Mercury</choice>
      <choice next="#wrong">Jupiter</choice>
      <noinput>I repeat, name the first planet of the solar system.<enumerate/>
      </noinput>
   </menu>
   <form id="right">
      <block>
         Congratulations, that is the right answer.<disconnect/>
      </block>
   </form>
   <form id="wrong">
      <block>
         OOPS! You are out.<disconnect/>
      </block>
   </form>
</vxml>

It is a simple questionnaire that you play out to the user. The question is a prompt, and each candidate answer is a choice tag whose next attribute names the form to jump to. The enumerate tag inside the prompt reads the choices out in order. A menu tag encloses the prompt and the choices, and its noinput tag says what to play when the user stays silent. A form handles the answer, and disconnects after playing the appropriate message. The document is rooted in the vxml tag, and the usual XML declaration precedes all VoiceXML documents.
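Because a VoiceXML document is ordinary XML, its structure can be inspected with any XML parser. The following is a minimal sketch, assuming only the DOM parser that ships with the JDK; the class name VxmlStructureSketch and its helper methods are made up for illustration and belong to no VoiceXML toolkit. It loads the quiz document shown above, reports the root tag, and lists each choice with the form it points to.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class for illustration; not part of any VoiceXML toolkit.
public class VxmlStructureSketch {

    // The quiz document from this article, embedded as a string.
    static final String QUIZ =
          "<?xml version=\"1.0\"?>"
        + "<vxml version=\"1.0\">"
        + "<menu id=\"quiz\">"
        + "<prompt>Name the first planet of the solar system.<enumerate/></prompt>"
        + "<choice next=\"#wrong\">Venus</choice>"
        + "<choice next=\"#wrong\">Earth</choice>"
        + "<choice next=\"#right\">Mercury</choice>"
        + "<choice next=\"#wrong\">Jupiter</choice>"
        + "<noinput>I repeat, name the first planet of the solar system.<enumerate/></noinput>"
        + "</menu>"
        + "<form id=\"right\"><block>Congratulations, that is the right answer.<disconnect/></block></form>"
        + "<form id=\"wrong\"><block>OOPS! You are out.<disconnect/></block></form>"
        + "</vxml>";

    // Parses the text with the JDK's DOM parser.
    // Any parsing failure is rethrown as an unchecked exception.
    static Document parse(String vxml) {
        try {
            byte[] bytes = vxml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the document as XML", e);
        }
    }

    // Returns the name of the root tag, which should be "vxml".
    static String rootName(String vxml) {
        return parse(vxml).getDocumentElement().getTagName();
    }

    // Maps each choice's spoken text to the value of its next attribute,
    // in document order.
    static Map<String, String> choices(String vxml) {
        Map<String, String> result = new LinkedHashMap<>();
        NodeList nodes = parse(vxml).getElementsByTagName("choice");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element choice = (Element) nodes.item(i);
            result.put(choice.getTextContent().trim(), choice.getAttribute("next"));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("root: " + rootName(QUIZ));
        choices(QUIZ).forEach((word, target) -> System.out.println(word + " -> " + target));
    }
}
```

Running main prints the root tag followed by one line per choice, so it is easy to see that only Mercury leads to the form with the id right. A check like this only confirms that the mark-up is well-formed; it says nothing about how a particular voice server will interpret it.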

A voice server processes this document to prompt the user with a question, then waits for the response before processing further. When the document is run, the following human-computer interaction ensues:

Computer: Name the first planet of the solar system. Venus, Earth, Mercury, Jupiter.
Human : { remains silent }
Computer: I repeat, name the first planet of the solar system. Venus, Earth, Mercury, Jupiter.
Human : Mercury
Computer: Congratulations, that is the right answer.
Computer: { disconnects }
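To make the flow behind this exchange concrete, here is a rough Java simulation of what the voice server does with the menu. It is only a sketch under stated assumptions: the class QuizDialogSketch and its methods are invented for illustration, recognized speech is passed in as plain strings, silence is represented by null, and an utterance that matches no choice is simply ignored, because the sample document defines no handler for that case (a real platform applies its own default).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of the quiz menu; a real voice server does this work.
public class QuizDialogSketch {

    static final String QUESTION = "Name the first planet of the solar system.";

    // Spoken choice -> id of the form to jump to, in document order.
    static final Map<String, String> CHOICES = new LinkedHashMap<>();
    // Form id -> message played before the call is disconnected.
    static final Map<String, String> FORMS = new LinkedHashMap<>();

    static {
        CHOICES.put("Venus", "wrong");
        CHOICES.put("Earth", "wrong");
        CHOICES.put("Mercury", "right");
        CHOICES.put("Jupiter", "wrong");
        FORMS.put("right", "Congratulations, that is the right answer.");
        FORMS.put("wrong", "OOPS! You are out.");
    }

    // Mimics the enumerate tag: reads the choices out in order.
    static String enumerate() {
        return String.join(", ", CHOICES.keySet()) + ".";
    }

    // Returns the target form id for an utterance, or null if nothing matches.
    static String targetFor(String utterance) {
        for (Map.Entry<String, String> entry : CHOICES.entrySet()) {
            if (entry.getKey().equalsIgnoreCase(utterance.trim())) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Plays the dialog against a list of listening turns (null means silence)
    // and returns everything the computer says, in order.
    static List<String> run(List<String> heard) {
        List<String> spoken = new ArrayList<>();
        spoken.add(QUESTION + " " + enumerate());
        for (String utterance : heard) {
            if (utterance == null) {
                // The noinput handler: repeat the question and the choices.
                spoken.add("I repeat, name the first planet of the solar system. " + enumerate());
                continue;
            }
            String target = targetFor(utterance);
            if (target != null) {
                // A choice matched: play the form's message, then disconnect.
                spoken.add(FORMS.get(target));
                break;
            }
            // Assumption: unmatched speech is ignored and the menu keeps listening.
        }
        return spoken;
    }

    public static void main(String[] args) {
        for (String line : run(Arrays.asList(null, "Mercury"))) {
            System.out.println("Computer: " + line);
        }
    }
}
```

Feeding the simulation one silent turn followed by "Mercury" reproduces the three computer lines of the exchange above. Keeping the choices in an insertion-ordered map is what lets enumerate read them back in document order.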

We will explain more tags as we go along, but first we must look a little at voice grammars, for they enhance the user experience in ways that are now commonplace in web interaction.

... to be continued ...