How does it work?

Building a voice-enabled agent is much like designing any other goal-oriented interface, but with one key constraint: the user is not looking at a display that presents many options at once, and is not typing text to describe their needs. Users interact with voice quite differently than they interact with a graphical interface, and their expectations are different as well. These are just some of the things we must keep in mind when designing an agent and developing the conversational interaction between the agent and the user.

Let’s look at what happens in this voice interface.

The agent initiates the conversation when activated, greeting the user with a welcome phrase.

After that, the agent conducts a two-way dialog with the user until the user's goal is fulfilled.
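This greet-then-dialog flow can be sketched as a simple turn-taking loop. Everything here is a hypothetical placeholder: a real agent would take live speech input and track dialog state with an NLU model rather than a keyword check.

```python
# Sketch of the greet-then-dialog flow. The list of turns stands in for
# live speech input, and the "that's all" check is a toy stand-in for
# real dialog-state tracking.

def run_dialog(turns):
    """Greet the user, then exchange turns until the goal is met."""
    transcript = ["Agent: Welcome to Pizza Bot! How can I help you?"]
    for user_turn in turns:                       # stand-in for speech input
        transcript.append(f"User: {user_turn}")
        if "that's all" in user_turn.lower():     # toy fulfillment check
            transcript.append("Agent: Great, your order is on its way!")
            break
        transcript.append("Agent: Got it. Anything else?")
    return transcript

for line in run_dialog(["One large pepperoni", "That's all"]):
    print(line)
```

The agent always speaks first (the welcome phrase), and the loop ends only when the user's purpose is fulfilled, mirroring the two points above.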

For an understanding of what’s happening behind the scenes, let’s look at this diagram:

The NLU process, beginning with a user input and resulting in an agent response
  1. The user utters a phrase on their device, and the audio is streamed from on-device processing to cloud processing.

  2. Cloud processing begins with speech recognition: a Speech-to-Text process converts the audio into text, which is then passed to Natural Language Understanding (NLU).

  3. The NLU process classifies the user's intent, and the "AI brain" determines the correct response.

  4. The correct response is converted back to audio by a Text-to-Speech process and returned to the user.
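The four steps above can be sketched end to end in a few lines. All of the function names here are hypothetical stand-ins: a real system would call a cloud Speech-to-Text service, a trained NLU model, and a Text-to-Speech engine instead of these toy versions.

```python
# Minimal sketch of the voice pipeline: STT -> NLU -> response -> (TTS).
# Every function is a placeholder for a real cloud service.

def speech_to_text(audio: bytes) -> str:
    """Stand-in for a cloud Speech-to-Text call (step 2)."""
    # Here we pretend the audio stream decodes directly to the utterance.
    return audio.decode("utf-8")

def classify_intent(utterance: str) -> str:
    """Toy NLU step: keyword matching instead of a trained classifier (step 3)."""
    text = utterance.lower()
    if "pizza" in text or "order" in text:
        return "order_pizza"
    if any(word in text for word in ("hi", "hello", "hey")):
        return "greeting"
    return "fallback"

RESPONSES = {
    "greeting": "Welcome! What would you like to order?",
    "order_pizza": "Great, one pizza coming up. What toppings would you like?",
    "fallback": "Sorry, I didn't catch that. Could you rephrase?",
}

def handle_utterance(audio: bytes) -> str:
    """Run the pipeline from user audio to the agent's reply text."""
    utterance = speech_to_text(audio)     # step 2: Speech-to-Text
    intent = classify_intent(utterance)   # step 3: NLU classifies the intent
    response = RESPONSES[intent]          # step 3: "AI brain" picks the reply
    return response                       # step 4: would be sent to TTS

print(handle_utterance(b"I want to order a pizza"))
```

In a production agent each of these stages runs in the cloud, and the final response text goes through Text-to-Speech before the audio is played back on the user's device.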

Now that we've covered the basics of voice user experience and gained an understanding of the system, let's get started and build our pizza ordering bot.