Toward natural conversation with devices

Behind-the-scenes look

At the beginning of 2017, we presented our initial prototype based on SiriKit. These first steps revealed both the current technological limitations and the enormous potential of conversational interfaces. To overcome the limitations of SiriKit, we chose Android as the platform for a further prototype. In this article, we provide a glimpse behind the scenes and share our experience with its development.

We are convinced that, in addition to touch and text, speech will play an increasingly important role as an input channel. The latest developments in speech-to-text, platforms for natural-language processing, and software with cognitive abilities work in our favor. One reason for this rapid development is that user data forms the essential foundation of many companies' existing business models. At the same time, speech interfaces open up new business areas, forcing companies to act so they are not left behind as customer behavior shifts with the growing prevalence of speech interfaces.

While developing the timetable app on iOS / SiriKit, we found that SiriKit currently supports only a very limited number of domains and that its speech interface is itself quite restricted. We therefore decided to develop a second version of the timetable app on Android. Our experience and results are described in the following sections. As you can see in the video, this second prototype is a big step forward:


The speech-based timetable app consists essentially of four components. The first is the mobile app itself, which is installed on the smartphone and interacts with the user. It is responsible for speech recognition and for answering user questions or displaying results, in our case timetable information. Speech is transcribed using Google's Speech API; Android's TextToSpeech engine produces the spoken answers.

The actual analysis of the query, for example determining the departure and arrival locations from the transcribed sentence "When is the next train from Zurich to Bern?", is not done by the mobile app itself but by a second component that runs entirely in the cloud. For this second component we currently use Google's API.AI platform. Several other platforms offer similar functionality, such as IBM's Watson or Microsoft's LUIS.
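As a rough sketch of this hand-off: the mobile app can send the transcribed sentence to API.AI's v1 REST interface (the `/v1/query` endpoint, authenticated with the agent's client access token). The class name, session ID, and token handling below are illustrative placeholders, not our production code:

```java
// Sketch: building the JSON body for an API.AI /v1/query request.
// Endpoint and field names follow API.AI's v1 REST API; the client
// access token would come from the agent's settings page.
public class ApiAiQuery {

    static final String QUERY_URL = "https://api.api.ai/v1/query?v=20150910";

    // Build the JSON payload for one transcribed user utterance.
    public static String buildQueryBody(String transcript, String sessionId) {
        return String.format(
            "{\"query\":\"%s\",\"lang\":\"en\",\"sessionId\":\"%s\"}",
            transcript.replace("\"", "\\\""), sessionId);
    }

    public static void main(String[] args) {
        String body = buildQueryBody(
            "When is the next train from Zurich to Bern?", "demo-session");
        System.out.println(body);
        // This body would be POSTed to QUERY_URL with an
        // "Authorization: Bearer <client access token>" header.
    }
}
```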

Within API.AI, one defines a so-called agent that analyzes the transcribed text, that is, identifies the relevant entities (railway stops in our case). The agent can be trained to identify entities in a wide variety of natural sentences. This makes it possible to correctly recognize not only "from Bern to Zurich" but also more complex sentences such as "Hello, I have to be in Zurich by 7 p.m. When do I need to catch the train in Bern?" The user shouldn't feel forced to adapt to the device; the goal is a conversation with the smartphone that is as natural as possible.
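For the complex sentence above, a successful analysis by the agent comes back to the caller as JSON. The overall shape follows API.AI's v1 response format, but the action and parameter names here are our own illustrative assumptions:

```json
{
  "result": {
    "resolvedQuery": "Hello, I have to be in Zurich by 7 p.m. When do I need to catch the train in Bern?",
    "action": "find_connection",
    "parameters": {
      "departure": "Bern",
      "destination": "Zurich",
      "arrival_time": "19:00"
    }
  }
}
```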

Once the departure and arrival locations are determined, we can search for the corresponding timetable information. For this we use Opendata Transport, yet another component. Since API.AI and Opendata Transport cannot communicate with each other directly, we introduced a third component as a middle layer between them: a Java-based web application we developed ourselves, which also runs in the cloud. Initially, it was only responsible for the mapping between API.AI and Opendata Transport and for generating answers in text form that sound as natural as possible. It now also stores the conversations and supplies additional context information.
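The mapping role of the web application can be sketched in a few lines: it turns the entities extracted by API.AI into a request against Opendata Transport's public `/v1/connections` endpoint and phrases the result as a natural sentence. Only the endpoint itself comes from Opendata Transport; the class and method names are illustrative, not our actual code:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch of the middle layer: map extracted entities to an Opendata
// Transport query and phrase the result as a natural sentence.
public class TimetableMapper {

    // Build the Opendata Transport request for a connection search.
    public static String buildConnectionsUrl(String from, String to) {
        return "https://transport.opendata.ch/v1/connections?from="
            + URLEncoder.encode(from, StandardCharsets.UTF_8)
            + "&to=" + URLEncoder.encode(to, StandardCharsets.UTF_8);
    }

    // Turn a found departure time into a natural-sounding answer.
    public static String phraseAnswer(String from, String to, String departure) {
        return String.format("The next train from %s to %s leaves at %s.",
            from, to, departure);
    }
}
```

A real implementation would of course fetch the URL, parse the JSON response, and pick the first connection before phrasing the answer.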

As already mentioned, Opendata Transport is the fourth component in our architecture; it delivers the necessary routing and timetable information. Its use is interchangeable in our web application, so one could switch to an alternative timetable provider if necessary.
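This interchangeability boils down to a thin abstraction: the web application talks to timetable data only through an interface. A minimal sketch (all names are illustrative, not taken from the actual code base):

```java
// The web application depends only on this interface, so Opendata
// Transport could be swapped for another timetable source.
interface TimetableProvider {
    // Departure time of the next connection, e.g. "17:32".
    String nextDeparture(String from, String to);
}

// Stand-in implementation with canned data; a real Opendata Transport
// implementation would call https://transport.opendata.ch/v1/connections.
class CannedTimetableProvider implements TimetableProvider {
    @Override
    public String nextDeparture(String from, String to) {
        return "17:32";
    }
}
```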

Speech recognition and text-to-speech function at least partially without an Internet connection. However, an Internet connection is generally necessary for applications that interact with other providers/interfaces or rely on up-to-date information.

Next steps

Development of platforms such as API.AI is advancing quickly and opening up new opportunities for interacting with the user via speech interfaces. Even if speech interfaces never completely replace classic touch and text entry methods, this and other prototypes show that speech-activated assistants can be a tangible and audible asset in various everyday situations and can dramatically enhance the user experience. But convenience isn't the only factor improving acceptance. Another important factor is the possibility that a person could one day hold a natural conversation with a device. In order to achieve this, we want to improve in the following areas:

  • Simultaneous processing of multiple domains, including the ability to deal with out-of-context phrases irrelevant to the query at hand.
  • Support for all languages and their dialects.
  • Always-on listening, so that the device reliably recognizes when it is being addressed or can make a contribution.
  • Recognition of and differentiation between multiple speakers.

A natural conversation is much more than the transcription of speech to text and its processing. In the future, we want people to be able to communicate with their devices naturally. We aren't there yet, but we're well on our way.
