Voice interaction differs in many ways from typing. One difference is the delocalisation of the computer: you no longer need to be physically connected, fingers on a keyboard; you just need to be heard. This has some interesting perceptual implications.
In spite of the image I have shown at the beginning of this post, you do not perceive Alexa, Siri or whatever as embodied in a device you are talking to. These entities are ubiquitous in the environment you are in: you are talking to a presence in the environment; actually, the environment has become responsive. Along the lines of the previous post, a significant improvement in the interaction would be obtained if the computer became aware of the context, i.e. if it did not just listen to what you say after the wake-up word (Alexa, Hey Google…) but kept listening and remembered what you, or others present, said in the last 30 minutes or more.

Privacy concerns have led the companies behind these voice interaction systems to assure you that their device is only waiting for the wake-up word and does not listen to anything you say before or after the command has been processed. I assume this is true (and this is the answer I gave to one of the questions posed by the journalist). However, would I, or you, be willing to trade part of my (your) privacy for a much better interaction? Couldn't we trust the assurance from those companies that what the device (software) learns about me and the ambient situation will only be used to ensure a better interaction and nothing else? Well, if I have decided to trust them when they claim the software only listens to what I say immediately after the wake-up word, why shouldn't I trust them when they say that continuous listening is used only to improve the interaction?
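The "keep listening for context" idea above can be sketched as a rolling transcript buffer: everything heard is retained for a fixed window, and a command following the wake-up word is interpreted against that recent conversation. This is a toy illustration only, not how any real assistant is implemented; the wake word and the 30-minute window are arbitrary choices taken from the discussion above.

```python
import time
from collections import deque

class ContextBuffer:
    """Keeps only utterances heard in the last `window_s` seconds,
    so a command can be interpreted against recent conversation."""

    def __init__(self, window_s=30 * 60):
        self.window_s = window_s
        self._items = deque()  # (timestamp, utterance) pairs, oldest first

    def hear(self, utterance, now=None):
        now = time.time() if now is None else now
        self._items.append((now, utterance))
        self._prune(now)

    def _prune(self, now):
        # Drop utterances older than the retention window.
        while self._items and now - self._items[0][0] > self.window_s:
            self._items.popleft()

    def context(self, now=None):
        now = time.time() if now is None else now
        self._prune(now)
        return [u for _, u in self._items]

WAKE_WORD = "alexa"  # hypothetical wake-up word for the sketch

def handle(utterance, buffer, now=None):
    """Buffer everything heard; only utterances starting with the wake
    word are treated as commands, but they carry the full recent context."""
    buffer.hear(utterance, now)
    if utterance.lower().startswith(WAKE_WORD):
        return {"command": utterance, "context": buffer.context(now)}
    return None
```

For example, a remark made a minute earlier is still available when the command arrives, while anything older than the window has been forgotten, which is exactly the trade-off between context and privacy discussed above.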
Notice that today Alexa and its siblings interact with and control a limited number of appliances in the home, but their number, and diversity, is going to increase in the future. You will be using Alexa as the voice of the home: you will be calling Alexa to make sure there is no leaking faucet, that the cat got back home, that the housemaid washed the curtains, and much more. Eventually, we will be talking directly to the environment, forgetting we are doing so through an intermediary.
I think this will be a significant departure from the way we perceive the interaction with a computer today. Interaction will be with the environment, and we are going to feel our environment as aware and responsive. Notice how I no longer talk about the home environment but, more generally, about our environment, be it the home, the office, the car, a shopping mall or a hospital. What I am saying is that the future of the interface will be shaped around ourselves. Now the header of my posts in this series should start to make sense. It is our human interface to whatever, not to a machine. And it will be tied to us, not to a specific machine, device or computer, exactly as today I am interfacing with other people using my own interface.
Voice interaction technology (ASR: Automatic Speech Recognition; NLU: Natural Language Understanding) will be instrumental in this transition, affecting our perception. The more fluent the understanding, and the more articulate the answers we receive, the more the interaction will feel like one with a human being. Technologies like sentiment analysis and affective computing will shift the perception towards a sentient being rather than a machine, getting really close to human-to-human interaction.
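To make the sentiment analysis point concrete, here is a deliberately naive, lexicon-based sketch of how an assistant's reply could adapt to the speaker's mood. Real systems use trained models, not word lists; the words and replies below are purely illustrative assumptions.

```python
# Toy sentiment lexicons; real affective computing uses trained models.
POSITIVE = {"great", "love", "happy", "thanks"}
NEGATIVE = {"angry", "hate", "terrible", "annoyed"}

def sentiment(utterance):
    """Classify an utterance by counting lexicon hits."""
    words = utterance.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def reply(utterance):
    """Pick a reply whose tone matches the detected sentiment."""
    tone = sentiment(utterance)
    if tone == "negative":
        return "I'm sorry to hear that. How can I help?"
    if tone == "positive":
        return "Glad to hear it! What next?"
    return "Okay. What would you like to do?"
```

Even this crude tone-matching hints at why such techniques nudge the perception from "machine" towards "sentient conversation partner".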