I don’t know about you, but when I am talking on the phone to a person I have never met, I often wonder what that person looks like. Does the voice point to a specific age, to some specific traits? Of course, it is generally easy to tell a woman from a man, an adult from a child… but that is basically it.
Now I have read the paper presented by researchers at MIT CSAIL to the Conference on Computer Vision and Pattern Recognition in 2019 in Long Beach, California, showing a way to create software that generates a face by listening to a voice.
You should take a look at the paper and try out the different voices to see the faces that were created to match each one. They also show the real face behind each voice for comparison (it would be nice if they provided a way to record your own voice and generate a corresponding face, so you could see what you would look like to the software).
One can immediately see that the original face differs from the generated one, but at the same time it is not that far off. The gender is captured correctly, as is the age (roughly speaking).
How is this done? Well, they are using artificial intelligence: more specifically, they trained a neural network on some 100,000 faces and their related voices. Out of that, the software developed a knack for associating certain characteristics of a voice with certain facial features. In facial recognition, software identifies a number of points in a face (20 to 40 points generally provide sufficient data to identify a face) and looks at the relative distances among these points, creating a vector pattern. The machine learning algorithm associated voices with specific patterns; on hearing a new voice, the software looked for the patterns whose associated voices had similar characteristics and worked out an average that would reasonably match the new voice.
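To make the idea concrete, here is a minimal sketch of the two steps just described: turning facial landmark points into a vector of relative distances, and averaging the face patterns of the closest-matching known voices. This is purely illustrative, not the authors’ actual method; the function names, the toy voice embeddings, and the nearest-neighbour averaging are my own assumptions.

```python
import numpy as np

def landmark_vector(points):
    """Turn facial landmark coordinates into a pattern vector of
    pairwise relative distances. `points` is an (N, 2) array of
    (x, y) landmark positions (typically 20 to 40 points)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Distance between every pair of landmarks.
    dists = [np.linalg.norm(points[i] - points[j])
             for i in range(n) for j in range(i + 1, n)]
    v = np.array(dists)
    # Normalise so the pattern does not depend on the face's scale.
    return v / np.linalg.norm(v)

def average_matching_face(voice, known_voices, known_faces, k=3):
    """Hypothetical matching step: find the k known voices closest
    to the new voice and average their face patterns."""
    gaps = [np.linalg.norm(voice - v) for v in known_voices]
    nearest = np.argsort(gaps)[:k]
    return np.mean([known_faces[i] for i in nearest], axis=0)
```

In a real system the voice would of course be represented by features learned by the network rather than a hand-made embedding, and the output would be rendered as an image rather than left as a distance vector, but the pattern-matching logic is the same.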
I found this result amusing on the one hand and amazing on the other. I do not know how my brain works to imagine a face when it hears a voice, but I suspect it goes through similar pattern matching based on experience. I wouldn’t have imagined that we could move from speech2text to speech2face!