

Computers perform facial recognition by selecting specific points in a face and determining the ratios of the distances among them. The faces in the top row are real people, with dots indicating reference points on the face. The faces in the second row were created by AI-based software trained on how faces relate to speech. As one can see, the faces generated by the software are not identical to the originals, but there are clear similarities. Image credit: MIT CSAIL

I don’t know about you, but as I am talking on the phone to a person I have never met, I often wonder what that person looks like. Does the voice point to a specific age, to some specific traits? Of course, it is generally easy to tell a woman from a man, or a grown-up from a child… but that is basically it.

Now I have read the paper presented by researchers at the MIT CSAIL at the Conference on Computer Vision and Pattern Recognition in 2019 in Long Beach, California, showing a way to create software that generates a face by listening to a voice.

You should take a look at the paper and try out different voices to see the faces that have been created to match each voice. They also show the real face behind each voice for comparison (it would be nice if they provided a way to capture your own voice and create a corresponding face, so that you could see how you would look to the software).

One can immediately see that the original face differs from the one created, but at the same time it is not that far off. The gender is captured correctly, as is the age (roughly… speaking).

How is this done? Well, they are using artificial intelligence: more specifically, they trained a neural network on some 100,000 faces and their related voices. From that, the software developed a knack for associating certain characteristics of a voice with certain facial features. In facial recognition, the software identifies a number of points in a face (20 to 40 points generally provide sufficient data to identify a face) and looks at the relative distances among these points, creating a vector pattern. What the machine-learning algorithm did was associate voices with specific patterns, so that on hearing a new voice the software could look for the patterns of faces with similar voice characteristics and work out an average that would reasonably match that voice.
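To get a feel for the "vector pattern" idea, here is a minimal sketch (not the actual Speech2Face code, just an illustration of the general technique): given a handful of landmark points on a face, compute the pairwise distances and normalize them so that only the relative distances, the ratios, matter. The function name and the toy coordinates are my own invention.

```python
from itertools import combinations
import math

def face_vector(landmarks):
    """Turn a set of facial landmark points into a pattern of
    relative distances, independent of the face's overall size."""
    # Euclidean distance between every pair of landmarks.
    dists = [math.dist(p, q) for p, q in combinations(landmarks, 2)]
    # Divide by the largest distance so only the ratios remain.
    largest = max(dists)
    return [d / largest for d in dists]

# Two toy "faces": the second is the first scaled up by a factor
# of two, so their relative-distance patterns come out the same.
face_a = [(0, 0), (4, 0), (2, 3), (2, 5)]
face_b = [(x * 2, y * 2) for x, y in face_a]

print(face_vector(face_a) == face_vector(face_b))  # True
```

Two photos of the same face at different distances from the camera thus map to the same pattern, which is why ratios, rather than absolute distances, are what the software compares.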

I found this result amusing, on the one hand, and amazing on the other. I do not know how my brain works when it imagines a face on hearing a voice, but I suspect it goes through a similar pattern matching based on experience. I wouldn’t have imagined that we could move from speech2text to speech2face!

About Roberto Saracco

Roberto Saracco fell in love with technology and its implications a long time ago. His background is in math and computer science. Until April 2017 he led the EIT Digital Italian Node, and then he was head of the Industrial Doctoral School of EIT Digital up to September 2018. Previously, up to December 2011, he was the Director of the Telecom Italia Future Centre in Venice, looking at the interplay of technology evolution, economics and society. At the turn of the century he led a World Bank-Infodev project to stimulate entrepreneurship in Latin America. He is a senior member of IEEE, where he leads the Industry Advisory Board within the Future Directions Committee and co-chairs the Digital Reality Initiative. He teaches a Master's course on Technology Forecasting and Market Impact at the University of Trento. He has published over 100 papers in journals and magazines and 14 books.