The Verge, in collaboration with Nuance, the foremost leader in text-to-speech systems for products like Google Voice and Apple's Siri, has created this wonderful mini-documentary about how text-to-speech actually works.
It's well worth watching ...oh, and you get to see Siri in real life:
What they do is pretty amazing. But if you are like me, you also quickly see that it isn't the future. This is text-to-speech 1.0 ... and the next-generation services will be nothing like it.
Let me explain:
Think about 3D for a moment. As you might know, there are two ways to simulate a 3D object. There is the rather suboptimal way in which you take pictures of an object from several different sides and then use a computer to stitch those images together to simulate a 3D shape.
This can be pretty amazing, like in this video:
The problem with this approach, however, is that you are limiting the quality and flexibility of your output by your input. In other words, you can only create variations of what already exists ... or nuances.
This is exactly what Nuance is doing with text-to-speech. While they are doing a fantastic job, they are being limited by the input of voices. They are creating speech by taking existing recordings, chopping them up into tiny sound bites, and then putting them back together again.
And, it's pretty good.
But, what if you wanted Siri to sound more like your girlfriend ... or a celebrity? Or what if you wanted it to use the same voice as the person in the marvelous audiobook you just listened to? Well, then you are out of luck. Nuance can't give you a different voice. They are limited to only the voice that they have recorded. They are limited by their input.
This also means that in order for Nuance to get the maximum amount of flexibility, they need to do two things. First, the voice actor needs to put in a ton of work providing them with as many alternatives as possible. But at the same time, she also has to sound the same with every sentence. This is why, when she is reading different sentences aloud, she sounds monotone and 'average'.
Sure, she is doing a good job, but the format dictates monotony. You wouldn't want to listen to an exciting audiobook using a voice so devoid of feeling and excitement. The very way text-to-speech works is what makes her sound... flat. If she spoke with excitement, the bits wouldn't fit, and it would sound even worse.
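The cut-and-reassemble approach described above can be sketched in a few lines of code. This is only a toy illustration, not Nuance's actual method: the "recordings" are made-up sample arrays, and the unit names are hypothetical. The point is the limitation: the output can only ever contain sounds that already exist in the database.

```python
# Toy sketch of concatenative synthesis: speech is assembled by
# looking up prerecorded snippets and joining them end to end.
# All data below is invented purely for illustration.

# A hypothetical "unit database": each sound unit maps to a short
# list of audio samples captured from a voice actor's recordings.
UNIT_DATABASE = {
    "h-e": [0.1, 0.3, 0.2],
    "e-l": [0.4, 0.1],
    "l-o": [0.2, 0.5, 0.3],
}

def synthesize(units):
    """Concatenate prerecorded units into one waveform.

    A unit missing from the database cannot be produced at all --
    the system is defined by its input.
    """
    samples = []
    for unit in units:
        if unit not in UNIT_DATABASE:
            raise KeyError(f"no recording for unit {unit!r}")
        samples.extend(UNIT_DATABASE[unit])
    return samples

waveform = synthesize(["h-e", "e-l", "l-o"])
print(len(waveform))  # 8 samples: the sum of the three snippets
```

Asking this system for a voice it never recorded fails outright, which is exactly the "limited by their input" problem.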
So back to 3D. The other way to create 3D is to simply build it ... from scratch. This way you are not limited by your input, but can create anything you want. You can even create objects that have no basis in real things.
In terms of 3D, this gives you an entirely new playing field. Not only can you now create whatever you want, you can also do it at a level of quality and flexibility that puts it far beyond anything you could do with image stitching.
You can even make it look so real that you can't tell the difference.
And this is the future of text-to-speech 2.0. It will not be based on mixing existing sound files, but on synthesizing voices from scratch. We will no longer be playing back speech files; we will create them ... build them.
And with this, we can venture into a whole new field in which text-to-speech becomes indistinguishable from the real thing, because it is real. It is made the same way you and I make speech in our throats, by creating sounds at different modulations in real time.
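The generative idea can also be sketched in miniature. This is nothing like a real vocal-tract model; a bare sine oscillator with a vibrato-like pitch wobble stands in for one, and every parameter here is illustrative. What matters is that each sample is computed rather than played back, so any pitch, pace, or timbre is reachable.

```python
import math

# Toy sketch of generative synthesis: instead of retrieving
# recordings, samples are calculated from a model in real time.

SAMPLE_RATE = 16_000  # samples per second

def generate_tone(base_freq, vibrato_freq, vibrato_depth, duration):
    """Compute a waveform sample by sample.

    The pitch drifts around base_freq the way a voice naturally
    wobbles. Nothing here is limited to what was ever recorded.
    """
    samples = []
    phase = 0.0
    for n in range(int(duration * SAMPLE_RATE)):
        t = n / SAMPLE_RATE
        # Instantaneous frequency shifts with the vibrato.
        freq = base_freq + vibrato_depth * math.sin(
            2 * math.pi * vibrato_freq * t)
        phase += 2 * math.pi * freq / SAMPLE_RATE
        samples.append(math.sin(phase))
    return samples

tone = generate_tone(base_freq=220.0, vibrato_freq=5.0,
                     vibrato_depth=8.0, duration=0.5)
print(len(tone))  # 8000 samples for half a second of audio
```

Change the parameters and you get a different "voice" for free, with no new recordings needed; that flexibility is the whole promise of building speech rather than assembling it.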
And once you start to think of the future of voice this way, you also start to realize that what Nuance is doing today is extremely limited. Even as they improve their technology and the level of detail and nuances in their recordings, they will always be limited by what they already have. They are defined by their input.
In the coming years, they will probably get a lot better, but I don't think they will ever reach the point of quality that we are used to when listening to audiobooks done by real people.
But the next generation of text-to-speech systems will. They will create speech, not simulate it.
So who will make this?
In the media industry, we often talk about who will make the next newspaper, or who will make the next social network. The answer is almost always "someone else".
The same is true in this case. Nuance, being the world leader in text-to-speech, should also be the foremost candidate for developing the future. They already have an amazing head start. They already have the market. They already have the linguists and the amazingly talented people whom they need to make it happen.
But my prediction is that they won't. When you see the video, you also quickly realize that every person in it is defined by the old way of doing things. Their entire livelihood, their jobs and their roles are defined by their existing concept of manipulating voice files.
While they probably see the future as clearly as you and I, the cultural friction caused by sticking to what they have today will probably force them into the same situation that we saw with Kodak and, more recently, Reuters.
The future of text-to-speech is most likely going to come from an outsider. And it's exactly the same thing we are currently seeing in the media industry. The trend lines are clear, but they are not actually happening within the existing industries.
In any case, the future of voice is going to be amazing. Just imagine what we could do if voice could be generated with the same quality and flexibility as 3D. Think of how that would influence how both brands and media communicate with people.
Voice today is mass-market based. But in the future, it will be just like any other form of media: target-based.
Founder, media analyst, author, and publisher.
"Thomas Baekdal is one of Scandinavia's most sought-after experts in the digitization of media companies. He has made himself known for his analysis of how digitization has changed the way we consume media."
Swedish business magazine, Resumé