Will Synthetic Game Voices Be Viable By 2020?

TLDR: Probably not :/

But the tech is still improving rapidly. So I thought I'd have an annual (amateur) snuffle through the possibilities :)


On FDev's Interest in Synth Voices:

  • The audio team suggested synth voices were desirable back in September 2015, but that the tech wasn't there yet.




Some Current Trends:

  • Some ingenious stuff going on in academia: mimicking voices from 5 seconds of source material. And check out the 'Fictitious Speakers' section for some great dialled-up voices. (All cool, but doubtless not ready for prime time yet.)
Source: https://www.youtube.com/watch?v=0sR1rU3gLzQ&t=40s





What Tech Would ED Need?

If they were looking to voice mission NPCs and the like, probably a fair bit (I suspect)...

  • A broad spread of distinct voices. (Ideally 'dialled up' ones, leading to massive variety).
  • Passable intonation, 'prosody', pronunciation etc. (And hey, any oddities can be put down to future speech ;))
  • Low-ish generation costs on client machines. (I kinda doubt they'd go with cloud streaming, given the likely bandwidth requirements & costs).
  • Multi-language support ideally.
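
On the cloud-streaming point, here's a toy back-of-envelope in Python. Every number below is my own assumption (codec bitrate, chatter per hour, player counts), not anything from FDev; it's just a framework for how the bandwidth question pencils out:

```python
# Rough back-of-envelope for cloud-streamed NPC voice audio.
# ALL numbers below are invented assumptions for illustration.

OPUS_BITRATE_KBPS = 24       # assumed: decent-quality compressed speech
SPEECH_MIN_PER_HOUR = 10     # assumed: minutes of NPC chatter per play hour
PLAYERS = 100_000            # assumed active player base
HOURS_PER_MONTH = 20         # assumed average play time per player

bytes_per_player_month = (
    OPUS_BITRATE_KBPS * 1000 / 8      # bytes per second of speech
    * SPEECH_MIN_PER_HOUR * 60        # seconds of speech per play hour
    * HOURS_PER_MONTH
)
total_tb = bytes_per_player_month * PLAYERS / 1e12

print(f"~{bytes_per_player_month / 1e6:.0f} MB per player per month")
print(f"~{total_tb:.1f} TB of egress per month across the player base")
```

Under these made-up inputs the raw audio egress is modest; the real cost driver would presumably be running the synthesis itself server-side, per utterance, per player.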

Which kiiiiinda rules out all of the above solutions ;)

The mission video from the files also seems to use placeholder bespoke audio. (The flat tone does actually make me wonder if it could be TTS :D. But the incidental background noise makes me think probably not, unless it was test audio which had assimilated some noise. Probably a stretch :/)

Ultimately I'm guessing synth voices are still a pipe dream. (Galnet robo-speak aside). I'd still be well up for it though :D

EDIT:

Possible Option?

The MelGAN system by Lyrebird seems interesting. Can generate voices from source audio, sounds great, and... is possibly useable at runtime on domestic machines?
 
In some cases this would be viable, but voice acting is so much more than a natural sounding voice.
An AI would need to understand emotion and context to be really useful.
 

Well there are really fanciful future possibilities for sure (like having convincing audio exchanges with NPCs, comparable to Google Duplex at its best). But I think there are various steps along that path would still be fun additions in their own right. Synth-voiced NPCs in a proc gen world could be preferable to essentially silent ones, or very repetitive ones, or ones that communicate only via text etc. (And it doesn't mean that key bespoke missions couldn't be properly voiced, as now. Just the generic menu missions, any 'radiant' pop up ones, and background chatter, could benefit from it potentially).

There's loads of snazzy tech it'd be great to have to improve NPCs generally. (Context-appreciation being one of Brabes's old interests. So never know, they might have a stab at that ;))
 
Galnet audio is text to speech, just in case. It's the best I've heard that's available today (and you can buy it for your Windows PC).

Yeah mentioned it in the OP :)

It’s still too robotic for NPC use though I’d say. Even for fuzzed up ship-to-ship voice or vid comms. (Unless they realllllly fuzzed it up, using connection breaks to mask the intonation breaks and heavy static to mask the lack of personable voice timbre etc)

If you take the Google Wavenet stuff for example, it’s more personable. (And those academic samples? Damnnn). They all falter at length, but it feels like we’re not too many years from pseudo-convincing dial-up NPC banter ;)
 

Funny thing is it's a bit slurred coming out of Elite... guessing they just record a low-quality MP3 of the output. With the native engine running on your PC it sounds clearer. While it's not a human voice, if you don't listen for it, it won't remind you it's a text-to-speech engine, which is pretty good.
 

Yeah it is a good one, don’t get me wrong :)

Personally I think it’s still a bit too wooden for live character use though. It’s got some nice voice timbre in there, better than most ‘low cost’ ones for sure, but I think it needs more on that front. (To smooth over the nastier stop-start clause bridgings and bizarro syllable stresses that you get in all of the systems, to varying extents).

Possibly they’ve just gone for a very dispassionate, news-reader-y character, but the personalised ‘prosody’ of other TTSes work better for me (IE the individualised gulps, umms, sniffs, lilts, phrase bunchings and timings etc). That’s the extra secret sauce that could make up for a lot of the duff pronunciations and awkward clause clumpings.

If it could be done at run time ;)

Compare the in-game sections here:

Source: https://www.youtube.com/watch?v=P7c5JCigp-U&t=19s

With these examples:

Source: https://www.youtube.com/watch?v=0sR1rU3gLzQ&t=40s

Think the latter is closer to quality they’d want / need to hit. (Even if that’s probably not do-able right now at length & with live processing :/)
 

Think you've got the wrong game; should probably be hitting up the Star Citizen forums? They're just spending randoms' money and don't have to produce anything, so next-level text to speech is probably a good fit over there.

Yeah that would definitely be impressive. You could have a conversation with your computer.
 
Yeah, it would be great for games with lots of NPCs, especially for RPGs and Elite, where they could deliver more missions with voice... but I do feel bad for the VAs, who will probably get less work.
 
I hope so. Otherwise Galnet Audio was a huge waste of development time, there are hardly any articles now anyway.
 

Umm what? (Did I insult your girlfriend or something? :D)

Saying that the in-game Galnet is still on the robotic end of TTS, and so a hard sell for NPCs, isn’t particularly controversial.

(Maybe it’s a second language thing, or the version in your language is better, but as an English speaker the intonation / syllable stress / pronunciation in Galnet is still classically TTS clunky to my ear. Which is fine for that use case, but not much beyond that...)

I’m also saying the better versions don’t seem to be ready for commercial gaming use, so I wouldn't expect them in 2020. So I’ve no idea why you’re throwing SC shade at me ¯\_(ツ)_/¯
 

Yeah the job loss angle could be tricksy if it comes. Hopefully it’ll be a case of new job roles being created, rather than Luddite smash ;)

(Being able to create various interesting, consistent character voices is still going to be a desired skill I reckon. Even if as the ‘seed’ for replication. Plus you could imagine tech-savvy voice actors being well placed to fine tune output.)

Chances are straight audio capture will still be the main narrative form for a long long time to come though ;)
 
Ok, this is gonna make some "Elite 4" vets roll their eyes hard, but I just stumbled onto this ancient GDC talk by Brabes, and found it an intriguing window into his old areas of interest. Some of which might still be relevant today ;)

'Technology Five Years From Now' [GDC 2001]

Some excerpts of note (I've bolded the synth speech bit)

On Speech [45m07s]:

Now actor speech is inflexible. It suffers from similar issues to animations. Essentially you can't vary the tone or inflexion with context very readily on actor speech without making it sound completely bizarre. It can also only be used for one character, because we're so good at recognising individual actors. If you use it repeatedly you need to record a number of variations; if it's a phrase that is commonly [used], like "I don't understand that", it really grates. And if there's an interruption you either just cut them short or you continue their speech, and it feels very unnatural. And that's not what would happen in real life.

The other danger with this is it's a one-way traffic. Players are most definitely unable to respond in kind at the moment. They may well curse and swear at their machine but unfortunately it can't hear. And in fact one of the things that's often frustrating with a game, I think it's particularly obvious in games like Zelda, where the responses all lead to the same response anyway. It's either a 'I'll say yes now', or I'll pause and then I'll say yes.
Often the flow is single track. Now I accept it's very, very hard to open up a game world to that level of richness where you can have very, very broad tracts [?]. But it is a problem.

Now the other thing of course is that player speech with a microphone is useless without comprehension. Apart from in a multiplayer environment.

Now speech comprehension. Is that a solution? Now there are methods around for speech comprehension that work without training. And, to some extent, work in noisy environments. But they still, as far as the ones that I've seen, don't manage to come across the sort of subtle inflexions that convey sarcasm. And also, from the point of comprehension, getting sarcastic sayings that people use in common speech. People are still going to have to speak in a constricted way. And what it comes down to is that in order to do full speech comprehension, we need to solve the Turing test. Now, I think that's still a long way away...

But that isn't a reason not to do it. I mean currently we can maybe do a few hundred words convincingly with a character. But that's a lot better than a menu with three entries. And even in games, you tend to forget that you're speaking in game speak; you know, the 'Go North' of adventures was still quite compelling. It is actually an option, in terms of filling in the black areas of the map at the start. Even though it's a poor solution, it may well be better than a menu. As long as it's reliable, as long as you don't have to keep saying the same thing over and over.

Now the flip side to that is speech generation, which is also quite hard. You do need it for tailored responses, to be able to vary tone and inflexion according to the player character, where they are, what they're doing. In other words if a character starts a piece of dialogue, and the player draws a gun, you don't expect the dialogue to continue, impervious. You either expect the guy to start running or hiding. Or at the very least sound rather worried, and say: "Yes I'll open the door for you!!", or whatever.

It's again, as I said at the start, an audio parallel to real-time animations in place of pre-planned ones. And there's a lot of benefit to it that's often not talked about. I think none of us, us included... it is a frighteningly difficult problem. And the problem is that most of the solutions end up sounding like Stephen Hawking anyway.


+

On AI [51m30s]:

Most AI I see now is at best a scripted table. The problem with speech is it risks us revealing just how shallow our characters are [laughs]. And that is a real problem. Once you've got speech you've got a much higher quality of interaction, and the player sort of can get to understand what's behind the character. And at the moment they're so two dimensional, it, uh, it's hardly believable. And that dictates the styles of games. If you've got a game style where intrigue, or character relations, are practical. Where you're actually meeting characters face to face. I mean where their face might fill the screen, and you're chatting with them and negotiating with them or whatever. We've got an awful lot to work on before we can get there. An awful lot, I think.

Again we're back to the Turing test. But having said that, people know they're in a game. It doesn't have to be perfect. It just has to be better than it is now [laughs]. In order to have the sort of quality of relationship... I mean the advantage of using characters like animals, is people's expectation is actually lower. And I think, going against one of the things that I said, doing it in a fantasy environment helps as well, because you can have creatures that are presented as being less intelligent.

All a player really needs is a start. They want to believe. It's not like they're out to get us. It's just if we do glaring things, like repeatedly saying the same phrase, then people's imaginations can do the rest.

Now one of the things that I think is really important to this, is of all of the AI techniques I've seen, very few actually seriously consider the character's memories. And this is a very, very good way of getting a character to feel real. Now I don't consider a set of flags to be memories. You know, 'I met this character' or whatever. Which triggers a slightly different response, so it says: 'Hello again' or whatever. I mean that... that helps, but it's not the solution. It has huge storage implications. I mean the sort of things I have in mind is: If the character goes from their house to the town, cross a bridge, they go in to the town. They need to have a local memory of where they've been, rather than just axises pointing to the underlying map, because, let's say Mr Very Bad Person comes along and blows the bridge up, does that character immediately know in its route-finding algorithm that the bridge is blown up? And does this character go and walk across the other bridge if there's a route around? Or do they just not go home because they know the bridge is blown up? I mean that sort of thing is really a problem. Because it means essentially if you're allowing maps to be altered, every character should really, ideally have a local copy of the map. It's issues like that where you think: 'Oh yes, we do memory, we can just keep a list of known events', but essentially their journey is also a list of events.

Now without memory, lies and deceptions are very, very hard to cope with. I mean in fact lies and deception are very hard to cope with anyway, but we need to be able to have the player lying to the game, the game lying to the player. I know games have done lies, but they're done in a very, very shallow way so far, or at least the ones I've seen. And I think that's something that's worthy of a good amount of thought.


Seeing as he was so ahead of the tech in his desires then (and seeing as he seemingly tried to enact some of it in The Outsider), I wouldn't be surprised if he still has designs on some of this stuff ;)
 
Ok so Lyrebird by Descript is interesting.

As of Nov 2019 they were suggesting they could do this:

The MelGAN generator is capable of sampling at a frequency of 2500kHz on GTX1080 Ti GPU in full precision, which is more than 10x faster than the fastest competing model on the same hardware, and 100x faster than real time. Even more importantly, MelGAN is one of the few models, across all comparable ones in terms of output quality, that can afford real-time processing on a typical CPU


I’m guessing that ‘100x faster than real time’ is way more than FDev would need for a basic use case. If it could be scaled down to use less resources as a result, could this be viable on domestic computers at run time?
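
To unpack those throughput numbers a bit (the 22,050 Hz output sample rate below is my own assumption, taken from the rate the MelGAN paper uses):

```python
# Sanity check on the quoted MelGAN throughput figures.
# ASSUMPTION: output audio at 22,050 Hz (the rate used in the MelGAN paper).

GENERATION_RATE_HZ = 2_500_000   # the quoted "2500kHz" on a GTX 1080 Ti
AUDIO_SAMPLE_RATE_HZ = 22_050    # assumed output sample rate

real_time_factor = GENERATION_RATE_HZ / AUDIO_SAMPLE_RATE_HZ
print(f"Real-time factor: ~{real_time_factor:.0f}x")

# Even if a game could only spare a small slice of that throughput
# (say 2% of the GPU), one live voice stream would still keep up:
budget_factor = real_time_factor * 0.02
print(f"At a 2% compute budget: ~{budget_factor:.1f}x real time")
```

Which squares with the "100x faster than real time" claim in the quote, and is why the CPU figure is the interesting one for a game that's already saturating the GPU.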

The output is pretty damn good (skip to the audio half way down).

Individual voices are trained on 15hrs of audio, so not totally ideal, but possible to brew up a big spread perhaps? Depending on how big the storage is for each voice? (Their other services build voices from far less source material, so maybe reductions will happen there too?)

Another intriguing service is the phrase insertion tech:



Again sounds pretty great. Useful if you wanted to use bespoke voices perhaps, just inserting in daft system & faction names etc ;)

It’s not clear from their site if they’re targeting the games market, but it’s certainly intriguing tech :)
 
As a software dev, this technology is the coolest thing in the world to me. I've been spending time on my weekends trying to learn more about neural net development, to try to hit a point where I could emulate something like this, as I've wanted to expand my home automation to use voice controls similar to how HCS voicepacks control my ship in game. Eventually I'd expand it more, but simply getting feedback from a voice that doesn't sound like Wall-E would be a huge first step.

I'd give my left arm for a consumable library for this kind of thing that I could use offline, as opposed to just an online API I have to send text to and get an audio clip back.
 
Now if we only still had a Galnet that would be cool :LOL:

Maybe Galnet was just the proof of concept, never know ;)

(Seriously though, I’d be a lot less bummed about the Galnet downtime if it turned out the narrative guys were dialling up NPC dialogue for the DLC instead ;))
 
I would kind of hope NPC text-to-speech in Elite, were it to happen, to involve all template text coming with an authored phonetic companion, instead of messing with the client transcribing written language automatically. It would include markup for all sorts of things, modulating for different tones and moods, speed, strength, shout, whisper, emphasis, pauses, mumbling, and so on, and so on. Data derived from a character's seed value would slot in gender, dialect, other voice box characteristics, temperament, etc...
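
That seed-plus-markup idea maps fairly naturally onto SSML-style prosody tags. Here's a toy Python sketch of it; the function names, trait ranges, and mood handling are all invented for illustration, not any real engine's API:

```python
# Toy sketch: authored dialogue templates carrying SSML-style prosody
# markup, with per-character voice traits derived deterministically from
# a seed value. ALL names and ranges here are invented for illustration.
import random
from xml.sax.saxutils import escape

def voice_traits(seed: int) -> dict:
    """Derive stable voice characteristics from a character's seed."""
    rng = random.Random(seed)   # same seed -> same voice, every session
    return {
        "pitch": f"{rng.randint(-20, 20):+d}%",   # voice-box shift
        "rate": f"{rng.randint(85, 115)}%",       # temperament: fast/slow talker
        "volume": rng.choice(["soft", "medium", "loud"]),
    }

def render_line(template: str, seed: int, mood: str = "neutral") -> str:
    """Wrap an authored template in prosody markup for this character."""
    t = voice_traits(seed)
    if mood == "worried":       # mood markup nudges the character baseline
        t["rate"] = "120%"
    body = escape(template)
    return (f'<prosody pitch="{t["pitch"]}" rate="{t["rate"]}" '
            f'volume="{t["volume"]}">{body}</prosody>')

line = render_line("Yes, I'll open the door for you!", seed=42, mood="worried")
print(line)
```

The key property is determinism: the same character seed always yields the same voice, so nothing per-NPC needs to be stored beyond the seed the game already has.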
 