Proposal Discussion: Radio Stations in Elite Dangerous

Would you like to have dedicated Elite radio stations available in your spaceship?

  • Yes, that SOUNDS good.

    Votes: 105 84.7%
  • No, I dislike it.

    Votes: 19 15.3%

  • Total voters
    124
  • Poll closed.
Lots of interesting stuff mentioning sampled phrases in sports-game commentary and AIML deficiencies, amongst other things.

I suppose I'm talking about a halfway house between the two.

AIML = no audio control, total content control.
Sampled speech = total audio control, no content control.

There are speech recognition and synthesis technologies that lie between those two extremes, allowing much finer-grained control over individual syllables, which results in a far more natural-sounding voice. But whoever markets this stuff online should be fired on the spot!

I will just concentrate on synthesis for the purpose of this mini-rant. These companies spend all that money/time/brains on getting a decent solution together, and then they present their stuff to the public with plain text, unprocessed, 100% reliant on the synthesis engine's rule set to guess how the sentence should sound. These things always do a lousy job, because it's a very hard problem to solve.

So don't! Sidestep it with speech-to-text technology. You can easily get stuff (I think OS X even had it built in at some point) that will take some audio and convert it to phonetic text+markup info.

If these synthesis companies hard-coded the pitch/timing stuff, they could do a decent job of fooling people into thinking they were listening to recorded speech.

AIML is not a dedicated speech synthesis markup language. That would be SSML, which has markup elements for phonetics, pitch contours, timing, emphasis, and other stuff.
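
Just to make that concrete, here's roughly what one pre-baked phrase could look like. The element names are genuine SSML; the station name and the rate/pitch/break values are plucked out of the air for the example, and I've wrapped it in a C++ string since that's the sort of place it would live in a game:

```cpp
#include <iostream>
#include <string>

// Illustrative only: one pre-baked news phrase expressed as SSML.
// <prosody>, <emphasis> and <break> are standard SSML elements; the
// actual contour values would come from analysing a voice actor's read.
const std::string kNewsIntro = R"(<speak version="1.0" xml:lang="en-GB">
  <prosody rate="95%" pitch="+5%">
    This is <emphasis level="strong">Radio Reidquat</emphasis>,
    <break time="300ms"/> with the hour's headlines.
  </prosody>
</speak>)";

int main() {
    std::cout << kNewsIntro << '\n';  // would be handed to the synthesiser instead
}
```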

I'll make this point again as it's important, I think: a game's speech engine doesn't need to be clever. It doesn't need to cope with endless reams of random text. It can be hard-coded for the most part (with gaps for the bits that will change, as described previously).

It might be a bit samey after a few hours, but there are ways to alleviate that.

One way is quantity - if you're only storing phonetic text+markup you can go wild on the number of variants you have, thereby providing the listener with the sense that it's not just a few canned phrases.

You could even add random variation into the pitch contours and timing, just enough so that a phrase may be delivered a tiny bit differently each time, but not enough to make it sound weird.
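
A rough C++ sketch of what I mean - the struct and the ±3%/±5% ranges are just guesses for illustration, and the real numbers would need tuning by ear:

```cpp
#include <random>
#include <vector>

// One point on a phrase's pitch/timing "program" (hypothetical format).
struct ProsodyPoint {
    double pitch_hz;      // target pitch at this point
    double duration_ms;   // how long to hold/glide towards it
};

// Return a copy of the stored contour with a small random wobble applied,
// so the same phrase is delivered slightly differently each time -
// noticeable over hours of listening, but not enough to sound weird.
std::vector<ProsodyPoint> Humanise(const std::vector<ProsodyPoint>& stored,
                                   std::mt19937& rng) {
    std::uniform_real_distribution<double> pitchJitter(0.97, 1.03);
    std::uniform_real_distribution<double> timeJitter(0.95, 1.05);
    std::vector<ProsodyPoint> out = stored;
    for (ProsodyPoint& p : out) {
        p.pitch_hz    *= pitchJitter(rng);
        p.duration_ms *= timeJitter(rng);
    }
    return out;
}
```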

There is no off-the-shelf product to do all this, but there is technology out there that can do layers of the stack required. Frontier have some clever folks, and Braben already mentioned speech being the next big thing to work on in games. Why not start now?
 
That's the problem, Armour... the tech that you are describing is a theoretical middle-of-the-road solution between the technologies that already exist.

Conceptually it sounds right, but there are implicit technical limits to what you can achieve.

AIML is, at its core, AI: it is used to teach virtual assistants about something, and then while they "talk" to others they "learn" more nodes and look "smarter". I was working on a solution to give a voice to such entities, and the task was not that simple.

This sounds pretty good; I am tempted to give it a try:
http://www.oddcast.com/home/demos/tts/tts_example.php?sitepal

Try having the virtual speaker read this: "And now, let's try some news! Today we found some dead cats in the pool, and was not a good sight. The worst part was the clouds that came in after lunch".

For short sentences, it reads them with the correct intonation; if something like this can be added to the game, maybe it could work.
 
It's not theoretical - it hasn't been put together for a game yet, but the technology exists in the real world to do so.

Speech synthesis markup languages exist.

Speech engines exist that can be tightly controlled (with the aforementioned markup languages) to produce more human sounding output than the norm, and even reduce the CPU overhead (because all the clever stuff is pre-baked).

Speech recognition engines exist, and can produce detailed output that can be tidied/converted into what would be needed.

These are solved problems. The only tricky part is the middle bit, whereby the stored markup phrases are spliced together with the relevant procedurally generated (PG) content (star system names, character names, times/dates, etc.). And that is just regex stuff, really.
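
Something like this little C++ sketch is all I mean by "regex stuff" (the {SYSTEM}-style tag syntax is invented for the example):

```cpp
#include <map>
#include <regex>
#include <string>

// Hypothetical example: splice procedurally generated values into a stored
// phrase. Unknown tags are left untouched.
std::string FillPlaceholders(const std::string& phrase,
                             const std::map<std::string, std::string>& values) {
    static const std::regex tag(R"(\{([A-Z_]+)\})");
    std::string out;
    auto last = phrase.cbegin();
    for (std::sregex_iterator it(phrase.begin(), phrase.end(), tag), end;
         it != end; ++it) {
        out.append(last, (*it)[0].first);         // copy text before the tag
        auto hit = values.find((*it)[1].str());   // look up the tag name
        out += (hit != values.end()) ? hit->second : it->str();
        last = (*it)[0].second;
    }
    out.append(last, phrase.cend());
    return out;
}

// Usage:
//   FillPlaceholders("Pirate activity reported near {SYSTEM}.",
//                    {{"SYSTEM", "Lave"}});
//   -> "Pirate activity reported near Lave."
```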
 
It's not theoretical - it hasn't been put together for a game yet, but the technology exists in the real world to do so.
....

We are running around without really going anywhere with this :) The technology exists, but a specific implementation for our context does not.

I see from the terms that you use that there is a bit of confusion (probably you are researching through the wiki, which can be daunting for non-programmers of this specific technology); let me see if we are talking about the same things:

TTS: text-to-speech; the overall "engine" that allows the computer to translate text into audio output. It is equivalent to the term "speech synthesis". Apple and Microsoft each have their own, plus there are a ton of 3rd-party solutions.
It is made up of different phases/units, each of which does part of the process:

- Text normalization phase (the one that converts symbols into a textual form: abbreviations, punctuation, exclamations and so on)
- Phonetic transcription phase (where text is converted into phonemes and divided into prosodic units; which is what you call the "speech engine", I think)
- Text-to-phoneme phase (the process that relates phonemes to the converted words from the phonetic transcription, including tone, emphasis and such).
- Audio generator (the one that translates what the previous phases produced into an audio format for output).
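
As a very rough sketch, you can think of that chain as a few C++ functions. All the type and function names here are invented; real engines hide these phases behind a single API call, and the bodies (omitted here) are where all the work happens:

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Phoneme { std::string symbol; double duration_ms; double pitch_hz; };
using ProsodicUnits = std::vector<std::vector<Phoneme>>;  // phonemes grouped into prosodic units
using AudioBuffer   = std::vector<int16_t>;               // 16-bit PCM samples

std::string   NormalizeText(const std::string& rawText);        // e.g. "£3.21" -> "three pounds twenty-one"
ProsodicUnits TranscribePhonetically(const std::string& text);  // words -> phonemes + prosodic units
ProsodicUnits ApplyProsodyRules(const ProsodicUnits& units);    // tone, emphasis, timing
AudioBuffer   GenerateAudio(const ProsodicUnits& units);        // final waveform

// The phases run in sequence, as a chain.
AudioBuffer Speak(const std::string& rawText) {
    return GenerateAudio(
        ApplyProsodyRules(
            TranscribePhonetically(
                NormalizeText(rawText))));
}
```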

SSML is a markup language for browser-based applications of TTS engines, around since 2004 (before that they used JSML, I believe).
What SSML does is covered by the "phonetic transcription" part of the TTS engine through its API (Microsoft, Apple or 3rd-party products).
Unless you need to write a web app that runs in a browser, or one that works in conjunction with an interactive voice recognition system (automated phone customer service), you will probably skip SSML.
It was made for web apps running in browsers, and since you can't control TTS directly through a browser, SSML was created to give you a way to access the API from the browser.

You don't need regex for splitting; you pass sentences to the TTS in whatever chunks you believe are right, via the correct API. The whole logic of what to say is based on what you pass.
If the voice has to say something, you pass the command and the string to say; then you pass the parameters for how you want the intonation and everything else, and that's all that is required :)
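
For example, on Windows the whole thing boils down to something like this (an untested SAPI sketch, error handling stripped out; the inline XML tags are SAPI's own TTS markup, and newer voices also accept SSML):

```cpp
#include <windows.h>
#include <sapi.h>

int main() {
    ::CoInitialize(nullptr);
    ISpVoice* voice = nullptr;
    if (SUCCEEDED(::CoCreateInstance(CLSID_SpVoice, nullptr, CLSCTX_ALL,
                                     IID_ISpVoice, reinterpret_cast<void**>(&voice)))) {
        // Rate, pitch and emphasis are passed inline with the text to say.
        voice->Speak(L"<rate speed='-2'><pitch middle='3'>"
                     L"Incoming news bulletin from <emph>Lave</emph> station."
                     L"</pitch></rate>",
                     SPF_IS_XML, nullptr);
        voice->Release();
    }
    ::CoUninitialize();
    return 0;
}
```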

Speech recognition is the inverse process to TTS: it translates audio (as words) into something that the computer can understand, and then into text. It's abbreviated as STT. Unless you want to give vocal commands to the ship, you won't really use it.
A TTS engine records phonemes or words as audio files, and then creates an association via code so it knows what A sounds like, or what CH sounds like. Then it is up to the text-to-phoneme part of the TTS engine to put everything together for the output; so you don't use STT for that.

So, what exactly is the product that you are mentioning? Would you make a whole TTS engine from scratch? That would be OK, but there are already plenty out there, and you can't do better than those, just like I can't start a car company tomorrow and make better cars than people who have been making them for a long time :)

The complexity, once again, is in how you have the universe's news read by this voice, and how you actually gather that news. If we assume that the pods you mentioned are the messengers for the whole universe, then we will have a set of sentences to read, and the result may vary. Can it be done? Yes. Would it sound realistic? Depends. Is it worth spending time on it? I don't know, but in my opinion it wouldn't be as good as a real person speaking.

We still have to figure out how this news gets created though :) We have how it is transported and how you convey it, but how is it actually written as text? Randomly generated?

Now that I've got the spark, I want to see what can actually be done; I'm writing a client now to see how it might work.

In case you want to know how to program the TTS APIs for Windows and OS X:
Microsoft:
http://msdn.microsoft.com/en-us/library/hh362831(v=office.14).aspx

Apple:
https://developer.apple.com/library...n.html#//apple_ref/doc/uid/TP40004365-CH1-SW1
 
I guess realistic voice intonation and modulation would be a NICE TO HAVE feature, but for news and stuff like that the default IVONA voice would be fine for me.

I guess production is the part with the most things to plan.
 
We are running around without really going anywhere with this :) The technology exists, but a specific implementation for our context does not.

Yes, but implementation would be a lot simpler than you were making out earlier, I think.

I see from the terms that you use that there is a bit of confusion (probably you are researching through the wiki, which can be daunting for non-programmers of this specific technology);

Not exactly, I was playing with controlling speech synthesis via code on the BBC Micro back in the 80s, then on the Amiga in the 90s (where I also did some parsing and text manipulation in the context of interactive fiction), and finally on the PC in the late 90's/early 00's (where I got sidetracked by other stuff like an interest in automated translation of language, but I digress). But I get that the jargon I use is a little different than it would be if I had done a university course on it. Anyway...

let me see if we are talking about the same things:

TTS: text-to-speech; the overall "engine" that allows the computer to translate text into audio output. It is equivalent to the term "speech synthesis". Apple and Microsoft each have their own, plus there are a ton of 3rd-party solutions. It is made up of different phases/units, each of which does part of the process:

- Text normalization phase (the one that converts symbols into a textual form: abbreviations, punctuation, exclamations and so on)

Yes. A lot of other use cases are also considered in this pass, like how to accurately translate "3.21" vs "£3.21" vs "£3.21 million", and so on. But we can ignore this pass entirely.

- Phonetic transcription phase (where text is converted into phonemes and divided into prosodic units; which is what you call the "speech engine", I think)
- Text-to-phoneme phase (the process that relates phonemes to the converted words from the phonetic transcription, including tone, emphasis and such).

The resultant data from these passes would be pre-generated as a result of applying speech recognition to the output of a voice actor, collecting the result of the first couple of passes from that process, and storing it for later use.

- Audio generator (the one that translates what the previous phases produced into an audio format for output).

I think this is the trickiest bit of the lot if it is to be done right, but still not that hard.

SSML is a markup language for browser-based applications of TTS engines, around since 2004 (before that they used JSML, I believe).
Yes, it was just used as a real example of the kind of markup language I was talking about. The one used in the game engine wouldn't need to be so general and bloated.

You don't need regex for splitting; you pass sentences to the TTS in whatever chunks you believe are right, via the correct API. The whole logic of what to say is based on what you pass.

Again, a shorthand for saying "leave placeholder tags in the pre-generated dialogue strings that can be replaced at run-time with the events/places/people that the piece of dialogue is about". Regex processing is not required - although the code already exists and doesn't cost anything to use, AFAIK.

Speech recognition is the inverse process to TTS: it translates audio (as words) into something that the computer can understand, and then into text. It's abbreviated as STT. Unless you want to give vocal commands to the ship, you won't really use it.

Unless, of course, you want to generate large amounts of dialog, correctly intonated and timed, quickly and relatively inexpensively, without having to hand code all the pitch contours and timings, or get a TTS engine to try and handle it algorithmically...

So, what exactly is the product that you are mentioning? Would you make a whole TTS engine from scratch? That would be OK, but there are already plenty out there, and you can't do better than those, just like I can't start a car company tomorrow and make better cars than people who have been making them for a long time :)

I'm not suggesting building it all from scratch, but I am suggesting that a clever developer could take the useful parts of each, and "Frankenstein" them into something that allows them to achieve their aim.

The complexity, once again, is in how you have the universe's news read by this voice, and how you actually gather that news. If we assume that the pods you mentioned are the messengers for the whole universe, then we will have a set of sentences to read, and the result may vary. Can it be done? Yes. Would it sound realistic? Depends. Is it worth spending time on it? I don't know, but in my opinion it wouldn't be as good as a real person speaking.

But it would essentially be a real person speaking, for the most part. It would be using the words a voice actor said, and the manner in which they were expressed, as a program to send to the synthesiser's output. It might not be their exact voice, but even that could be approached if the timbre were approximated algorithmically.

We still have to figure out how this news gets created though :) We have how it is transported and how you convey it, but how is it actually written as text? Randomly generated?

Random selection of relevant canned phrases (lots and lots of them), with placeholder tags left in for the pertinent, dynamic bits of info. That's where the stuff that text-based IF (Interactive Fiction) has to offer comes in.
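
i.e. something along these lines (tag syntax, station and faction names invented, obviously - the real list would run to hundreds of variants per event type):

```cpp
#include <random>
#include <string>
#include <vector>

// Hypothetical sketch of IF-style canned-phrase selection: many variants of
// the same news item, each with placeholder tags to be filled in later
// (e.g. by something like the FillPlaceholders() sketch further up the thread).
const std::vector<std::string> kPiracyVariants = {
    "Reports are coming in of pirate activity near {SYSTEM}.",
    "Traders are advised to avoid {SYSTEM} after a spate of attacks.",
    "{FACTION} authorities have issued a piracy warning for {SYSTEM}.",
    // ...hundreds more, cheap to store because they are only text + markup
};

// Pick one variant at random so the bulletin doesn't sound canned.
std::string PickVariant(std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, kPiracyVariants.size() - 1);
    return kPiracyVariants[pick(rng)];
}
```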

Now that I've got the spark, I want to see what can actually be done; I'm writing a client now to see how it might work.

Good luck!
 
Armour, I think that if something is not happening, it is because it takes too long, is not useful, or simply can't be done... we may go around the topic, but the facts are these: companies spend a fortune to do voice-over in games, because there are no alternatives that give the same results and richness in the outcome.

The tech may be there, but we have the theoretical tech to do cold fusion too... in 30 years it hasn't happened yet, and there must be a reason, since we have the principle and the tech.

BTW, the BBC and the Amiga had rudimentary phoneme converters and vocal synthesizers; they were good for their time, just as what we have today is good for our time, but it won't fool anyone into believing that it is a real person speaking :)

The resultant data from these passes would be pre-generated as a result of applying speech recognition to the output of a voice actor, collecting the result of the first couple of passes from that process, and storing it for later use.

You lost me here... the resultant data of the phonetic transcription and text-to-phoneme phases can't be pre-generated, because this is the output that goes to the audio generator. They work in sequence, as a chain, so how can you pre-generate this if it gets calculated based on the text that you pass to the text normalization phase?

Text -> text normalization -> phonetic transcription -> text to phoneme -> audio generator

A human being could not pre-generate it, unless they did it by hand, which has about the same utility as a person writing pure binary code instead of assembly. I believe that in 100 years there have been 3 people who did that :)

You use rules in the text-to-phoneme phase to customize the result of what the previous phases produce, but that is mostly the peak of human interaction in the process... if you need to do more work than that, then the TTS engine that you are using is pure garbage :)

As a loose analogy, imagine hand-cranking your car because you want control over the electrical energy that you send to the engine, instead of using a battery and a starter and just turning the key to start the car :)

I get your overall idea, and since you have been doing this since the 80s, I am pretty sure that you can write some code to implement it. I can show my solution and let's see if we can integrate both, because I have no clue how you would implement what you describe XD I spent the past week looking into this and talking with people at the university, and nobody had a clear idea of how to achieve it, beyond writing all the TTS engine parts from scratch just for this purpose.

Which language do you use? Are you OK with C++? That's the one I am most fluent with... I once used to write assembly for the 68K on the awesome ASMOne... now I don't even remember how to do a bit shift on a register :p

BTW, we can't "Frankenstein" it, because TTS engines do not expose their internals; they just offer whatever the programmer deemed important to expose. So you can't actually access just the code for the text normalization, phonetic transcription, text-to-phoneme or audio generator phases... we would need to write them all from scratch.
 
Hell yeah!

*Intones* "ALL the Hits! ALL the time! 93.4...Radio Reidquat...
You're listening in Colour!"

HIRED!

Frontier will send someone to pick you up and incarcerate you for a couple of years, so you can be the "totally voluntary" DJ meister of the Elite radio channel :D

Then someone else will take your seat "voluntarily" XD
 
You lost me here... the resultant data of the phonetic transcription and text-to-phoneme phases can't be pre-generated...

These are some example steps to pre-generate that data, I would think:

1) Get a readable screen.
2) Get a human being who is able to read.
3) Sit the human being in a chair in front of the screen.
4) Point a microphone at their mouth.
5) Display the words "Read the following text:", followed by the source text, on the screen.
6) Record the audio.
7) Use speech recognition to convert the recorded audio to phonetic text + pitch/timing data.

(The output steps should be fairly self-explanatory.)
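
The stored result of step 7 might look something like this (field names and units invented for illustration):

```cpp
#include <string>
#include <vector>

// Hypothetical storage format for the data produced in step 7: phonetic
// text plus the pitch/timing actually measured from the voice actor's read.
struct PhoneEvent {
    std::string phoneme;     // e.g. "r", "iy", "d" for "read"
    double      start_ms;    // onset, relative to the start of the phrase
    double      duration_ms; // how long the phoneme lasts
    double      pitch_hz;    // average (or target) pitch over the phoneme
};

struct StoredPhrase {
    std::string             sourceText;  // the text the actor was shown
    std::vector<PhoneEvent> events;      // what the recogniser measured
};
```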

As for me coding - I already have a job. I might consider tinkering with CMU Sphinx a bit to see if I can get anything usable out of it on the recognition side, but that would be on my own time, so I can make no promises about how much of that precious time I would be willing to devote to said tinkering - especially after the end of Q4 2014!
 
These are some example steps to pre-generate that data, I would think:

1) Get a readable screen.
2) Get a human being who is able to read.
3) Sit the human being in a chair in front of the screen.
4) Point a microphone at their mouth.
5) Display the words "Read the following text:", followed by the source text, on the screen.
6) Record the audio.
7) Use speech recognition to convert the recorded audio to phonetic text + pitch/timing data.

(The output steps should be fairly self-explanatory.)

As for me coding - I already have a job. I might consider tinkering with CMU Sphinx a bit to see if I can get anything usable out of it on the recognition side, but that would be on my own time, so I can make no promises about how much of that precious time I would be willing to devote to said tinkering - especially after the end of Q4 2014!

I see, so this has nothing to do with what I was talking about.

I was specifying the different phases of the TTS conversion, while you are talking at a high level about what the system should do.

In fact, following what you wrote, you are not feeding data to the phonetic transcription and text-to-phoneme sections of the TTS engine; you are preparing the data that will then be sent to the audio generator phase.

That is the phase where the recorded phonemes are retrieved and associated with the output of the phonetic transcription and text-to-phoneme phases.

Yes, we have that already; the complex part is concatenating the recorded audio and associating it with the specific text :) This is what I was posting in my first reply; I guess that now we are on the same "station" (pun intended).

I have a job too, and a family with kids, so I hear you. Don't worry, the game is not going anywhere for at least 6 months ;) Whenever you have time, put something on GitHub and I will do the same.

BTW, I am not sure why you mentioned Sphinx, since that is a voice recognition framework, which has nothing to do with a TTS engine. It talks back, but it sounds like an Amiga from 1986... I thought you mentioned that we already have technology far superior to that, so what use is Sphinx?
 
BTW, I am not sure why you mentioned Sphinx, since that is a voice recognition framework, which has nothing to do with a TTS engine.
<SNIP>
what use is Sphinx?
I thought that would be plain, based upon your reply. I will quote my previous post:
These are some example steps to pre-generate that data, I would think:
<SNIP>
7) Use speech recognition to convert the recorded audio to phonetic text + pitch/timing data.
 
Ah, right.

But then we discussed that it is not needed, since you would put in the tone, emphasis and emotional content via code.

The idea was to feed a text file with everything that has to be transmitted to the engine, which will then read it; so the part where someone reads it aloud and it gets converted into text is redundant at that point.
 
Ah, right.

But then we discussed that it is not needed, since you would put in the tone, emphasis and emotional content via code.

Whenever did I write that?

The idea was to feed a text file with everything that has to be transmitted to the engine, which will then read it; so the part where someone reads it aloud and it gets converted into text is redundant at that point.

Eh? You still need the phonetic and prosodic data. That would be the whole point of doing it this way.
 
Whenever did I write that?



Eh? You still need the phonetic and prosodic data. That would be the whole point of doing it this way.

You wrote about the probe that travels the universe bringing news... how would that work? If the probe carries an audio message, then what's the point of translating it into text and then re-converting it into audio?

The phonetic and prosodic data comes from the text itself... ahem... that is what the TTS engine does :O

Otherwise it is like using a battery to power a lamp that illuminates a photovoltaic cell that recharges a battery... you skip the whole process and just use the battery that you already have :)

With our current technology level, there is no need to have someone read a sentence in order to put the emphasis into the speech; that's why I was confused all along about your approach.
 
Well, just imagine the revenue if Elite also turned into a real-world marketplace for advertisers, with online purchases happening in-game and the radio just a VoIP in-game advertising medium. Elite could turn into a space online shopping mall... this time, though, it's buying real-world stuff in an in-game virtual world. Kinda exciting, and it would bring good revenue to Frontier and continual, progressive development of the Elite world. I am sure that by then, running a 20-person team to do this wouldn't be a problem at all... :D
 
The phonetic and prosodic data comes from the text itself... ahem... that is what the TTS engine does :O
...
With our current technology level, there is no need to have someone read a sentence in order to put the emphasis into the speech; that's why I was confused all along about your approach.

I disagree with that last bit. Every time I hear TTS trying to read text it sounds robotic - not because of the voice quality at all, but because it just sounds "off"; the prosody is usually the issue. That's why you get that data from a real human.

And the reason you do it that way is because you can use the phonetic text as a basis for custom content in a way you can't do with audio. Plus it will take up a lot less space.
 
I disagree with that last bit. Every time I hear TTS trying to read text it sounds robotic - not because of the voice quality at all, but because it just sounds "off"; the prosody is usually the issue. That's why you get that data from a real human.

And the reason you do it that way is because you can use the phonetic text as a basis for custom content in a way you can't do with audio. Plus it will take up a lot less space.

What makes it sound robotic is a TTS engine that does not have good sampling for the phonemes, or phonetic transcription and text-to-phoneme sections that are poorly configured.

If the audio breaks up while the sentence is played, then the issue is in the text-to-phoneme phase; if it sounds emotionless, that's because the parameters passed to the phonetic transcription phase are not correct.

A human won't fix the problem, since in the conversion from speech to text you introduce a whole array of possible errors and variables.
It is one thing to train your computer to understand your voice, tone and pronunciation, and another to use a TTS engine to read a text with correct emphasis and emotion.

From a brief search, nobody uses this approach, for good reasons; once the TTS engine is correctly configured, you control the parameters using punctuation and other symbols. I posted a link to a TTS API that correctly interpreted the mood of the conversation based on how you set the punctuation, simply by typing in the sentence. Check the link that I posted a few posts ago, and you can hear how a good TTS engine works when correctly made :)
 
I like the idea of having GTA-style in-universe radio stations, though they'd all have to be Classic FMs, considering how far in the future the game is set.

However, I have to admit, when I read the title I was actually thinking of radio stations as in radio communication channels, allowing some form of group voice chat regardless of where you are, by tuning your comms to a specific frequency. And that I find a little more interesting, to be honest.
 
What I would love to have is some sort of news broadcast that you can tune into if you want, like a 5-minute news broadcast every hour, on the hour.
This could be fairly easily recorded by a Frontier employee each workday morning (or possibly by 2 or 3 different people on different schedules taking turns, so we get news every day and are not limited to one person's schedule), or maybe twice a day to get some variation.
The news should be mostly fabricated, but also include stuff that we've seen in the newsletter: stats, what seems to sell and where, a flare-up of piracy at some stations, etc.

Some jibber-jabber talk show would be great as well, but then we'd need a more serious setup with people who would do that on a regular and longer-term basis, and I'm not sure that's so feasible.
It would be cool though.
 
What makes it sound robotic is a TTS engine that does not have good sampling for the phonemes, or phonetic transcription and text-to-phoneme sections that are poorly configured.

If the audio breaks up while the sentence is played, then the issue is in the text-to-phoneme phase; if it sounds emotionless, that's because the parameters passed to the phonetic transcription phase are not correct.

Yes, I know about the causes of robotic speech. Back in the old days it could equally well be either deficiency. Today, I find that it's usually the data being fed in, as opposed to any inherent deficiency in the voice quality. Hence my thoughts about feeding in an accurate version of how a human being would say each sentence. I recently made an enquiry to someone working commercially in the field of speech recognition/synthesis about this idea, so I'll see what they have to say on the subject too.
 