Lots of interesting stuff mentioning sampled phrases in sports game commentary and AIML deficiencies, amongst other things
I suppose I'm talking about a halfway house between the two.
AIML = no audio control, total content control.
Sampled speech = total audio control, no content control.
There are speech recognition and synthesis technologies that lie between the two extremes. They allow far more fine-grained control over individual syllables, which results in a much more natural-sounding voice. But whoever markets this stuff online should be fired on the spot!
I will just concentrate on synthesis for the purpose of this mini-rant. These companies spend all that money/time/brains on getting a decent solution together, and then they present their stuff to the public with plain text, unprocessed, 100% reliant on the synthesis engine's rule set to guess how the sentence should sound. These things always do a lousy job, because it's a very hard problem to solve.
So don't! Sidestep it with speech-to-text technology: record a person saying the line, then let the software work out how it was said. You can easily get stuff (I think OS X even had it built in at some point) that will take some audio and convert it to phonetic text plus markup for pitch and timing.
If these synthesis companies hard-coded the pitch and timing for their demo sentences like that, instead of letting the engine guess, they could do a decent job of fooling people that they were listening to recorded speech.
AIML is not a dedicated speech synthesis markup language. That would be SSML, which has markup elements for phonetics, pitch contours, timing, emphasis, and other stuff.
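To give a feel for what that markup looks like, here is a rough sketch of a hand-marked-up commentary line using the standard SSML elements `prosody`, `break`, `emphasis` and `phoneme`. The sentence, the player name "Kovac" and its IPA transcription are made up for illustration; the Python around it only checks that the markup is well-formed and lists the elements used.

```python
import xml.etree.ElementTree as ET

# Hypothetical commentary line, marked up by hand with standard SSML elements.
# The name "Kovac" and its IPA transcription are illustrative only.
SSML = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-GB">
  <prosody rate="fast" pitch="+15%">He shoots,</prosody>
  <break time="250ms"/>
  <emphasis level="strong">he scores!</emphasis>
  What a goal from <phoneme alphabet="ipa" ph="ˈkoʊvætʃ">Kovac</phoneme>.
</speak>"""


def element_names(ssml: str) -> list[str]:
    """Parse the SSML and return its element names in document order."""
    root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
    # Tags come back as "{namespace}name"; keep only the local name.
    return [el.tag.split("}")[-1] for el in root.iter()]


print(element_names(SSML))
```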
I'll make this point again as it's important, I think: a game's speech engine doesn't need to be clever. It doesn't need to cope with endless reams of random text. It can be hard-coded for the most part, with slots left for the bits that will change, as described previously.
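A minimal sketch of the "hard-coded with slots" idea, assuming the phrases are stored as SSML strings. The template text, the slot names `player` and `minute`, and the `render` helper are all hypothetical; the only real work at run time is escaping the slot values so they cannot break the markup.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Authored once by hand; only {player} and {minute} change at run time.
# Both slot names are made up for this example.
GOAL_TEMPLATE = (
    "<speak>"
    '<prosody rate="fast" pitch="+10%">{player} scores</prosody>'
    '<break time="200ms"/>'
    'in the <emphasis level="moderate">{minute}</emphasis> minute.'
    "</speak>"
)


def render(template: str, **slots: str) -> str:
    """Fill the slots, XML-escaping each value so the markup stays valid."""
    safe = {name: escape(str(value)) for name, value in slots.items()}
    return template.format(**safe)


line = render(GOAL_TEMPLATE, player="Smith & Jones", minute="89th")
ET.fromstring(line)  # sanity check: raises ParseError if the result is malformed
print(line)
```

The pitch, timing and emphasis never have to be guessed by the engine; they were decided when the phrase was authored.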
It might be a bit samey after a few hours, but there are ways to alleviate that.
One way is quantity - if you're only storing phonetic text+markup you can go wild on the number of variants you have, thereby providing the listener with the sense that it's not just a few canned phrases.
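Something like the following could handle the selection side, as a rough sketch: keep a pile of variants per event and refuse to repeat any of the last few used. The `VariantPicker` class, its `memory` parameter and the example phrases are assumptions of mine, not part of any existing engine.

```python
import random
from collections import deque


class VariantPicker:
    """Pick a random variant, never repeating any of the last `memory` picks."""

    def __init__(self, variants, memory=2, rng=None):
        # Drop duplicate entries while keeping the authored order.
        self.variants = list(dict.fromkeys(variants))
        if len(self.variants) <= memory:
            raise ValueError("need more distinct variants than memory")
        self.recent = deque(maxlen=memory)
        self.rng = rng or random.Random()

    def pick(self):
        candidates = [v for v in self.variants if v not in self.recent]
        choice = self.rng.choice(candidates)
        self.recent.append(choice)  # the oldest entry falls off automatically
        return choice


# Hypothetical variants for a single "goal scored" event.
GOAL_LINES = [
    '<speak><emphasis level="strong">Goal!</emphasis></speak>',
    '<speak><prosody pitch="+20%">It is in!</prosody></speak>',
    '<speak><prosody rate="slow">What a finish.</prosody></speak>',
    '<speak>He has done it <emphasis>again</emphasis>.</speak>',
]

picker = VariantPicker(GOAL_LINES, memory=2)
for _ in range(5):
    print(picker.pick())
```

Since each variant is only a short string, storing dozens per event costs next to nothing compared with recorded audio.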
You could even add random variation into the pitch contours and timing, just enough so that a phrase may be delivered a tiny bit differently each time, but not enough to make it sound weird.
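Here is one way the random variation might look, assuming the authored pitch contour is kept as a list of (position %, pitch offset %) points and rendered into the SSML `prosody` element's `contour` and `rate` attributes. The function names, the base contour and the jitter sizes are all guesses for illustration, and I'm assuming the engine accepts percentage values for `rate`; how faithfully a given engine honours `contour` will vary.

```python
import random
from xml.sax.saxutils import escape

# Authored contour: (position in the phrase as %, pitch offset as %).
# These numbers are illustrative, not measured from real speech.
BASE_CONTOUR = [(0, 0.0), (50, 12.0), (100, -8.0)]


def jitter(contour, rate_pct=100.0, pitch_jitter=3.0, rate_jitter=4.0, rng=None):
    """Nudge each pitch target and the speaking rate by a small bounded amount."""
    rng = rng or random.Random()
    nudged = [
        (position, pitch + rng.uniform(-pitch_jitter, pitch_jitter))
        for position, pitch in contour
    ]
    rate = rate_pct + rng.uniform(-rate_jitter, rate_jitter)
    return nudged, rate


def to_prosody(text, contour, rate_pct):
    """Render the text wrapped in an SSML prosody element."""
    points = " ".join(f"({position}%,{pitch:+.1f}%)" for position, pitch in contour)
    return f'<prosody contour="{points}" rate="{rate_pct:.0f}%">{escape(text)}</prosody>'


contour, rate = jitter(BASE_CONTOUR)
print(to_prosody("What a goal!", contour, rate))
```

Keeping the jitter bounds small is what stops it from sounding weird: the shape of the contour stays as authored, and only the exact values move.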
There is no off-the-shelf product that does all of this, but there is technology out there that covers individual layers of the stack. Frontier have some clever folks, and Braben already mentioned speech being the next big thing to work on in games. Why not start now?