Well, try announcing the countdown yourself in a steady, calm manner. You will most probably find that it takes a little less than a second to say each number, and the rest of the time is needed to keep the countdown steady and calm.
I can count to ten out loud, clearly and without rushing, in about 3.5 seconds,
so it takes about 3/10 to 4/10 of a second to pronounce each number.
The voice in the countdown is similar.
I studied the countdown closely last night.
The first thing on screen is 5.000.
At the same moment the voice begins saying "four".
She finishes saying "four" about a third of a second later.
The other numbers are similar:
5.000 - 4.700 "Four"
4.000 - 3.600 "Three"
3.000 - 2.700 "Two"
2.000 - 1.700 "One"
1.000 - 0.400 "Engage"
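
As a rough check on the figures in the list above, here is a minimal Python sketch that works out how long each word takes and how far ahead of its own timer value it is spoken. The WORD_VALUE table is my own assumption for illustration (in particular, treating "Engage" as zero, the moment of the jump); the timings are simply the ones I measured.

```python
# Measured spans from the list above: (timer at start, timer at end, word spoken).
measurements = [
    (5.0, 4.7, "Four"),
    (4.0, 3.6, "Three"),
    (3.0, 2.7, "Two"),
    (2.0, 1.7, "One"),
    (1.0, 0.4, "Engage"),
]

# Assumption for illustration: the timer value each spoken word stands for.
# "Engage" is treated as zero, the moment of the jump.
WORD_VALUE = {"Four": 4, "Three": 3, "Two": 2, "One": 1, "Engage": 0}


def analyse(rows):
    """Return (word, duration in seconds, lead in seconds) for each row."""
    result = []
    for start, end, word in rows:
        duration = round(start - end, 3)           # how long the word takes to say
        lead = round(start - WORD_VALUE[word], 3)  # how early it starts vs its own value
        result.append((word, duration, lead))
    return result


for word, duration, lead in analyse(measurements):
    print(f"{word:<7} takes {duration:.1f} s, starts {lead:.1f} s before the timer shows its value")
```

Under that assumption every word starts a full second before the timer reaches the value being spoken.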
In order for the spoken numbers and the countdown to be in sync,
the countdown needs to be
5, 4, 3, 2, Engage (with a pre-boom Engage)
or
5, 4, 3, 2, 1, Engage (with a post-boom Engage)
Personally I feel the hyper boom is a really nice effect and the "Engage" only detracts from it.
I would go with
5, 4, 3, 2, 1 (let the boom speak for itself)
Of course there is no physical law that determines when a number has to be spoken to be in sync,
and it could be argued that the top digit of 4.999 is what the "four" is linked to.
With such a subjective topic we can only refer to what is expected from cultural experience.
The most common countdown for most people is the countdown to New Year or some other happy event.
In such scenarios it's typical for each number to be announced in such a way that
3 is said at the moment exactly 3.000 seconds are left
2 is said at the moment exactly 2.000 seconds are left
1 is said at the moment exactly 1.000 seconds are left
then the event happens (New Year)
and everyone cheers momentarily after the event.
If you want a statement such as "Engage" before the FTL jump, that statement must replace the spoken "1". Otherwise you displace all the spoken digits one second out of sync with what is culturally expected for a countdown, and the countdown looks out of sync.
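
To make the convention concrete, here is a small Python sketch of the schedule I am describing: each word starts at the moment the timer shows its own value, and the final word takes the slot of the spoken "1". The function name spoken_schedule and its parameters are made up for illustration and have nothing to do with how the game actually triggers its audio.

```python
def spoken_schedule(start=5, final_word="Engage"):
    """Map timer value (seconds left) to the word that should start at that moment.

    Hypothetical helper: digits are spoken on their own value, and
    final_word replaces the spoken "1".
    """
    schedule = {float(n): str(n) for n in range(start, 1, -1)}
    schedule[1.0] = final_word
    return schedule


for seconds_left, word in spoken_schedule().items():
    print(f"{seconds_left:.3f} s left: say {word!r}")
```

Passing final_word="1" gives the plain 5, 4, 3, 2, 1 version, where the boom speaks for itself.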