TTS Engine Comparison

cyclicalobsessive · January 21, 2021, 4:53pm

I thought I might investigate a clearer-sounding text-to-speech engine for Carl, but found out that clarity comes with a long wait:


( using phrase: "<engine-name> Hi. My name is Carl. How do you like this voice?" )
$ ./make_tts_samples.sh
*****
flite
real 0m0.565s
user 0m0.122s
sys 0m0.050s

*****
espeak-ng
real 0m0.283s
user 0m0.064s
sys 0m0.044s

*****
cepstral
real 0m7.479s
user 0m3.727s
sys 0m0.981s

*****
plib espeak-ng option rate 150 volume 125
real 0m0.386s
user 0m0.101s
sys 0m0.058s


$ ls -1 samples/
cepstral-charlie.wav
espeak-ng.wav
flite_ssml.wav <- sample with flite reading a file with SSML (speech synthesis markup language)
flite.wav
plib_espeak_ng.wav <- voice I am currently using for Carl (options -s150 -a125)
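
For anyone wanting to repeat the comparison, here is a minimal sketch of a timing harness in Python. It is not the actual make_tts_samples.sh, which is not shown in this thread. The function names (`build_commands`, `time_command`) and the output file names are illustrative assumptions, and the third entry calls espeak-ng directly with `-s 150 -a 125` as a stand-in for the plib call. Cepstral is left out because its command line does not appear in the thread. Engines that are not installed are skipped.

```python
import os
import shutil
import subprocess
import time

PHRASE = "Hi. My name is Carl. How do you like this voice?"


def build_commands(out_dir="samples"):
    """Return {label: argv} for each engine; labels and file names are illustrative."""
    return {
        "flite": ["flite", "-t", f"flite {PHRASE}",
                  "-o", os.path.join(out_dir, "flite.wav")],
        "espeak-ng": ["espeak-ng", "-w", os.path.join(out_dir, "espeak-ng.wav"),
                      f"espeak-ng {PHRASE}"],
        # Assumption: stands in for the plib call by passing rate/volume directly.
        "espeak-ng -s150 -a125": ["espeak-ng", "-s", "150", "-a", "125",
                                  "-w", os.path.join(out_dir, "plib_espeak_ng.wav"),
                                  f"plib espeak-ng {PHRASE}"],
    }


def time_command(argv):
    """Run argv; return elapsed wall-clock seconds, or None if argv[0] is not on PATH."""
    if shutil.which(argv[0]) is None:
        return None
    start = time.perf_counter()
    subprocess.run(argv, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start


def main():
    os.makedirs("samples", exist_ok=True)
    for label, argv in build_commands().items():
        elapsed = time_command(argv)
        if elapsed is None:
            print(f"{label}: not installed, skipped")
        else:
            print(f"{label}: real {elapsed:.3f}s")


if __name__ == "__main__":
    main()
```

This reports only wall-clock time (the "real" line above), not the user and sys figures that the shell's `time` prints.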

Result

The very clear Cepstral Charlie voice comes at the cost of a 5-to-7-second delay before speaking, and a US$31.50 price!

jimrh · January 23, 2021, 1:18pm

Even if you ignore the relatively steep price tag for a clearer voice (which is a design decision / judgement call), the five-plus-second delay is entirely unacceptable from a UI perspective.

Congratulations on understanding that an engineering accomplishment should not only work, but also be reasonable from a non-engineer user’s point of view. Many engineering types are totally incapable of making that mental jump in point of view.

Total props and a good call.

cyclicalobsessive · January 23, 2021, 2:19pm

And Carl’s POV…