EmbeddedRelated.com
Forums

TTS "specification"

Started by Don Y May 30, 2015
Hi,

I have a few synthesizers that I've created (to explore different
approaches, resource requirements, complexity, quality, etc.).
Application is *semi* limited domain; i.e., I know *many* of the
things it will be asked to say -- but, not all!

I'm trying to sort out a means of grading their respective "quality".
I.e., I can easily measure how much text they occupy, how long they
take to process a given string, how much RAM they require, etc.

But, the tough part is trying to decide how "well they speak".
(e.g., I can make NOISES with very small text, RAM, MIPS... but,
you probably wouldn't consider those *noises* to be SPEECH!  :> )

Initially, I'm just looking at the text-to-sound/phoneme portion of
the synthesizer.  E.g., skipping text normalization, prosody, etc.
Those are separate issues and the algorithms can be layered onto
any of the text-to-sound algorithms.

One OBVIOUS way of evaluating "quality" is just to feed them words
and see if they map those *graphemes* into the proper *phonemes*!
To that end, there are some pronouncing dictionaries that I can
use:  feed each word to a TTS and compare the resulting phonemes
to those that the dictionary claims are "correct" (assume a
pronunciation that matches that of any legitimate heteronym is
"correct").
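In rough Python, the comparison I have in mind looks something like
this (a toy sketch -- the dictionary entries, phoneme notation, and
word list are invented for illustration; a real run would use a full
pronouncing dictionary with all heteronym variants):

```python
# Score a TTS front-end's grapheme-to-phoneme output against a
# pronouncing dictionary.  The dictionary maps each word to the SET of
# acceptable pronunciations, so any legitimate heteronym counts as
# "correct".

def g2p_accuracy(tts_phonemes, dictionary):
    """tts_phonemes: {word: phoneme tuple produced by the TTS}
       dictionary:   {word: set of acceptable phoneme tuples}
       Returns the fraction of words pronounced acceptably."""
    correct = sum(
        1 for word, phones in tts_phonemes.items()
        if phones in dictionary.get(word, set())
    )
    return correct / len(tts_phonemes)

# Toy dictionary in an ARPAbet-like notation (entries made up here)
dictionary = {
    "boat":   {("B", "OW", "T")},
    "read":   {("R", "IY", "D"), ("R", "EH", "D")},  # heteronym: both OK
    "syzygy": {("S", "IH", "Z", "AH", "JH", "IY")},
}

tts_output = {
    "boat":   ("B", "OW", "T"),                      # matches
    "read":   ("R", "EH", "D"),                      # matches a variant
    "syzygy": ("S", "AY", "Z", "AY", "JH", "IY"),    # wrong
}

print(g2p_accuracy(tts_output, dictionary))  # 2 of 3 words correct
```

A refinement would be to score *partial* credit per word (phoneme edit
distance against the closest dictionary variant) instead of the
all-or-nothing match above.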

But, even there, how can I score dictionary content without considering
the likelihood of encountering it?  E.g., if TTS#1 gets "boat" correct
but "syzygy" wrong, is that the same level of performance as TTS#2
getting "syzygy" *right* but "boat" wrong?!
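One answer is to weight each word's pass/fail by its likelihood of
actually being encountered -- e.g., its unigram frequency in a corpus
representative of the application domain.  A sketch (the frequencies
below are invented; real ones would come from that corpus):

```python
# Frequency-weighted pronunciation score: a miss on a common word costs
# far more than a miss on a rare one.

def weighted_accuracy(results, freq):
    """results: {word: True/False -- pronounced acceptably?}
       freq:    {word: relative frequency of the word}
       Returns the frequency-weighted fraction pronounced correctly."""
    total = sum(freq[w] for w in results)
    return sum(freq[w] for w, ok in results.items() if ok) / total

# "boat" is vastly more likely than "syzygy" (made-up numbers)
freq = {"boat": 1e-3, "syzygy": 1e-7}

tts1 = {"boat": True,  "syzygy": False}   # common word right
tts2 = {"boat": False, "syzygy": True}    # rare word right

print(weighted_accuracy(tts1, freq))      # near 1.0
print(weighted_accuracy(tts2, freq))      # near 0.0
```

Under this metric TTS#1 scores almost perfectly and TTS#2 scores almost
zero, even though each got exactly one word right -- which matches the
intuition that "boat" matters far more in practice.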

Stepping back a bit further, how do I *specify* the desired performance
a priori given the unconstrained nature of the potential inputs?

Thx,
--don