EmbeddedRelated.com

Embeddable text-to-speech

Started by pozz April 7, 2015
I'm designing an electronic gadget that will interact with humans through
IVR (Interactive Voice Response) and keypad. The user hears the voice and
presses some buttons to take actions.

Most of the sentences are well known at design time, so I can generate and
record them on a computer and save them in memory (PCM, ADPCM, ...).
Unfortunately some sentences are customizable by the user, so they are
known only at run-time.

So I'm thinking of TTS (Text To Speech) technology that generates whatever
word/sentence at run-time, starting from the associated string.

How difficult is it to integrate TTS functionality in an electronic
product? How much MCU power does TTS need? Do you know of any TTS
libraries that can be embedded in an electronic project? Do you know of
any free libraries?

Please note, I don't need real "on-the-fly" TTS. I can afford to spend
some time generating the short message before playing it.
> How difficult is it to integrate TTS functionality in an electronic
> product? How much MCU power does TTS need? Do you know of any TTS
> libraries that can be embedded in an electronic project? Do you know of
> any free libraries?
As long as you keep your requirements bounded, it should be easy to achieve
your goals. As a point of reference, I was able to output 8-bit PCM data
(wave files) at an 8 kHz sample rate to a PWM pin with no problem on an
MSP430 running at 8 MHz. The output ISR ran at the sample rate with about
75% loading.

JJS
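As a sanity check on the numbers JJS quotes, the cycle budget works out as below. This is a back-of-the-envelope sketch: the 75% loading figure is his measurement, everything else is plain arithmetic.

```python
# Back-of-the-envelope check of the playback budget described above:
# 8-bit PCM at an 8 kHz sample rate, fed to a PWM pin from an ISR on an
# 8 MHz MSP430. The 75% loading figure is JJS's measurement; the rest
# is arithmetic.

CPU_HZ = 8_000_000       # MSP430 clock
SAMPLE_RATE = 8_000      # PCM sample rate

cycles_per_sample = CPU_HZ // SAMPLE_RATE   # cycles between ISR entries
isr_cycles = int(cycles_per_sample * 0.75)  # ~75% loading -> cycles spent in the ISR

def seconds_of_audio(storage_bytes, bits_per_sample=8, rate=SAMPLE_RATE):
    """How much uncompressed speech fits in a given amount of storage."""
    return storage_bytes * 8 / (bits_per_sample * rate)

print(cycles_per_sample)             # 1000
print(isr_cycles)                    # 750
print(seconds_of_audio(64 * 1024))   # 8.192 -> ~8 s of speech per 64 KiB
```

The last line also shows why "record everything at design time" gets expensive fast: uncompressed 8-bit/8 kHz speech costs 8 KB per second, which is one reason ADPCM or LPC encodings come up in this kind of design.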
On 4/7/2015 5:17 AM, pozz wrote:
> I'm designing an electronic gadget that will interact with humans through
> IVR (Interactive Voice Response) and keypad. The user hears the voice and
> presses some buttons to take actions.
What quality of speech? What level of naturalness? A single voice? Or, user-selectable/customizable? Presumably entirely in English? Or, do you need to support other languages? Concurrently??
> Most of the sentences are well known at design time, so I can generate
This is called "limited domain synthesis". Think of TI's "Speak & Spell"
product ("The cow goes 'mooo'")
> and record them on a computer and save them in memory (PCM, ADPCM, ...).
> Unfortunately some sentences are customizable by the user, so they are
> known only at run-time.
And there's the rub!

How does the user indicate the message to be spoken? I.e., is it
"unconstrained text" that you read from a char[]? Could the user opt to
command it to speak "I'd rather have this bottle in front of me than a
frontal lobotomy"? Or, will the sentences/phrases still be largely
constrained by the application domain: "The date of your last withdrawal
was..."?

I.e., could you provide a set of ("prerecorded") words that the user can
then "string together" to form messages? So, the actual message is created
by the user but built from words that your device already knows how to
speak?

What happens if the user specifies a word that is hard to pronounce
("Phoenix", "salmon", "Worcester", etc.) using "canned" rules? What
happens if the user specifies a "word" (sequence of letters) that is
unpronounceable (Mr. Mxyzptlk)? How do you handle special characters
(pronounce "%^*&%$!")? Acronyms (LPC, IVR, TTS, MCU, etc.)? Numbers (34;
2015; 1,093; 192388535; etc.)? Mixed strings ("Please call 555-1212 x342
between the hours of 8:00AM and 4:00PM CST")?
> So I'm thinking of TTS (Text To Speech) technology that generates
> whatever word/sentence at run-time, starting from the associated string.
This always *sounds* like the right approach -- until you look into the issues that it drags in with it! It's really hard to come up with a "good" set of rules that can handle unconstrained input "practically" (abandoning the goal of "properly"!)
> How difficult is it to integrate TTS functionality in an electronic
> product?
Easy *or* hard -- depending on your constraints, goals, resources, expertise, etc.
> How much MCU power does TTS need?
That depends entirely on the constraints you are willing to impose and quality you seek. You can make noises that sound like speech with a 1MHz 8b CPU. If you only relied on it for occasional interactions, you could probably tolerate it. OTOH, you wouldn't want to listen to it for an appreciable period of time!
> Do you know of any TTS libraries that can be embedded in an electronic
> project? Do you know of any free libraries?
Start at CMU's Hephaestus page. You might also want to look into "dialog systems". Also, don't forget to research intelligibility testing (e.g., modified rhyme test, anomalous sentences, etc.) as having speech that isn't intelligible is like having an LED indicator that's "burned out"! Invest some time in understanding the "listening prowess" of your target audience.
> Please note, I don't need real "on-the-fly" TTS. I could spend some time
> generating the short message to play.
Note that if you try to synthesize and *then* play back, you need enough
R/W store to hold the entire message as you are creating it. I.e., so it
has been completely synthesized *prior* to beginning playback. If the user
controls the content of the message, how do you ensure that you have
*enough* space to store it?

   "This is a really long message that would, obviously, require
   considerably more memory to synthesize"

Said another way, how do you handle the case when the user has asked you to
speak something that is too long for your "buffer"?

OTOH, running the synthesizer and playback concurrently allows you to
shrink your buffer (to a size that just handles jitter in the algorithms)
and speak phrases of "unlimited" length.

[Of course, encoding prosody on-the-fly gets trickier]
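The synthesize-while-playing arrangement described above amounts to a small producer/consumer ring buffer sized for algorithm jitter rather than for the whole utterance. A minimal sketch (class name and capacity are illustrative, not from any real TTS library):

```python
from collections import deque

class JitterBuffer:
    """Tiny ring buffer between the synthesizer (producer) and the playback
    ISR (consumer). Sized only to absorb jitter in the synthesis algorithm,
    not to hold the whole message, so phrase length is unbounded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = deque()

    def push(self, sample):
        """Producer side: returns False when full (synthesizer must stall)."""
        if len(self.samples) >= self.capacity:
            return False
        self.samples.append(sample)
        return True

    def pop(self):
        """Consumer side: returns None on underrun (an audible gap)."""
        if not self.samples:
            return None
        return self.samples.popleft()
```

In a real device, push() would run in the synthesis loop and pop() in the sample-rate ISR; the capacity would be tuned to the worst-case burst of synthesis latency so that pop() never underruns mid-word.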
On 07/04/2015 20:27, Don Y wrote:
> On 4/7/2015 5:17 AM, pozz wrote:
>> I'm designing an electronic gadget that will interact with humans
>> through IVR (Interactive Voice Response) and keypad. The user hears the
>> voice and presses some buttons to take actions.
>
> What quality of speech? What level of naturalness? A single voice? Or,
> user-selectable/customizable? Presumably entirely in English? Or, do you
> need to support other languages? Concurrently??
The voice should be as understandable as possible. Of course, greater
quality is better, but I don't need high fidelity. At the moment, I need
only the Italian language and a single voice. Not customizable.
>> Most of the sentences are well known at design time, so I can generate
>
> This is called "limited domain synthesis". Think of TI's "Speak & Spell"
> product ("The cow goes 'mooo'")
>
>> and record them on a computer and save them in memory (PCM, ADPCM, ...).
>> Unfortunately some sentences are customizable by the user, so they are
>> known only at run-time.
>
> And there's the rub!
>
> How does the user indicate the message to be spoken?
I think this isn't important for TTS. The message will be stored in memory
somehow.
> I.e., is it "unconstrained text" that you read from a char[]? Could the
> user opt to command it to speak "I'd rather have this bottle in front of
> me than a frontal lobotomy"? Or, will the sentences/phrases still be
> largely constrained by the application domain: "The date of your last
> withdrawal was..."?
In order to simplify the TTS implementation, I could constrain the
user-customizable text to simple words, not sentences. So I will have some
fixed/constant/"known at compile time" sentences that I can generate and
save with high-quality TTS software on a desktop computer, and some
user-customizable words. For example:

   "Hello Rick, your air conditioner at the first
          ^^^^       ^^^^^^^^^^^^^^^
    floor has just switched off."
    ^^^^^

The words marked with carets (^) are user-customizable.
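This fixed-sentence-plus-slots scheme can be sketched as a tiny template splitter. Everything here is illustrative: `render_tts` is a stand-in for whatever engine is eventually chosen, and in a real device the "canned" spans would be IDs of pre-rendered audio clips rather than text.

```python
import re

# Hypothetical template; {slots} are the caret-marked, user-customizable words.
TEMPLATE = "Hello {name}, your {device} at the {floor} floor has just switched off."

def assemble(template, slots, render_tts):
    """Split the template into canned spans and run-time slots, returning a
    playback plan: canned spans would map to pre-rendered clips, while slot
    contents go through the (stand-in) TTS engine."""
    plan = []
    for piece in re.split(r"(\{\w+\})", template):
        if not piece:
            continue
        m = re.fullmatch(r"\{(\w+)\}", piece)
        if m:
            plan.append(("tts", render_tts(slots[m.group(1)])))
        else:
            plan.append(("canned", piece))
    return plan

plan = assemble(TEMPLATE,
                {"name": "Rick", "device": "air conditioner", "floor": "first"},
                lambda word: "<synth:%s>" % word)
print(plan[0])   # ('canned', 'Hello ')
print(plan[1])   # ('tts', '<synth:Rick>')
```

The playback plan makes the quality seam explicit: every `('tts', ...)` entry is a place where the listener will hear the voice change from the high-quality canned recording to the run-time synthesis.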
> I.e., could you provide a set of ("prerecorded") words that the user can
> then "string together" to form messages? So, the actual message is
> created by the user but built from words that your device already knows
> how to speak?
I already thought about this possibility, but it is a big limitation that I would prefer to avoid. Think of names (Rick in the example): it's impossible to have a full list of prerecorded names in the device.
> What happens if the user specifies a word that is hard to pronounce
> ("Phoenix", "salmon", "Worcester", etc.) using "canned" rules?
The user can hear in advance how his word is pronounced by the device. If
it's too difficult to understand, he can change the word to another, more
understandable one.
> What happens if the user specifies a "word" (sequence of letters) that
> is unpronounceable (Mr. Mxyzptlk)?
The user will change it.
> How do you handle special characters (pronounce "%^*&%$!")? Acronyms
> (LPC, IVR, TTS, MCU, etc.)? Numbers (34; 2015; 1,093; 192388535; etc.)?
> Mixed strings ("Please call 555-1212 x342 between the hours of 8:00AM
> and 4:00PM CST")?
Of course numbers must be well pronounced, for example for some settings
(only small integers, in the range 0-100) and for times. But the sentences
where the numbers are used are generated at compile time, so I can avoid
8:00AM or 4:00PM CST. The user will never create sentences like those.
[...]

> Note that if you try to synthesize and *then* play back, you need enough
> R/W store to hold the entire message as you are creating it. I.e., so it
> has been completely synthesized *prior* to beginning playback. If the
> user controls the content of the message, how do you ensure that you have
> *enough* space to store it?
>
>    "This is a really long message that would, obviously, require
>    considerably more memory to synthesize"
>
> Said another way, how do you handle the case when the user has asked you
> to speak something that is too long for your "buffer"?
Of course, the user has a limited space to write words/sentences.
On 4/8/2015 12:36 AM, pozz wrote:
[...]

>> What quality of speech? What level of naturalness? A single voice? Or,
>> user-selectable/customizable? Presumably entirely in English? Or, do
>> you need to support other languages? Concurrently??
>
> The voice should be as understandable as possible. Of course, greater
> quality is better, but I don't need high fidelity.
<grin> That really doesn't say much :-/

Have you listened to many synthetic voices? They range from *very* natural
to "ick".

Given that you appear to be pasting "compile-time" speech with "run-time"
speech, are you willing to tolerate the sudden "voice/quality" change that
will be apparent where you have "filled in the blanks" with run-time
utterances? I.e., you can have very natural compile-time speech that is
laced with potentially *crude* run-time phrases.

The user will obviously know where the "filled in blanks" occur in the
audio output (which may be acceptable to you). What might *not* be as
acceptable is the change in *quality*/intelligibility that results.
> At the moment, I need only Italian language and single voice. Not customizable.
OK. Note that I'm speaking from an English language perspective. No idea how "uniform" the ruleset might be for Italian... (English is full of exceptions)
[...]

>> How does the user indicate the message to be spoken?
>
> I think this isn't important for TTS. The message will be stored in
> memory somehow.
Sorry, perhaps my question wasn't clear enough.

Does the user type in (somehow) a series of characters? Does he choose
from among preselected words/phrases? Etc.

I.e., I am trying to ascertain how constrained/unconstrained the input
will be. With a keyboard, a user could potentially type:
    "supercalifragilisticexpialidocious"
OTOH, a user selecting phrases from preexisting choices (even if you
actually synthesize the voice on-the-fly) has more limited choices:
    at the first floor
    at the second floor
    at the third floor
    etc.
>> I.e., is it "unconstrained text" that you read from a char[]? Could
>> the user opt to command it to speak "I'd rather have this bottle in
>> front of me than a frontal lobotomy"? Or, will the sentences/phrases
>> still be largely constrained by the application domain: "The date of
>> your last withdrawal was..."?
>
> In order to simplify the TTS implementation, I could constrain the
> user-customizable text to simple words, not sentences.
That essentially eliminates the need for any prosody controls (as portions
of the "sentence" will have been predefined and, thus, have their own
prosody imposed irrespective of the "blanks filled in").

But, can the user specify *any* word? "smartphone"? "technology"?
"disillusionment"? "apartheid"?
> So I will have some fixed/constant/"known at compile time" sentences
> that I can generate and save with high-quality TTS software on a desktop
> computer, and some user-customizable words. For example:
>
>    "Hello Rick, your air conditioner at the first
>           ^^^^       ^^^^^^^^^^^^^^^
>     floor has just switched off."
>     ^^^^^
>
> The words marked with carets (^) are user-customizable.
So:
    on (at) the first floor
    on (at) the second floor
    on (at) the third floor
    in the penthouse
    in the basement
    in the garage
Or, perhaps:
    in the basement of your clothing store
    for the dog kennel
etc?
>> I.e., could you provide a set of ("prerecorded") words that the user
>> can then "string together" to form messages? So, the actual message is
>> created by the user but built from words that your device already knows
>> how to speak?
>
> I already thought about this possibility, but it is a big limitation
> that I would prefer to avoid. Think of names (Rick in the example): it's
> impossible to have a full list of prerecorded names in the device.
Think, again, about that. Ignore, for the moment, proper names/nouns and,
instead, concentrate on just *words*. You can store a rather large
dictionary of words and their (encoded!) pronunciations if you can
eliminate the code and the "rule sets" that determine how to convert
"Rick" into /R/ /IH/ /K/. Furthermore, you could compress this
"dictionary" by noting that you need only represent upper (or lower) -case
alphas (RICK, rick, RiCk, etc. all result in the same pronunciation) and
the corresponding sounds into which the "text" will be mapped. E.g., ~5
bits for each character in the "name" and ~6 bits for each sound.

So, "Rick" requires 38 bits (about 5 bytes) to encode (along with its
pronunciation). At run-time, you need only convert the "sound codes" into
actual "audio waveforms" -- instead of having to convert the textual
representation of the name into the sound codes *and* then into waveforms.

[I have no idea how large your vocabulary will be so no idea how large the
dictionary would be.]

Depending on the level of expertise of your users -- and the hoops through
which they are willing to jump -- you could also direct them to specify
the phrases *using* those sound codes. I.e., force them to do the "letter
to sound" conversion in their heads -- possibly aided by allowing them to
easily replay what they have just "typed": "Hmmm... that 'i' sound needs
to be shorter. Let me try..."

Given the variation in how proper names are pronounced, this may well be
the best approach. E.g., my (english) ruleset would butcher "Alfio",
"Gabriella", etc. I'm not sure it would even handle "ciao" properly!
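The 38-bit arithmetic for "Rick" can be checked directly. The 5-bit/6-bit field widths come from the post above; the packing routine itself is just an illustrative sketch of one way such an entry could be laid out.

```python
def entry_bits(word, phonemes, char_bits=5, sound_bits=6):
    """Size of one dictionary entry under the packing scheme above:
    case-folded letters at 5 bits each, sound codes at 6 bits each."""
    return len(word) * char_bits + len(phonemes) * sound_bits

def pack(word, phonemes, sound_index):
    """Pack an entry into a single integer: letters A..Z as 1..26 in 5-bit
    fields, then phoneme indices in 6-bit fields. Purely illustrative."""
    value = 0
    for ch in word.upper():
        value = (value << 5) | (ord(ch) - ord('A') + 1)
    for p in phonemes:
        value = (value << 6) | sound_index[p]
    return value

bits = entry_bits("Rick", ["R", "IH", "K"])   # 4 letters + 3 sounds
print(bits, (bits + 7) // 8)                  # 38 bits -> 5 bytes
```

So four 5-bit letters plus three 6-bit sound codes is 20 + 18 = 38 bits, which rounds up to the 5 bytes quoted above.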
>> What happens if the user specifies a word that is hard to pronounce
>> ("Phoenix", "salmon", "Worcester", etc.) using "canned" rules?
>
> The user can hear in advance how his word is pronounced by the device.
OK.
> If it's too difficult to understand, he can change the word to another,
> more understandable one.
But, you still have to have rules that allow *you* to come up with a
pronunciation. And, the user needs a way of coercing the device to
pronounce the word the way he *wants* it to be pronounced.

Does "read" rhyme with "tweed" or "bed"? I.e., a user wanting it to be
pronounced in a particular way would have to misspell it "reed" or "red"
(assuming the device picked the "wrong" pronunciation).

Allowing the user to enter "sound codes" avoids that problem.
>> What happens if the user specifies a "word" (sequence of letters) that
>> is unpronounceable (Mr. Mxyzptlk)?
>
> The user will change it.
>
>> How do you handle special characters (pronounce "%^*&%$!")? Acronyms
>> (LPC, IVR, TTS, MCU, etc.)? Numbers (34; 2015; 1,093; 192388535;
>> etc.)? Mixed strings ("Please call 555-1212 x342 between the hours of
>> 8:00AM and 4:00PM CST")?
>
> Of course numbers must be well pronounced, for example for some settings
> (only small integers, in the range 0-100) and for times. But the
> sentences where the numbers are used are generated at compile time, so I
> can avoid 8:00AM or 4:00PM CST. The user will never create sentences
> like those.
So, you need rules that allow <digit>[<digit>] to be pronounced one way while <digit>[<digit>]:<digit><digit> is pronounced another. Presumably, the mechanism by which the user specifies the "words" that he wants spoken will disallow any digits in that "text specification"? (Likewise, punctuation and other special symbols?)
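The two digit rules described above can be sketched as a tiny token classifier. The patterns are illustrative only; a real Italian number/time reader would need more cases (ranges above 99, leading zeros, and so on).

```python
import re

# A bare one- or two-digit number is read as a cardinal; dd:dd is read as
# a time. User-entered custom text is assumed to disallow digits entirely.
NUMBER = re.compile(r"\d{1,2}")       # small-integer settings (0-99 here)
TIME = re.compile(r"\d{1,2}:\d{2}")   # 8:00, 16:30, ...

def classify(token):
    if TIME.fullmatch(token):
        return "time"        # e.g. "8:00" -> "eight o'clock"
    if NUMBER.fullmatch(token):
        return "cardinal"    # e.g. "34" -> "thirty-four"
    return "reject"          # anything else is disallowed in user text

print(classify("8:00"))   # time
print(classify("34"))     # cardinal
print(classify("x342"))   # reject
```

The order of the checks matters: "8:00" would otherwise never reach the time rule, since its leading digits also look like the start of a cardinal.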
[...]

>> Note that if you try to synthesize and *then* play back, you need
>> enough R/W store to hold the entire message as you are creating it.
>> [...] Said another way, how do you handle the case when the user has
>> asked you to speak something that is too long for your "buffer"?
>
> Of course, the user has a limited space to write words/sentences.
I'm not talking about specifying the text. Rather, I am addressing your
comment about "spend some time to generate the short message to play".

I.e., starting with "text", you'd have to convert the graphemes to
phonemes; then, synthesize the audio waveform (however your output device
expects to be fed) from these sound codes and prosodic envelope. The bulk
of the "work" (CPU cycles) is in the creation of the waveforms. If you
can't "keep up" with real time, then you need to be able to buffer the
waveform while you create it -- and before you "utter" it. Yet, once you
*start* to "speak" (i.e., push signal out the speaker), you probably can't
arbitrarily stop/pause without affecting intelligibility (i.e., you'd have
to make sure you only paused at word boundaries; never in the middle of a
word). So, you need a buffer for all that "analog data".

The number of characters in the "input word" has little to do with the
duration of the utterance that will ultimately result. E.g., the /IH/
vowel sound (rIck) is probably half the duration of the /AY/ vowel sound
(bIte). Note how long your "mouth is engaged" saying the two words. Or,
"ewe"/"you" vs. "hit". (Now you see why we call them "short" and "long"
vowels! :> )

If you (your users) can tolerate the effort of "specifying sounds" instead
of "specifying letters", it might be best to let them specify the text in
that manner. At the very least, you could run the text-to-sound portion
of the algorithm as soon as they have typed in the desired text and store
the *sound* codes at that time -- to eliminate the effort of doing the
conversion at "run time" (i.e., when the actual spoken output is
*required*).

Before you go too far down this road, you may want to explore some of the
on-line synthesizers to get a feel for how robust they are, the quality of
their voices, etc. (many are diphone based; you can actually make the
synthesizer sound like a particular -- REAL -- *person*!)
Then, explore some of the "cheaper" approaches (i.e., those that you are more likely to employ in your implementation). Get a feel for how the costs change -- as well as the "quality". At the very least, you'll get an appreciation for how much processing we automatically do when handling "combinations of characters" in particular, specific contexts.
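The point above that character count doesn't predict utterance duration (and hence buffer size) can be made concrete. The per-phoneme durations below are invented purely for illustration; real values depend on the voice, speaking rate, and phonetic context.

```python
# Invented, purely illustrative per-phoneme durations in milliseconds.
DURATION_MS = {"B": 50, "IH": 70, "AY": 160, "T": 70, "R": 60, "K": 80}

def utterance_ms(phonemes):
    """Duration of an utterance given its sound codes."""
    return sum(DURATION_MS[p] for p in phonemes)

def waveform_bytes(phonemes, rate=8000, bytes_per_sample=1):
    """R/W store needed to hold the synthesized waveform before playback
    (the synthesize-then-play case), at 8-bit/8 kHz PCM."""
    return utterance_ms(phonemes) * rate * bytes_per_sample // 1000

print(utterance_ms(["B", "IH", "T"]))    # "bit":  190 ms
print(utterance_ms(["B", "AY", "T"]))    # "bite": 280 ms -- /AY/ dominates
print(waveform_bytes(["B", "AY", "T"]))  # 2240 bytes for ~0.28 s of audio
```

Note that "bit" and "bite" have the same number of phonemes (and nearly the same spelling) yet differ markedly in duration, which is why buffer sizing has to be done from the sound codes, not from the character count.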
On 08/04/2015 11:48, Don Y wrote:
[...]

> Given that you appear to be pasting "compile-time" speech with
> "run-time" speech, are you willing to tolerate the sudden
> "voice/quality" change that will be apparent where you have "filled in
> the blanks" with run-time utterances? I.e., you can have very natural
> compile-time speech that is laced with potentially *crude* run-time
> phrases.
You agree with me: high quality for "compile-time" sentences *and* for
"run-time" sentences is better. But I don't need it. The device is for
the low-cost market, so the user won't have too many expectations.
> The user will obviously know where the "filled in blanks" occur in the
> audio output (which may be acceptable to you). What might *not* be as
> acceptable is the change in *quality*/intelligibility that results.
It will be acceptable. The change in "quality" corresponds exactly to the
customizable words, so the user understands what is happening.
[...]

> I.e., I am trying to ascertain how constrained/unconstrained the input
> will be. With a keyboard, a user could potentially type:
>     "supercalifragilisticexpialidocious"
> OTOH, a user selecting phrases from preexisting choices (even if you
> actually synthesize the voice on-the-fly) has more limited choices:
>     at the first floor
>     at the second floor
>     at the third floor
>     etc.
The user can type any sequence of characters, but he is encouraged to play
it back and check the result. If it is too hard to understand, he can
change the words.
[...]

> That essentially eliminates the need for any prosody controls (as
> portions of the "sentence" will have been predefined and, thus, have
> their own prosody imposed irrespective of the "blanks filled in").
>
> But, can the user specify *any* word? "smartphone"? "technology"?
> "disillusionment"? "apartheid"?
Yes.
[...]

> So:
>     on (at) the first floor
>     on (at) the second floor
>     on (at) the third floor
>     in the penthouse
>     in the basement
>     in the garage
> Or, perhaps:
>     in the basement of your clothing store
>     for the dog kennel
> etc?
The user can write anything, but it is reasonable to expect he will write
simple words.
[...]

> Think, again, about that. Ignore, for the moment, proper names/nouns
> and, instead, concentrate on just *words*. You can store a rather large
> dictionary of words and their (encoded!) pronunciations if you can
> eliminate the code and the "rule sets" that determine how to convert
> "Rick" into /R/ /IH/ /K/.
>
> [I have no idea how large your vocabulary will be so no idea how large
> the dictionary would be.]
I'm sure the user will want to use words that aren't present in the
dictionary. I'd prefer to avoid this approach.
> Depending on the level of expertise of your users -- and the hoops
> through which they are willing to jump -- you could also direct them to
> specify the phrases *using* those sound codes. I.e., force them to do
> the "letter to sound" conversion in their heads -- possibly aided by
> allowing them to easily replay what they have just "typed": "Hmmm...
> that 'i' sound needs to be shorter. Let me try..."
>
> Given the variation in how proper names are pronounced, this may well be
> the best approach. E.g., my (english) ruleset would butcher "Alfio",
> "Gabriella", etc. I'm not sure it would even handle "ciao" properly!
No, the user will not have this kind of expertise.
[...]

> Does "read" rhyme with "tweed" or "bed"? I.e., a user wanting it to be
> pronounced in a particular way would have to misspell it "reed" or "red"
> (assuming the device picked the "wrong" pronunciation).
>
> Allowing the user to enter "sound codes" avoids that problem.
I understand, but it's difficult to explain to my users. It is simpler to tell them to misspell the word so that the final result is close to the sound they want to hear.
>>> What happens if the user specifies a "word" (sequence of letters) >>> that is unpronounceable (Mr. Mxyzptlk)? >> >> The user will change it. >> >>> How do you handle special characters (pronounce "%^*&%$!")? >>> Acronyms (LPC, IVR, TTS, MCU, etc.)? Numbers (34; 2015; 1,093; >>> 192388535; etc.)? Mixed strings ("Please call 555-1212 x342 >>> between the hours of 8:00AM and 4:00PM CST")? >> >> Of course numbers must be well pronunced, for example for some >> settings (only >> small integers, in the range 0-100) and for times. But the sentences >> where the >> numbers are used are generated at compile time, so I can avoid 8:00AM >> or 4:00PM >> CST. >> The user will never creates sentences like those. > > So, you need rules that allow <digit>[<digit>] to be pronounced one way > while <digit>[<digit>]:<digit><digit> is pronounced another. > > Presumably, the mechanism by which the user specifies the "words" that > he wants spoken will disallow any digits in that "text specification"? > > (Likewise, punctuation and other special symbols?)
The user will never need to customize text containing numbers or times. Numbers are handled at compile time.
>>>> So I'm thinking to TTS (Text To Speech) technology that generates >>>> whatever >>>> word/sentence at run-time, starting from the associated string. >>> >>> This always *sounds* like the right approach -- until you look into the >>> issues that it drags in with it! It's really hard to come up with a >>> "good" set of rules that can handle unconstrained input "practically" >>> (abandoning the goal of "properly"!) >>> >>>> How difficult is to integrate a TTS functionality in an electronic >>>> product? >>> >>> Easy *or* hard -- depending on your constraints, goals, resources, >>> expertise, >>> etc. >>> >>>> What is the MCU power that TTS needs? >>> >>> That depends entirely on the constraints you are willing to impose and >>> quality you seek. You can make noises that sound like speech with a >>> 1MHz 8b CPU. If you only relied on it for occasional interactions, you >>> could probably tolerate it. OTOH, you wouldn't want to listen to it >>> for an appreciable period of time! >>> >>>> Do you know some TTS libraries that can >>>> be embeddable in an electronic project? Do you know of some free >>>> libraries? >>> >>> Start at CMU's Hephaestus page. You might also want to look into >>> "dialog >>> systems". Also, don't forget to research intelligibility testing (e.g., >>> modified rhyme test, anomalous sentences, etc.) as having speech that >>> isn't intelligible is like having an LED indicator that's "burned out"! >>> >>> Invest some time in understanding the "listening prowess" of your target >>> audience. >>> >>>> Please note, I don't need a real "on-the-fly" TTS. I could spend some >>>> time to >>>> generates the short message to play. >>> >>> Note that if you try to synthesize and *then* play back, you need enough >>> R/W store to hold the entire message as you are creating it. I.e., so >>> it has been completely synthesized *prior* to beginning playback. 
>>> If the user controls the content of the message, how do you ensure that >>> you have *enough* space to store it? >>> "This is a really long message that would, obviously, require >>> considerably >>> more memory to synthesize" >>> Said another way, how do you handle the case when the user has asked >>> you to >>> speak something that is too long for your "buffer"? >> >> Of course, the user has a limited space to write words/sentences. > > I'm not talking about specifying the text. Rather, I am addressing your > comment about "spend some time to generates the short message to play". > I.e., starting with "text", you'd have to convert the graphemes to > phonemes; then, synthesize the audio waveform (however your output > device expects to be fed) from these sound codes and prosodic envelope. > > The bulk of the "work" (CPU cycles) is in the creation of the waveforms. > If you can't "keep up" with real time, then you need to be able to buffer > the waveform while you create it -- and before you "utter" it. Yet, once > you *start* to "speak" (i.e., push signal out the speaker), you probably > can't arbitrarily stop/pause without affecting intelligibility (i.e., you'd > have to make sure you only paused at word boundaries; never in the middle > of a word) > > So, you need a buffer for all that "analog data". The number of characters > in the "input word" has little to do with the duration of the utterance > that will ultimately result. > > E.g., the /IH/ vowel sound (rIck) is probably half the duration of > the /AY/ vowel sound (bIte). Note how long your "mouth is engaged" > saying the two words. Or, "ewe"/"you" vs. "hit". (now you see why > we call them "short" and "long" vowels! :> )
I understand, but you agree that a short text (a small number of characters) corresponds to a short waveform. I can calculate a worst-case duration for a given number of characters.
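That worst-case calculation might look like the following; the 250 ms-per-character figure is a pessimistic assumption, and the 8 kHz/8-bit PCM format merely matches the MSP430 example earlier in the thread:

```python
# Back-of-the-envelope worst case: 8 kHz, 8-bit mono PCM, and an
# assumed maximum of 250 ms of audio per input character (long vowels).

SAMPLE_RATE = 8000       # samples per second
BYTES_PER_SAMPLE = 1     # 8-bit PCM
MAX_SEC_PER_CHAR = 0.25  # pessimistic assumption

def worst_case_buffer_bytes(n_chars):
    """RAM needed to hold the fully synthesized waveform for n_chars of text."""
    return int(n_chars * MAX_SEC_PER_CHAR * SAMPLE_RATE * BYTES_PER_SAMPLE)

print(worst_case_buffer_bytes(20))  # a 20-char phrase -> 40000 bytes
```

As Don notes below, the real relationship between character count and utterance duration is looser than this, so any such bound has to be generous.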
> If you (your users) can tolerate the effort of "specifying sounds" > instead of "specifying letters", it might be best to let them > specify the text in that manner. > > At the very least, you could run the text-to-sound portion of the > algorithm as soon as they have typed in the desired text and store > the *sound* codes at that time -- to eliminate the effort of > doing the conversion at "run time" (i.e., when the actual spoken > output is *required*). > > Before you go too far down this road, you may want to explore some of the > on-line synthesizers to get a feel for how robust they are, the quality > of their voices, etc. (many are diphone based; you can actually make > the synthesizer sound like a particular -- REAL -- *person*!) > > Then, explore some of the "cheaper" approaches (i.e., those that you are > more likely to employ in your implementation). Get a feel for how > the costs change -- as well as the "quality". > > At the very least, you'll get an appreciation for how much processing > we automatically do when handling "combinations of characters" in > particular, > specific contexts.
Yes, I'm trying to understand the text-to-speech world, but it seems too difficult for me. I hoped it was possible to embed some ready-to-use TTS library (free of charge or paid) as source code or object files, without being a TTS expert. It seems this isn't the case. Anyway, thank you very much for your explanations and time.
> >>> OTOH, running the synthesizer and playback concurrently allows you to >>> shrink your buffer (to a size that just handles jitter in the >>> algorithms) >>> and speak phrases of "unlimited" length. >>> >>> [Of course, encoding prosody on-the-fly gets trickier]
On 4/8/2015 6:58 AM, pozz wrote:

>>> The voice should be as understandable as possible. Of course, greater >>> quality >>> is better, but I don't need high fidelity quality. >> >> <grin> That really doesn't say much :-/ >> >> Have you listened to many synthetic voices? They range from *very* natural >> to "ick". >> >> Given that you appear to be pasting "compile-time" speech with "run-time" >> speech, are you willing to tolerate the sudden "voice/quality" change that >> will be apparent where you have "filled in the blanks" with run-time >> utterances? I.e., you can have very natural compile-time speech that >> is laced with potentially *crude* run-time phrases. > > You agree with me: high quality for "compile-time" sentences *and* for > "run-time" senteces is better. But I don't need it. The device is for > low-cost market, so the user won't have too much expectations.
Only *you* can comment on your market and what it will accept. I'm just pointing out that there *will* be a very noticeable "pieced together" feel (sound) to it. Have you also considered just letting the user *record* his messages (i.e., using his own voice via a microphone *or* "downloading" them into the device from a "PC")?
>> The user will obviously know where the "filled in blanks" occur in the >> audio output (which may be acceptable to you). What might *not* be as >> acceptable is the change in *quality*/intelligibility that results. > > It will be acceptable. The change in "quality" corresponds exactly to the > customizable words. So the user understands what happens.
>>> At the moment, I need only Italian language and single voice. Not >>> customizable.
> The user can type any sequence of chars, but he is encouraged to play and check > the result. If it is too noisy, he can change the words.
OK. In my application, the user has no "preview" capability. So, he has to be able to recognize what the device (as a proxy) is trying to "tell" him regardless of the complexity of that (unconstrained) input. As such, I have controls that allow him to replay messages, "spell" individual words/numbers, change the characteristics of the speech (pitch, rate, etc.) to be more intelligible, etc.
>> But, can the user specify *any* word? "smartphone"? "technology"? >> "disillusionment"? "apartheid"? > > Yes.
> The user can write everything, but it is reasonable he writes simple words.
>> So, "Rick" requires 38 bits (about 5 bytes) to encode (alond with its >> pronunciation). At run-time, you need only convert the "sound codes" >> into actual "audio waveforms" -- instead of having to convert the >> textual representation of the name into the sound codes *and* then >> into waveforms. >> >> [I have no idea how large your vocabulary will be so no idea how large >> the dictionary would be.] > > I'm sure the user will want to use a words that isn't present in the > dictionary. I'd prefer to avoid this way.
Then your TTS rules will need to address every potential case. Note, however, that if your rules are *intuitive*, users will quickly learn how to misspell the text in order to get an acceptable pronunciation. E.g., in English, the only (phonetic) use for the letter 'C' in the input text is to represent the "CH" sound. All other C's can be replaced with 'S' or 'K'. You can probably also eliminate a lot of the subtle differences in sounds that would promote more naturalness. E.g., (in English), the 'N' sound in "Next" is subtly different from that in "buttoN"; likewise, the 'L' in "Let" vs. "piLL"; the 'R' in "Ready" vs. "tiRe"; the 'W' in "Which" vs. "Wet"; etc. Find a word-list of "common" words (in Italian) and prepare to feed them to your TTS to see how good/bad the resulting pronunciations are. And, for those that are less than ideal, see if you can misspell them in ways that make their pronunciations more acceptable. Finally, look at those misspellings and see if a user could readily come to the same sort of realization (*if* the pronunciation of the proper spelling was "bad enough" to warrant it)
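As an English-centric illustration of how simple such substitutions can be (this is a toy respelling, not a real grapheme-to-phoneme ruleset):

```python
import re

# Toy English-centric respelling: keep 'ch' as a unit, turn soft 'c'
# (before e/i/y) into 's', and every remaining 'c' into 'k'.
def respell(text):
    t = text.lower()
    t = t.replace("ch", "\x00")          # protect the 'ch' digraph
    t = re.sub(r"c(?=[eiy])", "s", t)    # soft c -> 's'
    t = t.replace("c", "k")              # hard c -> 'k'
    return t.replace("\x00", "ch")

print(respell("cat"), respell("city"), respell("chip"))  # kat sity chip
```

A motivated user who sees these substitutions in action could learn to apply them by hand when the default pronunciation of a proper spelling is unacceptable.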
>> Depending on the level of expertise of your users -- and the hoops through >> which they are willing to jump -- you could also direct them to specify >> the phrases *using* those sound codes. I.e., force them to do the >> "letter to sound" conversion in their heads -- possibly aided by allowing >> them to easily replay what they have just "typed": >> "Hmmm... that 'i' sound needs to be shorter. Let me try..." >> >> Given the variation in how proper names are pronounced, this may well be >> the best approach. E.g., my (english) ruleset would butcher "Alfio", >> "Gabriella", etc. I'm not sure it would even handle "ciao" properly! > > No, the user will not have this kind of expertise.
See above. Sorry, I can't comment on appropriate "bastardizations" for Italian. But, in English, a "motivated user" can usually come up with ways to coax a TTS into "uttering the sounds" that he'd like to hear. Important tip: be sure to encode some basic punctuation. People quickly learn that they can influence "playback" if they insert a ',' to force a small pause at a certain point in the text; a '.' for a longer pause; etc. If you also tried to encode prosody (doubtful given your description), things like '!' and '?' could be artificially injected to influence that.
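One way to encode that basic punctuation is to map each mark to a pause injected into the sound-code stream; the durations below are assumptions for illustration, not measured values:

```python
# Illustrative only: map punctuation in the user's text to pauses (ms)
# to be injected into the sound-code stream.  Durations are assumptions.
PAUSE_MS = {",": 150, ";": 250, ".": 400, "?": 400, "!": 400}

def pause_points(text):
    """Return (index, pause_ms) for each punctuation mark in the text."""
    return [(i, PAUSE_MS[ch]) for i, ch in enumerate(text) if ch in PAUSE_MS]

print(pause_points("Hello, Rick."))  # [(5, 150), (11, 400)]
```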
>>> If it's too difficult to understand, he has the possibility to change >>> the words >>> with some other more understandable. >> >> But, you still have to have rules that allow *you* to come up with a >> pronunciation. And, the user needs a way of coercing the device to >> pronounce the word the way he *wants* it to be pronounced. >> >> Does "read" rhyme with "tweed" or "bed"? I.e., a user wanting it >> to be pronounced in a particular way would have to misspell it "reed" >> or "red" (assuming the device picked the "wrong" pronunciation). >> >> Allowing the user to enter "sound codes" avoids that problem. > > I understand, but it's difficult to explain to my users. It is simpler to > explain him misspelling the word in such a way the final result is similar to > the sound he wants to hear.
The result will be the same -- *if* your ruleset is simple/obvious. E.g., "'c' only makes sense in 'ch', else 'k'". E.g., I would encode "ciao" as "chow" to get the pronunciation I (English) sought.
>> So, you need rules that allow <digit>[<digit>] to be pronounced one way >> while <digit>[<digit>]:<digit><digit> is pronounced another. >> >> Presumably, the mechanism by which the user specifies the "words" that >> he wants spoken will disallow any digits in that "text specification"? >> >> (Likewise, punctuation and other special symbols?) > > The user will never needs to customize texts with numbers or times. Numbers are > managed at compile time.
So, he'd never say "The air conditioner in room 307 has just switched off"? Or, if he wanted to do so, he would be expected to write it as "The air conditioner in room three oh seven has just switched off".
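The digit-by-digit reading ("307" as "three oh seven") is mechanical enough to sketch; English words are used here for illustration, and an Italian table would obviously differ:

```python
# Minimal sketch of digit-by-digit expansion ("307" -> "three oh seven").
DIGIT_WORDS = ["oh", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def speak_digits(number):
    """Expand each decimal digit of `number` into its spoken word."""
    return " ".join(DIGIT_WORDS[int(d)] for d in str(number))

print(speak_digits(307))  # three oh seven
```

Note this is only one of the rules needed: "307" as a cardinal ("three hundred seven") or "8:00" as a time each call for a different expansion, which is Don's point above.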
>>>> Start at CMU's Hephaestus page. You might also want to look into >>>> "dialog >>>> systems". Also, don't forget to research intelligibility testing (e.g., >>>> modified rhyme test, anomalous sentences, etc.) as having speech that >>>> isn't intelligible is like having an LED indicator that's "burned out"! >>>> >>>> Invest some time in understanding the "listening prowess" of your target >>>> audience.
>>> Of course, the user has a limited space to write words/sentences. >> >> I'm not talking about specifying the text. Rather, I am addressing your >> comment about "spend some time to generates the short message to play". >> I.e., starting with "text", you'd have to convert the graphemes to >> phonemes; then, synthesize the audio waveform (however your output >> device expects to be fed) from these sound codes and prosodic envelope. >> >> The bulk of the "work" (CPU cycles) is in the creation of the waveforms. >> If you can't "keep up" with real time, then you need to be able to buffer >> the waveform while you create it -- and before you "utter" it. Yet, once >> you *start* to "speak" (i.e., push signal out the speaker), you probably >> can't arbitrarily stop/pause without affecting intelligibility (i.e., you'd >> have to make sure you only paused at word boundaries; never in the middle >> of a word) >> >> So, you need a buffer for all that "analog data". The number of characters >> in the "input word" has little to do with the duration of the utterance >> that will ultimately result. >> >> E.g., the /IH/ vowel sound (rIck) is probably half the duration of >> the /AY/ vowel sound (bIte). Note how long your "mouth is engaged" >> saying the two words. Or, "ewe"/"you" vs. "hit". (now you see why >> we call them "short" and "long" vowels! :> ) > > I understand, but you agree with me that a short text (a small number of chars) > corresponds to a short waveform duration. I can calculate a worst case for a > certain number of chars.
I think you will find this isn't as obvious/easy as you expect. You may find it easier to just "run a tighter loop" -- possibly dismissing other activities at the time -- and synthesize on the fly. This can dramatically shrink your memory (buffer) requirements. If you are willing to accept "crude" for the "filled in blanks", then a lot of processing can be skipped (e.g., prosody -- just rattle those things off in a monotone)
>> If you (your users) can tolerate the effort of "specifying sounds" >> instead of "specifying letters", it might be best to let them >> specify the text in that manner. >> >> At the very least, you could run the text-to-sound portion of the >> algorithm as soon as they have typed in the desired text and store >> the *sound* codes at that time -- to eliminate the effort of >> doing the conversion at "run time" (i.e., when the actual spoken >> output is *required*). >> >> Before you go too far down this road, you may want to explore some of the >> on-line synthesizers to get a feel for how robust they are, the quality >> of their voices, etc. (many are diphone based; you can actually make >> the synthesizer sound like a particular -- REAL -- *person*!) >> >> Then, explore some of the "cheaper" approaches (i.e., those that you are >> more likely to employ in your implementation). Get a feel for how >> the costs change -- as well as the "quality". >> >> At the very least, you'll get an appreciation for how much processing >> we automatically do when handling "combinations of characters" in >> particular, specific contexts. > > Yes, I'm trying to understand text-to-speech world, but it seems too difficult > for me.
It's not easy. Speech (and language) has lots of subtleties that we take for granted in our daily life. Why "nickEL", yet "pickLE"? Hopefully, Italian (as a language) is "more regular" than English. (ISTR some of the Scandinavian languages are very "regular")
> I hoped it was possible to embed some ready-to-use TTS libraries (free > of charge or after payment) as source codes or object files, without being a > TTS expert. It seems, this isn't the case.
You can play with flite but I think you will find it too large for your needs. There are several other "open" TTS implementations (though not sure how well suited to Italian their rulesets would be) but, most suffer from the same "lack of concern for resources" that you might encounter in a deeply embedded product. I'm starting my *third* version (different approach than either of the first two) at a "lean TTS" and suspect I will be disappointed with that, as well (primarily due to the unconstrained vocabulary consequences -- it's always easy to come up with "typical" things that are difficult to handle WITHOUT making the algorithms incredibly complex) Good luck! --don
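For the "pre-render offline" variant discussed in this thread, flite can also be driven from its command-line tool to produce .wav files ahead of time. A sketch of wrapping it from a build script (assumes the `flite` binary is installed and on the PATH; `-t` takes the literal text and `-o` names the output file):

```python
import subprocess

def flite_cmd(text, wav_path):
    """Build the flite command line: flite -t <text> -o <output.wav>."""
    return ["flite", "-t", text, "-o", wav_path]

def render(text, wav_path):
    # Requires the flite binary on the PATH; raises on failure.
    subprocess.run(flite_cmd(text, wav_path), check=True)

print(flite_cmd("Rick", "rick.wav"))
```

The resulting PCM could then be stored on the device alongside the compile-time phrases, at the cost of requiring a PC in the loop whenever the user changes a custom word.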
> Anyway, Thank you very much for your explanations and time. > >> >>>> OTOH, running the synthesizer and playback concurrently allows you to >>>> shrink your buffer (to a size that just handles jitter in the >>>> algorithms) >>>> and speak phrases of "unlimited" length. >>>> >>>> [Of course, encoding prosody on-the-fly gets trickier] >
On Wed, 08 Apr 2015 15:58:27 +0200, pozz <pozzugno@gmail.com> wrote:

>Il 08/04/2015 11:48, Don Y ha scritto: >> On 4/8/2015 12:36 AM, pozz wrote: >>> Il 07/04/2015 20:27, Don Y ha scritto: >>>> On 4/7/2015 5:17 AM, pozz wrote: >>>>> I'm designing an electronic gadget that will interact with humans >>>>> through IVR >>>>> (Interactive Voice Response) and keypad. The user hears the voice and >>>>> press >>>>> some buttons to take some actions. >>>> >>>> What quality of speech? What level of naturalness? A single voice? >>>> Or, >>>> user-selectable/customizable? Presumably entirely in English? Or, >>>> do you >>>> need to support other languages? Concurrently?? >>> >>> The voice should be as understandable as possible. Of course, greater >>> quality >>> is better, but I don't need high fidelity quality. >> >> <grin> That really doesn't say much :-/ >> >> Have you listened to many synthetic voices? They range from *very* natural >> to "ick". >> >> Given that you appear to be pasting "compile-time" speech with "run-time" >> speech, are you willing to tolerate the sudden "voice/quality" change that >> will be apparent where you have "filled in the blanks" with run-time >> utterances? I.e., you can have very natural compile-time speech that >> is laced with potentially *crude* run-time phrases. > >You agree with me: high quality for "compile-time" sentences *and* for >"run-time" senteces is better. But I don't need it. The device is for >low-cost market, so the user won't have too much expectations. > > >> The user will obviously know where the "filled in blanks" occur in the >> audio output (which may be acceptable to you). What might *not* be as >> acceptable is the change in *quality*/intelligibility that results. > >It will be acceptable. The change in "quality" corresponds exactly to >the customizable words. So the user understands what happens. > > >>> At the moment, I need only Italian language and single voice. Not >>> customizable. >> >> OK. Note that I'm speaking from an English language perspective. 
No idea >> how "uniform" the ruleset might be for Italian... (English is full of >> exceptions) >> >>>>> Most of the sentences are well known at design time, so I can think to >>>>> generate >>>> >>>> This is called "limited domain synthesis". Think of TI's "Speak 'n' >>>> Spell" >>>> product ("The cow goes 'mooo'") >>>> >>>>> and record them on the computer and save them on a memory (PCM, ADPCM, >>>>> ...). >>>>> Unfortunately some sentences are customizable by the user, so they are >>>>> known >>>>> only at run-time. >>>> >>>> And there's the rub! >>>> >>>> How does the user indicate the message to be spoken? >>> >>> I think this isn't important for TTS. The message will be stored in >>> memory >>> someway. >> >> Sorry, perhaps my question wasn't clear enough. >> >> Does the user type in (somehow) a series of characters? Does he choose >> from >> among preselected words/phrases? etc. >> >> I.e., I am trying to ascertain how constrained/unconstrained the input >> will be. >> With a keyboard, a user could potentially type: >> "supercalifragilisticexpialidocious" >> OTOH, a user selecting phrases from preexisting choices (even if you >> actually >> synthesize the voice on-the-fly) has more limited choices: >> at the first floor >> at the second floor >> at the third floor >> etc. > >The user can type any sequence of chars, but he is encouraged to play >and check the result. If it is too noisy, he can change the words. > > >>>> I.e., is it >>>> "unconstrained text" that you read from a char[]? Could the user >>>> opt to command it to speak "I'd rather have this bottle in front of >>>> me than a frontal lobotomy"? Or, will the sentences/phrases still >>>> be largely constrained by the application domain: "The date of >>>> your last withdrawal was..."? >>> >>> In order to simplify the TTS implementation, I could constrain the >>> user-customizable text to simple words, not sentences. 
>> >> That essentially eliminates the need for any prosody controls (as >> portions of the "sentence" will have been predefined and, thus, have >> their own prosody imposed irrespective of the "blanks filled in". >> >> But, can the user specify *any* word? "smartphone"? "technology"? >> "disillusionment"? "apartheid"? > >Yes. > > >>> So I will have some fixed/constant/"known at compile time" sentences >>> that I can >>> generates and save with high-quality TTS software on desktop >>> computers. And >>> some user-customizable words. >>> >>> For example: >>> >>> "Hello Rick, your air conditioner at the first >>> ^^^^ ^^^^^^^^^^^^ >>> floor has just switched off." >>> ^^^^^ >>> >>> The words marked with carets ^ are user customizable. >> >> So: >> on (at) the first floor >> on (at) the second floor >> on (at) the third floor >> in the penthouse >> in the basement >> in the garage >> Or, perhaps: >> in the basement of your clothing store >> for the dog kennel >> etc? > >The user can write everything, but it is reasonable he writes simple words. > >>>> I.e., could you provide a set of ("prerecorded") words that the user >>>> can then "string together" to form messages? So, the actual message >>>> is created by the user but built from words that your device already >>>> knows how to speak? >>> >>> I already thought about this possibility, but it is a big limitation >>> that I >>> would prefer to avoid. Think of names (Rick in the example): it's >>> impossible >>> to have a full list of prerecorded names in the device. >> >> Think, again, about that. Ignore, for the moment, proper names/nouns and, >> instead, concentrate on just *words*. You can store a rather large >> dictionary >> of words and their (encoded!) pronunciations if you can eliminate the code >> and the "rule sets" that determine how to convert "Rick" into /R/ /IH/ /K/. 
>> Furthermore, you could compress this "dictionary" by noting that you need >> only represent upper (or lower) -case alphas (RICK, rick, RiCk, etc. all >> result in the same pronunciation) and the corresponding sounds into which >> the "text" will be mapped. E.g., ~5 bits for each character in the "name" >> and ~6 bits for each sound. >> >> So, "Rick" requires 38 bits (about 5 bytes) to encode (alond with its >> pronunciation). At run-time, you need only convert the "sound codes" >> into actual "audio waveforms" -- instead of having to convert the >> textual representation of the name into the sound codes *and* then >> into waveforms. >> >> [I have no idea how large your vocabulary will be so no idea how large >> the dictionary would be.] > >I'm sure the user will want to use a words that isn't present in the >dictionary. I'd prefer to avoid this way. > > >> Depending on the level of expertise of your users -- and the hoops through >> which they are willing to jump -- you could also direct them to specify >> the phrases *using* those sound codes. I.e., force them to do the >> "letter to sound" conversion in their heads -- possibly aided by allowing >> them to easily replay what they have just "typed": >> "Hmmm... that 'i' sound needs to be shorter. Let me try..." >> >> Given the variation in how proper names are pronounced, this may well be >> the best approach. E.g., my (english) ruleset would butcher "Alfio", >> "Gabriella", etc. I'm not sure it would even handle "ciao" properly! > >No, the user will not have this kind of expertise. > > >>>> What happens if the user specifies a word that is hard to pronounce >>>> ("Phoenix", "salmon", "Worcester", etc.) using "canned" rules? >>> >>> The user can hear in the advance how his word is pronounced from the >>> device. >> >> OK. >> >>> If it's too difficult to understand, he has the possibility to change >>> the words >>> with some other more understandable. 
>> >> But, you still have to have rules that allow *you* to come up with a >> pronunciation. And, the user needs a way of coercing the device to >> pronounce the word the way he *wants* it to be pronounced. >> >> Does "read" rhyme with "tweed" or "bed"? I.e., a user wanting it >> to be pronounced in a particular way would have to misspell it "reed" >> or "red" (assuming the device picked the "wrong" pronunciation). >> >> Allowing the user to enter "sound codes" avoids that problem. > >I understand, but it's difficult to explain to my users. It is simpler >to explain him misspelling the word in such a way the final result is >similar to the sound he wants to hear. > > >>>> What happens if the user specifies a "word" (sequence of letters) >>>> that is unpronounceable (Mr. Mxyzptlk)? >>> >>> The user will change it. >>> >>>> How do you handle special characters (pronounce "%^*&%$!")? >>>> Acronyms (LPC, IVR, TTS, MCU, etc.)? Numbers (34; 2015; 1,093; >>>> 192388535; etc.)? Mixed strings ("Please call 555-1212 x342 >>>> between the hours of 8:00AM and 4:00PM CST")? >>> >>> Of course numbers must be well pronunced, for example for some >>> settings (only >>> small integers, in the range 0-100) and for times. But the sentences >>> where the >>> numbers are used are generated at compile time, so I can avoid 8:00AM >>> or 4:00PM >>> CST. >>> The user will never creates sentences like those. >> >> So, you need rules that allow <digit>[<digit>] to be pronounced one way >> while <digit>[<digit>]:<digit><digit> is pronounced another. >> >> Presumably, the mechanism by which the user specifies the "words" that >> he wants spoken will disallow any digits in that "text specification"? >> >> (Likewise, punctuation and other special symbols?) > >The user will never needs to customize texts with numbers or times. >Numbers are managed at compile time. 
> > >>>>> So I'm thinking to TTS (Text To Speech) technology that generates >>>>> whatever >>>>> word/sentence at run-time, starting from the associated string. >>>> >>>> This always *sounds* like the right approach -- until you look into the >>>> issues that it drags in with it! It's really hard to come up with a >>>> "good" set of rules that can handle unconstrained input "practically" >>>> (abandoning the goal of "properly"!) >>>> >>>>> How difficult is to integrate a TTS functionality in an electronic >>>>> product? >>>> >>>> Easy *or* hard -- depending on your constraints, goals, resources, >>>> expertise, >>>> etc. >>>> >>>>> What is the MCU power that TTS needs? >>>> >>>> That depends entirely on the constraints you are willing to impose and >>>> quality you seek. You can make noises that sound like speech with a >>>> 1MHz 8b CPU. If you only relied on it for occasional interactions, you >>>> could probably tolerate it. OTOH, you wouldn't want to listen to it >>>> for an appreciable period of time! >>>> >>>>> Do you know some TTS libraries that can >>>>> be embeddable in an electronic project? Do you know of some free >>>>> libraries? >>>> >>>> Start at CMU's Hephaestus page. You might also want to look into >>>> "dialog >>>> systems". Also, don't forget to research intelligibility testing (e.g., >>>> modified rhyme test, anomalous sentences, etc.) as having speech that >>>> isn't intelligible is like having an LED indicator that's "burned out"! >>>> >>>> Invest some time in understanding the "listening prowess" of your target >>>> audience. >>>> >>>>> Please note, I don't need a real "on-the-fly" TTS. I could spend some >>>>> time to >>>>> generates the short message to play. >>>> >>>> Note that if you try to synthesize and *then* play back, you need enough >>>> R/W store to hold the entire message as you are creating it. I.e., so >>>> it has been completely synthesized *prior* to beginning playback. 
>>>> If the user controls the content of the message, how do you ensure that >>>> you have *enough* space to store it? >>>> "This is a really long message that would, obviously, require >>>> considerably >>>> more memory to synthesize" >>>> Said another way, how do you handle the case when the user has asked >>>> you to >>>> speak something that is too long for your "buffer"? >>> >>> Of course, the user has a limited space to write words/sentences. >> >> I'm not talking about specifying the text. Rather, I am addressing your >> comment about "spend some time to generates the short message to play". >> I.e., starting with "text", you'd have to convert the graphemes to >> phonemes; then, synthesize the audio waveform (however your output >> device expects to be fed) from these sound codes and prosodic envelope. >> >> The bulk of the "work" (CPU cycles) is in the creation of the waveforms. >> If you can't "keep up" with real time, then you need to be able to buffer >> the waveform while you create it -- and before you "utter" it. Yet, once >> you *start* to "speak" (i.e., push signal out the speaker), you probably >> can't arbitrarily stop/pause without affecting intelligibility (i.e., you'd >> have to make sure you only paused at word boundaries; never in the middle >> of a word) >> >> So, you need a buffer for all that "analog data". The number of characters >> in the "input word" has little to do with the duration of the utterance >> that will ultimately result. >> >> E.g., the /IH/ vowel sound (rIck) is probably half the duration of >> the /AY/ vowel sound (bIte). Note how long your "mouth is engaged" >> saying the two words. Or, "ewe"/"you" vs. "hit". (now you see why >> we call them "short" and "long" vowels! :> ) > >I understand, but you agree with me that a short text (a small number of >chars) corresponds to a short waveform duration. I can calculate a >worst case for a certain number of chars. 
> > >> If you (your users) can tolerate the effort of "specifying sounds" >> instead of "specifying letters", it might be best to let them >> specify the text in that manner. >> >> At the very least, you could run the text-to-sound portion of the >> algorithm as soon as they have typed in the desired text and store >> the *sound* codes at that time -- to eliminate the effort of >> doing the conversion at "run time" (i.e., when the actual spoken >> output is *required*). >> >> Before you go too far down this road, you may want to explore some of the >> on-line synthesizers to get a feel for how robust they are, the quality >> of their voices, etc. (many are diphone based; you can actually make >> the synthesizer sound like a particular -- REAL -- *person*!) >> >> Then, explore some of the "cheaper" approaches (i.e., those that you are >> more likely to employ in your implementation). Get a feel for how >> the costs change -- as well as the "quality". >> >> At the very least, you'll get an appreciation for how much processing >> we automatically do when handling "combinations of characters" in >> particular, >> specific contexts. > >Yes, I'm trying to understand text-to-speech world, but it seems too >difficult for me. I hoped it was possible to embed some ready-to-use >TTS libraries (free of charge or after payment) as source codes or >object files, without being a TTS expert. It seems, this isn't the case. > >Anyway, Thank you very much for your explanations and time. > >> >>>> OTOH, running the synthesizer and playback concurrently allows you to >>>> shrink your buffer (to a size that just handles jitter in the >>>> algorithms) >>>> and speak phrases of "unlimited" length. >>>> >>>> [Of course, encoding prosody on-the-fly gets trickier]
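The "buffer (to a size that just handles jitter)" in the quoted text is typically a small ring buffer between the synthesizer (producer) and the playback ISR (consumer). A minimal single-producer/single-consumer sketch; the 512-byte size and all names are arbitrary choices for illustration:

```c
#include <stdint.h>
#include <stddef.h>

#define JITTER_BUF_SIZE 512u   /* absorbs synthesis-rate jitter */

static uint8_t jbuf[JITTER_BUF_SIZE];
static volatile size_t head, tail;   /* head: producer, tail: consumer */

/* Synthesizer side: queue one PCM sample; returns 0 if the buffer
 * is full (synthesis is running ahead of playback). */
int jbuf_put(uint8_t sample)
{
    size_t next = (head + 1) % JITTER_BUF_SIZE;
    if (next == tail)
        return 0;                /* full */
    jbuf[head] = sample;
    head = next;
    return 1;
}

/* Output-ISR side: fetch the next sample; returns 0 on underrun
 * (playback caught up with synthesis -- an audible glitch). */
int jbuf_get(uint8_t *sample)
{
    if (tail == head)
        return 0;                /* empty */
    *sample = jbuf[tail];
    tail = (tail + 1) % JITTER_BUF_SIZE;
    return 1;
}
```

On a real target, `jbuf_get()` would be called from the PWM/DAC output interrupt at the sample rate; an underrun return value is the sign that synthesis cannot keep up with concurrent playback.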
I'm a bit unclear on your scenario. Are you going to be generating the speech offline from the device, and then installing the resulting sound file (.wav, etc.) on the device? If so, there are a number of possible ways to do that without too much work.

Windows, for example, has a built-in TTS system, and an API an application can use ("SAPI"). An obvious use case is direct output to the user, but you can also write output to a .wav file.

https://msdn.microsoft.com/en-us/library/ms717065(v=vs.85).aspx

Windows comes with a built-in TTS engine, which does a pretty good job for general use (it's the basis for MS's default screen reader), and has likely had a ton more work put into the parsing and analysis of text than you could justify. But if it's not good enough, there are third-party plug-in TTS engines that you can add as well. These usually add other voices and additional customization options.

Even if you weren't primarily doing your management on a Windows machine, you ought to be able to toss a Windows box or two in a corner as a TTS .wav file server. I believe MS uses the same SAPI on their mobile systems as well.

I'm sure something similar exists for Linux.

There is a TTS package and API for Android. That might be usable, even if you have to run Android on a machine as a server. My understanding is that it uses the same text analysis engine as Google Translate, and Google Translate has a TTS option as well (do an English-to-English translation; my guess is that it's the same TTS back end as in the Android package). It may well be that there's an API or service you can use in there somewhere. And the Android version is presumably open source, although I'm sure it's going to be a handful.

Even if you weren't planning on doing this offline, there are some advantages to it, especially if the device (or management application) has internet access - there's a big lump of code you don't have to distribute and run on the device.
Hi Robert,

On 4/8/2015 3:20 PM, Robert Wessel wrote:
> On Wed, 08 Apr 2015 15:58:27 +0200, pozz <pozzugno@gmail.com> wrote:
> I'm a bit unclear on your scenario.
<grin> Join the club!

ISTM that the OP wants to have (reasonably) high quality *canned* phrases/sentences into which the user can "salt" user-specific data/phrases:

  "The _____________ device has reported a power failure."
  "Your _______ door has been opened!"
  "The _________ seems to be running too hot."

The canned portion can obviously be "processed" (whatever THAT means) at compile-time as they are invariant. But, the "blanks" need to be created "at CONFIGURATION time" (which, presumably, is somewhere between compile-time and run-time).

Further, the content for those "blanks" is relatively unconstrained and may include "words" that defy traditional TTS algorithms. E.g., names (how do *you* pronounce "Berlin"?).
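One way to represent these fill-in-the-blank messages in firmware is as a list of segments, each either a canned (compile-time) phrase or a user-configured slot filled at configuration time. Everything here -- the names, the table, the split of phrases -- is a hypothetical sketch, not a known design:

```c
#include <stddef.h>

enum seg_kind { SEG_CANNED, SEG_USER_SLOT };

struct segment {
    enum seg_kind kind;
    unsigned      index;   /* canned-phrase ID or user-slot number */
};

/* "The <slot 0> device has reported a power failure." */
static const struct segment power_fail_msg[] = {
    { SEG_CANNED,    0 },  /* "The"                                    */
    { SEG_USER_SLOT, 0 },  /* user-configured device name              */
    { SEG_CANNED,    1 },  /* "device has reported a power failure"    */
};

/* The firmware supplies the actual player (PCM from ROM for canned
 * segments, synthesized/recorded audio for user slots). */
typedef void (*player_fn)(enum seg_kind, unsigned);

void speak_message(const struct segment *msg, size_t n, player_fn play)
{
    for (size_t i = 0; i < n; ++i)
        play(msg[i].kind, msg[i].index);
}

/* Example player that just counts segments (useful for testing). */
static unsigned segments_played;
static void count_player(enum seg_kind k, unsigned i)
{
    (void)k; (void)i;
    ++segments_played;
}
```

The dispatch through `player_fn` is what lets canned and run-time audio come from entirely different sources while the message table stays uniform.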
> Are you going to be generating the speech offline from the device, and > then installing the resulting sound file (.wav, etc.) on the device? > If so, there are a number of possible ways to do that without too much > work. > > Windows, for example, has a built in TTS system, and an API an > application can use ("SAPI"). An obvious use case is with direct > output to the user, but you can also write output to a .wav file. > > https://msdn.microsoft.com/en-us/library/ms717065(v=vs.85).aspx
I really don't understand the need for a compile-time TTS! Why not just *record* the speech and then encode it <however>? Why let an (inferior) algorithm try to come up with "natural sounding" speech when you could find a genuine human being to do this??

I am trying to understand a situation where "storing" a message in "audio" form makes sense given that he plans on having some TTS capability in the product. AFAICT, the only advantage comes if you can't do the synthesis on-the-fly and have to resort to building output waveforms in volatile memory at run-time; this hybrid approach could let you shrink the amount of such memory in favor of "ROM" with the canned representations.

[OTOH, a cleverer approach could synthesize everything "in small word groups" and piece them together -- with pauses between]

ISTM that storing the canned portions in the same "bastardized spellings" that were discussed up-thread and letting the TTS synthesize *everything* would be the better approach. E.g., I run *all* of my "canned text" through the TTS engine in my device just to eliminate the burden on the developer of having to "precompile" the "spoken output".

But, the OP understands his market better than I...
> Windows comes with a built in TTS engine, which does a pretty good job > for general use (it's the basis for MS's default screen reader), and > has likely had a ton more work put into parsing an analysis of text > than you could justify. But if it's not good enough, there are third > party plug-in TTS engines that you can add as well. These usually add > other voices and additional customization options.
Let the *user* download a .WAV file from *his* PC. Then, just concentrate on being able to reproduce those files accurately (given that they may contain "wonkiness").

Reserve a portion of your flash to hold messages? Add something to verify that portion of the flash contains something that *looks* like a message?? (hey, user may opt to store sound-effects instead of actual "spoken speech")

[There are some low vision aids that just let the user record their own voice in place of accepting text for <whatever>. Then, the device simply plays back their recording when they want to "access" that "data": "This is a can of corn niblets"; "Appointment at dentist on Friday"; etc.]
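The "looks like a message" check suggested here can be as simple as validating the RIFF/WAVE magic at the front of the stored region before trusting it. The offsets used ("RIFF" at byte 0, "WAVE" at byte 8) are the standard WAV layout; the rest of the function is an illustrative sketch:

```c
#include <string.h>
#include <stddef.h>

/* Minimal sanity check: does this flash region start with a
 * RIFF/WAVE header?  Deliberately does NOT validate chunk sizes
 * or sample format -- just "looks like" a WAV file. */
int looks_like_wav(const unsigned char *region, size_t len)
{
    if (len < 12)
        return 0;
    return memcmp(region, "RIFF", 4) == 0 &&
           memcmp(region + 8, "WAVE", 4) == 0;
}
```

A real implementation would probably also verify the "fmt " chunk against the sample rates and encodings the playback path actually supports, so a user can't brick the audio output with an exotic file.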
> Even if you weren't primarily doing your management on a Windows > machine, you ought to be able to toss a Windows box or two in a corner > as a TTS .wav file server. > > I believe MS uses the same SAPI on their mobile systems as well. > > I'm sure similar exists for Linux. > > There is a TTS package and API for Android. That might be usable, > even if you have to run Android on a machine as a server. My > understanding is that it uses the same text analysis engine as Google > Translate, and Google Translate has a TTS option as well (do an > English-to-English translation; my guess is that it's > the same TTS back end as in the Android package). It may well be > that there's an API or service you can use in there somewhere. > > And the Android version is presumably open source, although I'm sure > it's going to be a handful. > > Even if you weren't planning on doing this offline, there are some > advantages to that, especially if the device (or management > application) has internet access - there's a big lump of code you > don't have to distribute and run on the device.
The bulk of the code involved in TTS lies in the "rules" by which text is evaluated in context, etc. A formant-based synthesizer (i.e., feed it with "sound codes") is surprisingly small/compact -- tens of KB. The biggest issue is dealing with all the run-time math (esp. if you don't have floats). OTOH, a diphone synthesizer may require several MB for the unit database. And, a fair bit of smarts piecing together adjacent diphones.

If you can afford crude text-to-SOUND rules, you can trim that portion of the codebase to a few KB -- largely to encode the rules. Even those can be simplified if you are willing to shift some of the burden to the user/developer (e.g., replace "qu" with "KW" or "K", as appropriate, in the "text" fed to the TTS and eliminate those "q" rules). Skip prosody and you can save there, as well.

[Low cost product, low expectations from user...]
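The "shift the burden to the user/developer" trick -- stripping the ruleset down to a handful of substitutions -- can be prototyped as a plain rewrite pass over the input text. The particular substitutions below ("ch" kept, bare 'c' -> 'k', "qu" -> "kw") are just the English examples from this thread, not a real ruleset:

```c
#include <ctype.h>
#include <stddef.h>

/* Rewrite `in` into `out` (out must be at least as large as in, plus
 * the terminator), applying a tiny letter-to-sound normalization so
 * the downstream TTS needs fewer rules.  Toy example only -- a real
 * pass would be table-driven and language-specific. */
void normalize_for_tts(const char *in, char *out)
{
    size_t o = 0;
    for (size_t i = 0; in[i] != '\0'; ++i) {
        char c = (char)tolower((unsigned char)in[i]);
        if (c == 'q' && tolower((unsigned char)in[i + 1]) == 'u') {
            out[o++] = 'k';   /* "qu" -> "kw" */
            out[o++] = 'w';
            ++i;              /* consume the 'u' */
        } else if (c == 'c' && tolower((unsigned char)in[i + 1]) == 'h') {
            out[o++] = 'c';   /* keep "ch" -- it is its own sound */
            out[o++] = 'h';
            ++i;
        } else if (c == 'c') {
            out[o++] = 'k';   /* bare 'c' sounds like 'k' */
        } else {
            out[o++] = c;
        }
    }
    out[o] = '\0';
}
```

Running this once at configuration time (when the user enters the text) rather than at playback time also matches the earlier suggestion to pre-convert graphemes as soon as they are typed.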
On 08/04/2015 22:24, Don Y wrote:
> On 4/8/2015 6:58 AM, pozz wrote: > >>>> The voice should be as understandable as possible. Of course, greater >>>> quality >>>> is better, but I don't need high fidelity quality. >>> >>> <grin> That really doesn't say much :-/ >>> >>> Have you listened to many synthetic voices? They range from *very* >>> natural >>> to "ick". >>> >>> Given that you appear to be pasting "compile-time" speech with >>> "run-time" >>> speech, are you willing to tolerate the sudden "voice/quality" change >>> that >>> will be apparent where you have "filled in the blanks" with run-time >>> utterances? I.e., you can have very natural compile-time speech that >>> is laced with potentially *crude* run-time phrases. >> >> You agree with me: high quality for "compile-time" sentences *and* for >> "run-time" senteces is better. But I don't need it. The device is for >> low-cost market, so the user won't have too much expectations. > > Only *you* can comment on your market and what it will accept. I'm just > pointing out that there *will* be a very noticeable "pieced together" > feel (sound) to it. > > Have you also considered just letting the user *record* his messages > (i.e., using his own voice via a microphone *or* "downloading" it into > the device from a "PC")?
This is exactly what my competitors already do, but I was thinking about how to improve on it. The final result/sound isn't good: you get a mix of very good words (maybe from a female voice) and the words pronounced by the user at the microphone (maybe a male user).

The best option is to use the same TTS engine for "compile-time" words/sentences and "run-time" words. In this way, the result will not have quality gaps. But it isn't simple to embed a high-quality TTS engine.

Another possibility is to use the same TTS engine with "two levels of quality". The high-quality version is used on the desktop/development computer to generate "compile-time" words/sentences; the low-quality version should be embeddable in the device. In this way, some gaps in quality can be heard, but I think the overall result would be good (at least the engine uses the same voice).
>>> The user will obviously know where the "filled in blanks" occur in the >>> audio output (which may be acceptable to you). What might *not* be as >>> acceptable is the change in *quality*/intelligibility that results. >> >> It will be acceptable. The change in "quality" corresponds exactly to >> the >> customizable words. So the user understands what happens. > >>>> At the moment, I need only Italian language and single voice. Not >>>> customizable. > >> The user can type any sequence of chars, but he is encouraged to play >> and check >> the result. If it is too noisy, he can change the words. > > OK. In my application, the user has no "preview" capability. So, he has > to be able to recognize what the device (as a proxy) is trying to "tell" > him > regardless of the complexity of that (unconstrained) input. As such, I > have controls that allow him to replay messages, "spell" individual > words/numbers, change the characteristics of the speech (pitch, rate, etc.) > to be more intelligible, etc. > >>> But, can the user specify *any* word? "smartphone"? "technology"? >>> "disillusionment"? "apartheid"? >> >> Yes. > >> The user can write everything, but it is reasonable he writes simple >> words. > >>> So, "Rick" requires 38 bits (about 5 bytes) to encode (alond with its >>> pronunciation). At run-time, you need only convert the "sound codes" >>> into actual "audio waveforms" -- instead of having to convert the >>> textual representation of the name into the sound codes *and* then >>> into waveforms. >>> >>> [I have no idea how large your vocabulary will be so no idea how large >>> the dictionary would be.] >> >> I'm sure the user will want to use a words that isn't present in the >> dictionary. I'd prefer to avoid this way. > > Then your TTS rules will need to address every potential case. Note, > however, that if your rules are *intuitive*, users will quickly learn how > to misspell the text in order to get an acceptable pronunciation. 
E.g., > in English, the only (phonetic) use for the letter 'C' in the input > text is to represent the "CH" sound. All other C's can be replaced > with 'S' or 'K'. > > You can probably also eliminate a lot of the subtle differences in sounds > that would promote more naturalness. E.g., (in English), the 'N' sound > in "Next" is subtly different from that in "buttoN"; likewise, the 'L' > in "Let" vs. "piLL"; the 'R' in "Ready" vs. "tiRe"; the 'W' in "Which" > vs. "Wet"; etc. > > Find a word-list of "common" words (in Italian) and prepare to feed them > to your TTS to see how good/bad the resulting pronunciation. And, for > those > that are less than ideal, see if you can misspell them in ways that make > their pronunciations more acceptable. Finally, look at those misspellings > and see if a user could readily come to the same sort of realization > (*if* the pronunciation of the proper spelling was "bad enough" to warrant) > >>> Depending on the level of expertise of your users -- and the hoops >>> through >>> which they are willing to jump -- you could also direct them to specify >>> the phrases *using* those sound codes. I.e., force them to do the >>> "letter to sound" conversion in their heads -- possibly aided by >>> allowing >>> them to easily replay what they have just "typed": >>> "Hmmm... that 'i' sound needs to be shorter. Let me try..." >>> >>> Given the variation in how proper names are pronounced, this may well be >>> the best approach. E.g., my (english) ruleset would butcher "Alfio", >>> "Gabriella", etc. I'm not sure it would even handle "ciao" properly! >> >> No, the user will not have this kind of expertise. > > See above. Sorry, I can't comment on appropriate "bastardizations" for > Italian. But, in English, a "motivated user" can usually come up with > ways to coax a TTS into "uttering the sounds" that he'd like to hear. > > Important tip: be sure to encode some basic punctuation. 
People quickly > learn that they can influence "playback" if they insert a ',' to force a > small pause at a certain point in the text; a '.' for a longer pause; > etc. If you also tried to encode prosody (doubtful given your > description), > things like '!' and '?' could be artificially injected to influence that. > >>>> If it's too difficult to understand, he has the possibility to change >>>> the words >>>> with some other more understandable. >>> >>> But, you still have to have rules that allow *you* to come up with a >>> pronunciation. And, the user needs a way of coercing the device to >>> pronounce the word the way he *wants* it to be pronounced. >>> >>> Does "read" rhyme with "tweed" or "bed"? I.e., a user wanting it >>> to be pronounced in a particular way would have to misspell it "reed" >>> or "red" (assuming the device picked the "wrong" pronunciation). >>> >>> Allowing the user to enter "sound codes" avoids that problem. >> >> I understand, but it's difficult to explain to my users. It is >> simpler to >> explain him misspelling the word in such a way the final result is >> similar to >> the sound he wants to hear. > > The result will be the same -- *if* your ruleset is simple/obvious. E.g., > "'c' only makes sense in 'ch', else 'k'". E.g., I would encode "ciao" as > "chow" to get the pronunciation I (English) sought. > >>> So, you need rules that allow <digit>[<digit>] to be pronounced one way >>> while <digit>[<digit>]:<digit><digit> is pronounced another. >>> >>> Presumably, the mechanism by which the user specifies the "words" that >>> he wants spoken will disallow any digits in that "text specification"? >>> >>> (Likewise, punctuation and other special symbols?) >> >> The user will never needs to customize texts with numbers or times. >> Numbers are >> managed at compile time. > > So, he'd never say "The air conditioner in room 307 has just switched off"? 
> Or, if he wanted to do so, he would be expected to write it as "The air > conditioner in room three oh seven has just switched off". > >>>>> Start at CMU's Hephaestus page. You might also want to look into >>>>> "dialog >>>>> systems". Also, don't forget to research intelligibility testing >>>>> (e.g., >>>>> modified rhyme test, anomalous sentences, etc.) as having speech that >>>>> isn't intelligible is like having an LED indicator that's "burned >>>>> out"! >>>>> >>>>> Invest some time in understanding the "listening prowess" of your >>>>> target >>>>> audience. > >>>> Of course, the user has a limited space to write words/sentences. >>> >>> I'm not talking about specifying the text. Rather, I am addressing your >>> comment about "spend some time to generates the short message to play". >>> I.e., starting with "text", you'd have to convert the graphemes to >>> phonemes; then, synthesize the audio waveform (however your output >>> device expects to be fed) from these sound codes and prosodic envelope. >>> >>> The bulk of the "work" (CPU cycles) is in the creation of the waveforms. >>> If you can't "keep up" with real time, then you need to be able to >>> buffer >>> the waveform while you create it -- and before you "utter" it. Yet, >>> once >>> you *start* to "speak" (i.e., push signal out the speaker), you probably >>> can't arbitrarily stop/pause without affecting intelligibility (i.e., >>> you'd >>> have to make sure you only paused at word boundaries; never in the >>> middle >>> of a word) >>> >>> So, you need a buffer for all that "analog data". The number of >>> characters >>> in the "input word" has little to do with the duration of the utterance >>> that will ultimately result. >>> >>> E.g., the /IH/ vowel sound (rIck) is probably half the duration of >>> the /AY/ vowel sound (bIte). Note how long your "mouth is engaged" >>> saying the two words. Or, "ewe"/"you" vs. "hit". (now you see why >>> we call them "short" and "long" vowels! 
:> ) >> >> I understand, but you agree with me that a short text (a small number >> of chars) >> corresponds to a short waveform duration. I can calculate a worst >> case for a >> certain number of chars. > > I think you will find this isn't as obvious/easy as you expect. You may > find > it easier to just "run a tighter loop" -- possibly dismissing other > activities > at the time -- and synthesize on the fly. This can dramatically shrink > your memory (buffer) requirements. If you are willing to accept "crude" > for > the "filled in blanks", then a lot of processing can be skipped (e.g., > prosody -- just rattle those things off in a monotone) > >>> If you (your users) can tolerate the effort of "specifying sounds" >>> instead of "specifying letters", it might be best to let them >>> specify the text in that manner. >>> >>> At the very least, you could run the text-to-sound portion of the >>> algorithm as soon as they have typed in the desired text and store >>> the *sound* codes at that time -- to eliminate the effort of >>> doing the conversion at "run time" (i.e., when the actual spoken >>> output is *required*). >>> >>> Before you go too far down this road, you may want to explore some of >>> the >>> on-line synthesizers to get a feel for how robust they are, the quality >>> of their voices, etc. (many are diphone based; you can actually make >>> the synthesizer sound like a particular -- REAL -- *person*!) >>> >>> Then, explore some of the "cheaper" approaches (i.e., those that you are >>> more likely to employ in your implementation). Get a feel for how >>> the costs change -- as well as the "quality". >>> >>> At the very least, you'll get an appreciation for how much processing >>> we automatically do when handling "combinations of characters" in >>> particular, specific contexts. >> >> Yes, I'm trying to understand text-to-speech world, but it seems too >> difficult >> for me. > > It's not easy. 
Speech (and language) have lots of subtleties that we > take for granted in our daily life. Why "nickEL", yet "pickLE"? > Hopefully, Italian (as a language) is "more regular" than English. > (ISTR some of the Scandinavian languages are very "regular") > >> I hoped it was possible to embed some ready-to-use TTS libraries (free >> of charge or after payment) as source codes or object files, without >> being a >> TTS expert. It seems, this isn't the case. > > You can play with flite but I think you will find it too large for your > needs. There are several other "open" TTS implementations (though not sure > how well suited to Italian their rulesets would be) but, most suffer from > the same "lack of concern for resources" that you might encounter in a > deeply embedded product.
I have already seen the flite project and I'm studying it. It seems there's an Italian version too. Maybe this could be a good starting point.
> I'm starting my *third* version (different approach than either of the > first > two) at a "lean TTS" and suspect I will be disappointed with that, as well > (primarily due to the unconstrained vocabulary consequences -- it's always > easy to come up with "typical" things that are difficult to handle WITHOUT > making the algorithms incredibly complex) > > Good luck! > --don
Thank you.
> >> Anyway, Thank you very much for your explanations and time. >> >>> >>>>> OTOH, running the synthesizer and playback concurrently allows you to >>>>> shrink your buffer (to a size that just handles jitter in the >>>>> algorithms) >>>>> and speak phrases of "unlimited" length. >>>>> >>>>> [Of course, encoding prosody on-the-fly gets trickier] >> >