
Embeddable text-to-speech

Started by pozz April 7, 2015
On 09/04/2015 01:18, Don Y wrote:
> Hi Robert,
>
> On 4/8/2015 3:20 PM, Robert Wessel wrote:
>> On Wed, 08 Apr 2015 15:58:27 +0200, pozz <pozzugno@gmail.com> wrote:
>
>> I'm a bit unclear on your scenario.
>
> <grin> Join the club! ISTM that the OP wants to have (reasonably)
> high quality *canned* phrases/sentences into which the user can "salt"
> user-specific data/phrases:
>
> "The _____________ device has reported a power failure."
> "Your _______ door has been opened!"
> "The _________ seems to be running too hot."
>
> The canned portion can obviously be "processed" (whatever THAT means)
> at compile-time as they are invariant. But, the "blanks" need to be
> created "at CONFIGURATION time" (which, presumably, is somewhere
> between compile-time and run-time).
>
> Further, the content for those "blanks" is relatively unconstrained
> and may include "words" that defy traditional TTS algorithms. E.g.,
> names (how do *you* pronounce "Berlin"?).
Perfect description of my situation :-)
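The "canned sentence with configuration-time blanks" scheme described
above maps onto a very small template mechanism. Below is a minimal C
sketch, assuming a printf-style "%s" as the blank marker; all names and
sizes are invented for illustration and come from neither product
discussed in this thread:

#include <stdio.h>

/* Canned messages, fixed at compile time. The "%s" marks the blank
 * to be filled at configuration time (hypothetical scheme). */
static const char *const canned[] = {
    "The %s device has reported a power failure.",
    "Your %s door has been opened!",
    "The %s seems to be running too hot.",
};

/* Splice the configured name into a canned template; returns the
 * number of characters written, or -1 for a bad index. */
static int build_message(char *out, size_t outlen,
                         unsigned idx, const char *blank)
{
    if (idx >= sizeof canned / sizeof canned[0])
        return -1;
    return snprintf(out, outlen, canned[idx], blank);
}

int main(void)
{
    char msg[128];

    /* "garage" would come from configuration, not from compile time */
    if (build_message(msg, sizeof msg, 1, "garage") > 0)
        puts(msg);  /* stand-in for handing the text to a TTS engine */
    return 0;
}

The same table can later feed either pre-rendered audio clips or a
run-time synthesizer; only the speak/playback call changes.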
>> Are you going to be generating the speech offline from the device,
>> and then installing the resulting sound file (.wav, etc.) on the
>> device? If so, there are a number of possible ways to do that without
>> too much work.
>>
>> Windows, for example, has a built in TTS system, and an API an
>> application can use ("SAPI"). An obvious use case is with direct
>> output to the user, but you can also write output to a .wav file.
>>
>> https://msdn.microsoft.com/en-us/library/ms717065(v=vs.85).aspx
>
> I really don't understand the need for a compile time TTS! Why not
> just *record* the speech and then encode it <however>? Why let an
> (inferior) algorithm try to come up with "natural sounding" speech
> when you could find a genuine human being to do this??
Because it isn't difficult to record a good voice with a microphone at
a computer, if you aren't a "voice talent". Usually you'd like to hear
a female voice, but I am male. I would have to engage a good female
voice, try to record something, and fix the result. Maybe after some
months I notice that it's better to say "Dear user" and not "Hello
user", and I have to call the woman again -- but maybe she isn't
available anymore.
> I am trying to understand a situation where "storing" a message in
> "audio" form makes sense given that he plans on having some TTS
> capability in the product. AFAICT, the only advantage comes if you
> can't do the synthesis on-the-fly and have to resort to building
> output waveforms in volatile memory at run-time; this hybrid approach
> could let you shrink the amount of such memory in favor of "ROM" with
> the canned representations.
>
> [OTOH, a cleverer approach could synthesize everything "in small word
> groups" and piece them together -- with pauses between]
>
> ISTM, that storing the canned portions in the same "bastardized
> spellings" that were discussed up-thread and letting the TTS
> synthesize *everything* would be the better approach. E.g., I run
> *all* of my "canned text" through the TTS engine in my device just to
> eliminate the burden on the developer of having to "precompile" the
> "spoken output".
If the result is good enough, this could be a good approach. Anyway,
it's a pity to use a low-quality embedded TTS engine to pronounce 99%
of the sentences when it's really needed only for the 1%.
> But, the OP understands his market better than I...
>
>> Windows comes with a built in TTS engine, which does a pretty good
>> job for general use (it's the basis for MS's default screen reader),
>> and has likely had a ton more work put into parsing and analysis of
>> text than you could justify. But if it's not good enough, there are
>> third party plug-in TTS engines that you can add as well. These
>> usually add other voices and additional customization options.
>
> Let the *user* download a .WAV file from *his* PC. Then, just
> concentrate on being able to reproduce those files accurately (given
> that they may contain "wonkiness"). Reserve a portion of your flash
> to hold messages? Add something to verify that portion of the flash
> contains something that *looks* like a message?? (hey, user may opt
> to store sound-effects instead of actual "spoken speech")
>
> [There are some low vision aids that just let the user record their
> own voice in place of accepting text for <whatever>. Then, the device
> simply plays back their recording when they want to "access" that
> "data": "This is a can of corn niblets"; "Appointment at dentist on
> Friday"; etc.]
>
>> Even if you weren't primarily doing your management on a Windows
>> machine, you ought to be able to toss a Windows box or two in a
>> corner as a TTS .wav file server.
>>
>> I believe MS uses the same SAPI on their mobile systems as well.
>>
>> I'm sure similar exists for Linux.
>>
>> There is a TTS package and API for Android. That might be usable,
>> even if you have to run Android on a machine as a server. My
>> understanding is that it uses the same text analysis engine as
>> Google Translate does, and Google Translate has a TTS option as well
>> (do English-to-English as the translation and select the speech
>> output; my guess is that it's the same TTS back end as in the
>> Android package). It may well be that there's an API or service you
>> can use in there somewhere.
>>
>> And the Android version is presumably open source, although I'm sure
>> it's going to be a handful.
>>
>> Even if you weren't planning on doing this offline, there are some
>> advantages to that, especially if the device (or management
>> application) has internet access - there's a big lump of code you
>> don't have to distribute and run on the device.
>
> The bulk of the code involved in TTS lies in the "rules" by which
> text is evaluated in context, etc. A formant-based synthesizer (i.e.,
> feed it with "sound codes") is surprisingly small/compact -- tens of
> KB. Biggest issue is dealing with all the run-time math (esp if you
> don't have floats).
>
> OTOH, a diphone synthesizer may require several MB for the unit
> database. And, a fair bit of smarts piecing together adjacent
> diphones.
>
> If you can afford crude text-to-SOUND rules, you can trim that
> portion of the codebase to a few KB -- largely to encode the rules.
> Even those can be simplified if you are willing to shift some of the
> burden to the user/developer (e.g., replace "qu" with "KW" or "K", as
> appropriate, in the "text" fed to the TTS and eliminate those "q"
> rules). Skip prosody and you can save there, as well.
>
> [Low cost product, low expectations from user...]
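To make the "trim the text-to-SOUND rules" idea concrete: a crude
letter-to-sound pass can be little more than a first-match rewrite
table, which is why it can fit in a few KB. A sketch with made-up rules
(real rulesets are per-language and far larger; the sound codes here
are arbitrary):

#include <stdio.h>
#include <string.h>

/* A few illustrative rewrite rules (longest match first). Real
 * letter-to-sound rulesets are per-language and far larger. */
static const struct { const char *from, *to; } rule[] = {
    { "qu", "K W" }, { "ch", "CH" }, { "ph", "F" },
    { "a", "AH" }, { "e", "EH" }, { "i", "IH" },
    { "o", "OW" }, { "u", "UH" },
};

/* Emit crude sound codes for a bit of text, one rule at a time;
 * unmatched letters pass through unchanged. */
static void to_sounds(const char *text)
{
    while (*text) {
        size_t i, n = sizeof rule / sizeof rule[0];
        for (i = 0; i < n; i++) {
            size_t len = strlen(rule[i].from);
            if (strncmp(text, rule[i].from, len) == 0) {
                printf("%s ", rule[i].to);
                text += len;
                break;
            }
        }
        if (i == n)
            printf("%c ", *text++);  /* no rule: pass letter through */
    }
    putchar('\n');
}

int main(void)
{
    to_sounds("queen");  /* prints "K W EH EH n" -- crude, but tiny */
    return 0;
}

Pre-spelling the input (the "KW" trick above) shrinks the table even
further, since the awkward rules never need to exist at all.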
On 4/9/2015 1:01 AM, pozz wrote:
> On 08/04/2015 22:24, Don Y wrote:
>> On 4/8/2015 6:58 AM, pozz wrote:
>>
>>>>> The voice should be as understandable as possible. Of course,
>>>>> greater quality is better, but I don't need high fidelity quality.
>>>>
>>>> <grin> That really doesn't say much :-/
>>>>
>>>> Have you listened to many synthetic voices? They range from *very*
>>>> natural to "ick".
>>>>
>>>> Given that you appear to be pasting "compile-time" speech with
>>>> "run-time" speech, are you willing to tolerate the sudden
>>>> "voice/quality" change that will be apparent where you have "filled
>>>> in the blanks" with run-time utterances? I.e., you can have very
>>>> natural compile-time speech that is laced with potentially *crude*
>>>> run-time phrases.
>>>
>>> You agree with me: high quality for "compile-time" sentences *and*
>>> for "run-time" sentences is better. But I don't need it. The device
>>> is for the low-cost market, so the user won't have too many
>>> expectations.
>>
>> Only *you* can comment on your market and what it will accept. I'm
>> just pointing out that there *will* be a very noticeable "pieced
>> together" feel (sound) to it.
>>
>> Have you also considered just letting the user *record* his messages
>> (i.e., using his own voice via a microphone *or* "downloading" it
>> into the device from a "PC")?
>
> This is exactly what my competitors already do, but I was thinking how
> to improve on this.
>
> The final result/sound isn't good: you have a mix between very good
> words (maybe from a female voice) and the words pronounced by the user
> at the microphone (maybe a male user).
My point was:
- use a human being instead of a synthetic voice
- hey, why not use the CUSTOMER?? To record the canned portions *and*
  the "filled in blanks"! I.e., just let the customer record the
  *entire* message -- canned and "filled in blanks"

For example, I have a device, here, that is basically a portable
barcode scanner with "audio output". A user scans a barcode (e.g., on
a can of corn niblets), the device looks up the barcode (UPC label) in
an internal database and then speaks the identification associated with
that label:
    "Corn niblets, Green Giant (brand), 12 oz"
using a synthetic voice selected by the user.

But, there are occasions where the scanned label is not present in the
database. For these, the user can *record* their own "annotation"
which will then be tied to that particular barcode label:
    "My favorite black sweater"
Thereafter, whenever that same label is encountered, the device replays
the user's annotation (in *their* voice). This is far more convenient
than having the user *type* a formal description of the item (which the
speech synthesizer could then speak).
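The scanner's lookup-then-fallback behavior is a reusable pattern on
its own. Here is a sketch of that control flow; every type, name, and
UPC below is invented for illustration, not taken from the actual
device:

#include <stdio.h>
#include <string.h>

/* Invented types: a database entry ties a UPC to canned text and,
 * optionally, a user-recorded annotation. */
struct entry {
    const char *upc;
    const char *text;       /* NULL if no canned description */
    const char *recording;  /* NULL until the user records one */
};

static struct entry db[] = {
    { "020000123456", "Corn niblets, Green Giant, 12 oz", NULL },
    { "111111111111", NULL, "my-favorite-black-sweater.raw" },
};

static void speak(const char *text) { printf("TTS: %s\n", text); }
static void play(const char *clip)  { printf("(playing %s)\n", clip); }

/* Prefer the canned text, fall back to the user's own recording,
 * and invite a recording when the label is unknown. */
static void on_scan(const char *upc)
{
    for (size_t i = 0; i < sizeof db / sizeof db[0]; i++) {
        if (strcmp(db[i].upc, upc) == 0) {
            if (db[i].text)
                speak(db[i].text);
            else
                play(db[i].recording);
            return;
        }
    }
    puts("(unknown label: prompt the user to record an annotation)");
}

int main(void)
{
    on_scan("020000123456");  /* spoken by the synthesizer */
    on_scan("111111111111");  /* replayed in the user's voice */
    return 0;
}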
> The best option is to use the same TTS engine for "compile-time"
> words/sentences and "run-time" words. In this way, the result will
> not have quality gaps. But it isn't simple to embed a high-quality
> TTS engine.
Exactly. But, there are ways you can work around this.

As I mentioned (elsewhere in this thread), much of the complexity of a
TTS lies in the text-to-sound algorithms. I.e., knowing when "read" is
to be pronounced as "red" vs. "reed"; knowing that strings of digits of
the form ###-#### are telephone numbers (in which case, each digit
should be spoken individually with a pause inserted for the '-') while
XXXX is likely a *year* (esp if a month name is noted "nearby" and/or
the value encoded is "reasonably current"); adding prosodic features;
etc.

It might not be unreasonable to "require" the user to determine how
things are pronounced (as discussed in a past message). This
eliminates the need for much of the code that bloats OTS TTS
implementations.

The most difficult part of listening to synthetic speech is dealing
with incorrect pronunciations. Unlike *print*, it's hard for most
people to "rewind" their memories of what they just heard -- especially
while the device *continues* to speak! (Our "aural" memories are much
too short; we remember speech only *after* recognizing the individual
words! So, if you are stumped by an unexpected mispronunciation, you
have to rely on your memory of the raw *sounds*.)

The other big issue listening to synthetic speech is prosody and
cadence. My comments re: pauses and punctuation can allow the user to
artificially create a better sounding sentence (by injecting pauses
"for best effect"). Operating in a pure monotone is acceptable for
infrequent exposure -- but you wouldn't want to listen to such speech
"all day long" (your ears literally get "tired" in much the same way
that your eyes tire after a long day of reading print).

TSTR that you can use markup languages with flite/festival, etc. If
so, you may want to try *deliberately* creating some input text that
forces the synthesizer into a "monotone mode" (i.e., deliberately
remove all inflection). Then, try replacing the voices with different
technologies: diphone, mbrola, formant, etc. and see how you like the
intelligibility of the result. WITH AN EYE TOWARDS SYNTHESIZING THE
ENTIRE MESSAGE(S).
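The ###-#### vs. four-digit-year disambiguation mentioned above is
easy to picture in code. A toy classifier follows, with heuristics
invented for the example; a real TTS front end would weigh far more
context (e.g., a nearby month name):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy heuristic: ###-#### looks like a phone number. */
static int looks_like_phone(const char *s)
{
    if (strlen(s) != 8 || s[3] != '-')
        return 0;
    for (int i = 0; i < 8; i++)
        if (i != 3 && !isdigit((unsigned char)s[i]))
            return 0;
    return 1;
}

/* Toy heuristic: four digits in a plausible range looks like a year. */
static int looks_like_year(const char *s)
{
    if (strlen(s) != 4)
        return 0;
    for (int i = 0; i < 4; i++)
        if (!isdigit((unsigned char)s[i]))
            return 0;
    int v = atoi(s);
    return v >= 1500 && v <= 2100;
}

/* Phone numbers: each digit spoken individually, '-' as a pause. */
static void speak_digits(const char *s)
{
    static const char *name[] = { "zero", "one", "two", "three",
        "four", "five", "six", "seven", "eight", "nine" };
    for (; *s; s++)
        printf("%s ",
               isdigit((unsigned char)*s) ? name[*s - '0'] : "(pause)");
    putchar('\n');
}

int main(void)
{
    if (looks_like_phone("555-1234"))
        speak_digits("555-1234");
    if (looks_like_year("2015"))
        puts("twenty fifteen");  /* year reading, hard-coded for demo */
    return 0;
}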
> Another possibility is to use the same TTS engine with "two levels of
> quality". The high quality is used on the desktop/development
> computer to generate "compile-time" words/sentences. The low quality
> version should be embeddable in the device.
> In this way, some gaps in the quality can be heard, but I think the
> overall result would be good (at least, the engine uses the same
> voice).
Again, the difference won't be in the "characteristics" of the voice.
But, rather, in the quality of the *pronunciation*. I.e., the sounds
that the synthesizer is directed to utter based on the analysis of the
input text.

You've not indicated your resource budget. You might give some of the
diphone voices a listen and see how "natural" you think they are -- esp
when configured in a "monotone" mode. You can then select a "real"
person as the model for the voice you choose (incl yourself!).

Part of the problem with formant synthesizers (much lower resource
requirements) is sorting out how to tweak the multitude of *parameters*
to get a voice that sounds the way you'd like it to sound. With a
diphone synthesizer, you just find someone whose voice you *like*! :>
>>> I hoped it was possible to embed some ready-to-use TTS libraries
>>> (free of charge or after payment) as source code or object files,
>>> without being a TTS expert. It seems, this isn't the case.
>>
>> You can play with flite but I think you will find it too large for
>> your needs. There are several other "open" TTS implementations
>> (though not sure how well suited to Italian their rulesets would be)
>> but, most suffer from the same "lack of concern for resources" that
>> you might encounter in a deeply embedded product.
>
> I have already seen the flite project and I'm studying it. It seems
> there's an Italian version too. Maybe this could be a good starting
> point.
Flite is big -- despite its claim to being small! There are lots of
other "open" synthesizers out there to poke at. Many years ago, there
was a crude "say.com" for PCs that was cheap and dirty in its
implementation. You can also find many implementations of the Klatt
synthesizer (but this doesn't include the text-to-sound algorithms).

You might also be able to find commercial demos that you could evaluate
to get a feel for how *good* it can be (which gives you a yardstick
against which to evaluate your particular implementation). I think
DECTalk was sold to Fonix many years ago. From there, it may have
moved to Sensimetrics? (Google would be your friend, here.)

It costs nothing to play with existing implementations (even COTS) and
get a feel for what that technology has to offer.
On 09/04/2015 12:36, Don Y wrote:
> [earlier exchange quoted in full -- snipped]
>
> My point was:
> - use a human being instead of a synthetic voice
> - hey, why not use the CUSTOMER?? To record the canned portions *and*
>   the "filled in blanks"! I.e., just let the customer record the
>   *entire* message -- canned and "filled in blanks"
>
> [barcode scanner example snipped]
I can't use this approach. The gadget is an interactive voice response
(IVR) system, so the sentences that it should say are:
- Press 1 to change settings
- Press 2 to read status
- Press 3 to read firmware version
- ...

Those kinds of sentences can't be recorded by the user.
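Prompts like these are naturally table-driven, which also makes it
cheap to re-synthesize the whole set in one batch whenever the wording
changes (keeping the voice consistent). A minimal sketch, with speak()
standing in for whatever playback or TTS path the device ends up using:

#include <stdio.h>

/* DTMF-keyed prompts; regenerating all of them in one batch whenever
 * the wording changes is what keeps the voice consistent. */
static const struct { char key; const char *prompt; } menu[] = {
    { '1', "Press 1 to change settings" },
    { '2', "Press 2 to read status" },
    { '3', "Press 3 to read firmware version" },
};

static void speak(const char *text) { printf("TTS: %s\n", text); }

int main(void)
{
    for (size_t i = 0; i < sizeof menu / sizeof menu[0]; i++)
        speak(menu[i].prompt);
    return 0;
}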
On 4/9/2015 6:13 AM, pozz wrote:
> [earlier exchange quoted in full -- snipped]
>
> I can't use this approach. The gadget is an interactive voice
> response (IVR) system, so the sentences that it should say are:
> - Press 1 to change settings
> - Press 2 to read status
> - Press 3 to read firmware version
> - ...
>
> Those kinds of sentences can't be recorded by the user.
Why not? Why can't your "setup" directions lead the user through this?

Or, why can't you store "factory default messages" for these and the
*other* messages that you described (e.g., the air conditioner) and
then let the user change all/none as he sees fit? Or, *only* allow him
to change the "air conditioner"-type messages?

    "Hello User, your air conditioner in location #1 has just
    switched off."

For some folks, this may be acceptable (they just have to remember the
identity of "location 1").

[But, then again, it's been argued that people don't have to "customize
things"...]

You have no problem mixing two different *types* (quality, sources,
etc.) of speech *synthesis*; why not let the user decide if he wants to
retain some messages in the "factory voice" while changing those that
require customization?
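That mix-and-match policy costs almost nothing to implement: give each
message slot a factory default and an optional user recording, and let
playback prefer the override. A sketch under those assumptions (all
names invented):

#include <stdio.h>

/* One slot per message: the factory text is always present, the user
 * recording is optional and wins when it exists (invented scheme). */
struct msg_slot {
    const char *factory_text;   /* synthesized in the "factory voice" */
    const char *user_recording; /* NULL until the user records one */
};

static void speak(const char *text) { printf("TTS: %s\n", text); }
static void play(const char *clip)  { printf("(playing %s)\n", clip); }

static void say(const struct msg_slot *m)
{
    if (m->user_recording)
        play(m->user_recording);
    else
        speak(m->factory_text);
}

int main(void)
{
    struct msg_slot ac = {
        "Hello User, your air conditioner in location 1 has just "
        "switched off.",
        NULL  /* this user kept the factory voice */
    };
    say(&ac);
    return 0;
}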
On 4/9/2015 1:07 AM, pozz wrote:
> On 09/04/2015 01:18, Don Y wrote:
>>> [Robert Wessel's SAPI suggestion snipped]
>>
>> I really don't understand the need for a compile time TTS! Why not
>> just *record* the speech and then encode it <however>? Why let an
>> (inferior) algorithm try to come up with "natural sounding" speech
>> when you could find a genuine human being to do this??
>
> Because it isn't difficult to record a good voice with a microphone at
> a computer, if you aren't a "voice talent".
Because it *is*? Or, *isn't*? I think users charged with this
"responsibility" would be willing to accept whatever quality they are
willing to invest in. I.e., some folks will make one pass at this
while others will refine their recordings. People have no problem
recording outgoing messages on answering machines, etc.
> Usually you'd like to hear a female voice, but I am male. I would
> have to engage a good female voice, try to record something, and fix
> the result. Maybe after some months I notice that it's better to say
> "Dear user" and not "Hello user", and I have to call the woman again
> -- but maybe she isn't available anymore.
Understandable -- *if* you insist on generating the speech at the factory. Note that speech synthesis doesn't always give you a choice as to the actual characteristics of the voice you'll employ: "pick from among these 12 voices", etc. You have far more flexibility in selecting a specific voice if you interview "voice talent".
>> [...]
>>
>> ISTM, that storing the canned portions in the same "bastardized
>> spellings" that were discussed up-thread and letting the TTS
>> synthesize *everything* would be the better approach. E.g., I run
>> *all* of my "canned text" through the TTS engine in my device just
>> to eliminate the burden on the developer of having to "precompile"
>> the "spoken output".
>
> If the result is good enough, this could be a good approach. Anyway,
> it's a pity to use a low-quality embedded TTS engine to pronounce 99%
> of the sentences when it's really needed only for the 1%.
Creating speech from text generally boils down to:

- text normalization
  * "expanding" abbreviations ("Mr" -> "Mister"; "etc" -> "etcetera";
    etc.)
  * spoken punctuation ('&' -> "and"; '%' -> "percent"; etc.)
  * encoding punctuation (',' -> phrase boundary; '.' -> sentence;
    etc.)
  * handling numerics
    = decimal numerals (1234; 41.09; .9; 0.75; 1,000,000; 000.3; 3.000)
    = ordinals ("1st"; "2nd"; "3rd"; "n-th"; etc.)
    = currency ($123; $123.00; $123.45; 37¢; RMB 37; etc.)
    = time/date (note cultural differences, here!)
    = Roman numerals ("Henry VII"; "Tom Smith II"; etc.)
    = non-decimal radices ("0xDeadBeef"; "027"; "16rFF"; etc.)
- word decomposition (stripping affixes to determine the root word as
  an aid to pronunciation: "flies" -> "fly"+'s'; contrast "pennies" ->
  "penny"+'s')
- letter-to-sound mapping
- stress assignment ("Berlin" -> "BUR lin" vs. "bur LIN")
- prosody (F0 contours at phrase and sentence level)

[I may have missed a step or two... :> Too early in the morning!]

And, of course, any application-specific additions ("4K7 resistor").
This latter is often where "the bear" wins (how do you speak seemingly
unrelated orthography?? e.g., diagnostics emitted by programs).

The bulk of the code in a *good* TTS deals with the first two ('-')
items. And, it is also where the bulk of the screwups eventually
occur! Getting to a point where you can *reliably* figure out what
sounds should be imposed on the "text" requires knowledge of what is
actually being *said*!

    "Dr. Jones lives on Jones Dr."
    (Doctor Jones lives on Jones Drive)

    "Nurse, please start an IV on Henry"
    (Nurse, please start an I V on Henry)

Once the text has been "disambiguated", mapping letters to sound is
considerably *less* problematic (though still challenging). Stress
assignment and other prosodic features largely affect the naturalness
of the speech. But, in their absence, a "motivated listener" can still
discern *intent*. Especially if the listener is familiar with the
context.

For the messages over which *I* (and, by analogy, *you*) have control,
I choose representations with which I know *my* synthesizer will
perform well. As the text from which the speech is synthesized is
largely algorithmically generated

    printf("The current volume level is %d %%.", volume);
    printf("Your MAC is %02x:%02x:%02x:%02x:%02x:%02x.", ...);

I can bias the algorithms in the "front end" of the TTS to exploit that
(i.e., it is far *less* likely for me to encounter "numbers" of the
form "1,234,567" than "1234567").

OTOH, there is the potential for some text (from external sources) that
may not have been chosen with TTS -- esp *my* TTS! -- in mind:

    "521 google.com does not accept mail"

Yet, I still need to be able to convey their content to the user,
unaltered.

In *your* case, you can similarly choose messages that "convert"
effectively. And, can fold much of the text normalization (etc.) into
your compile-time actions. E.g., "M A C" instead of "MAC" (so you
don't have to be able to recognize that abbreviation -- yet still cause
it to be "spoken" properly).

Likewise, you can "game" the pronunciation phase of the algorithm by
deliberately misspelling the desired text with foreknowledge of how
*your* algorithm operates. E.g., Americans pronounce many "ise" words
as "ize" while Brits treat it as a softer 's'; "Susanne" becomes
"Suzanne"; etc.

I contend that motivated users could easily make the same sorts of
adjustments in how they define *their* message content. Thus, greatly
simplifying your effort/algorithms.
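Since the canned text is under the developer's control at compile time,
most of that normalization can happen on the desktop, leaving the
firmware almost nothing to do. A sketch of such a desktop-side pass;
the expansion table is illustrative, not a real ruleset:

#include <stdio.h>
#include <string.h>

/* Desktop-side pre-normalization: expand the tokens the embedded
 * engine would otherwise need to recognize (illustrative list). */
static const struct { const char *from, *to; } expand[] = {
    { "MAC", "M A C" },   /* spell out the abbreviation */
    { "%", " percent" },
    { "&", "and" },
};

static void normalize(const char *in, char *out, size_t outlen)
{
    size_t used = 0;
    out[0] = '\0';
    while (*in && used + 1 < outlen) {
        size_t i, n = sizeof expand / sizeof expand[0];
        for (i = 0; i < n; i++) {
            size_t len = strlen(expand[i].from);
            if (strncmp(in, expand[i].from, len) == 0) {
                used += snprintf(out + used, outlen - used, "%s",
                                 expand[i].to);
                in += len;
                break;
            }
        }
        if (i == n) {  /* no expansion: copy the character */
            out[used++] = *in++;
            out[used] = '\0';
        }
    }
}

int main(void)
{
    char buf[128];
    normalize("The MAC is 00:11:22 & volume is 40%.", buf, sizeof buf);
    puts(buf);  /* The M A C is 00:11:22 and volume is 40 percent. */
    return 0;
}

The output string is what would actually be burned into the device, so
the embedded engine never needs to know what "MAC" or '%' means.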
Finally, if spoken messages are short and/or infrequent, you can
probably omit the stress assignment and prosodic features and just
speak in a monotone. Or, impose only the crudest processing in this
regard. You then end up with a single approach to speaking
*everything* -- instead of trying to marry two different
implementations/technologies.

As I said, before: play with some OTS (commercial or otherwise)
synthesizers and see what they sound like "crippled". The Reading
Machine had a *dreadful* synthesizer (Votrax 6.3) yet folks would learn
to listen to (i.e., "tolerate") it for hours at a time as it was the
only game in town! :-/