EmbeddedRelated.com

Constrained vocabulary speech synthesis

Started by Don Y June 3, 2014
Hi,

I have a "fallback" speech synthesizer that is used
when the primary speech synthesizer is unavailable.
Depending on the user's abilities, this may be the
*only* output modality available to him/her (i.e.,
he/she may not be able to perceive other available
modalities) so *everything* communicated to the
user must (potentially) pass through this channel.

As the synthesizer is only intended to be used when
the system is operating in a degraded mode, it doesn't
have to resolve a limitless vocabulary.  _For_the_most_part_,
I have complete control over what it will be
required to speak.  So, I can pass text to it that I
know to be devoid of characteristics that it would
be unable to handle "properly".

For example, "The Polish housekeeper who works for
Dr. Stephens in his house on Stevens Dr in Phoenix
bought some furniture polish for the credenza in his
office."

[with a tiny bit of effort, you can imagine lots of
similar constructs that require significant knowledge
of grammar, PoS, and other context to "get right".  Let
alone oddities like Billerica, Worcester, Usa, etc.]

But, there are other (external) text sources that I
can't as easily constrain.  So, I have to make a best
effort to cover those (unknown) inputs while not unduly
burdening the implementation (Sorry, Billerica!).

Adding context to the pronunciation, prosody, etc.
algorithms gets expensive, *fast*.  This is a small,
highly portable device with very limited resources
(CPU, memory, etc.) and extremely low power requirements
(has to operate for ~16 hours with a very small battery).

I am happy with the text rules that I have put in place.
They cover most "typical" input that the synthesizer is
likely to encounter.  Obviously, input that isn't
grammatically correct can be handled however the algorithm
likes (e.g., "The elephant are read?").

Recall that this synthesizer sees limited use -- so, the
user is *probably* unaccustomed to its quirks and other
idiosyncrasies.  Hopefully, the user *never* hears it
speak!  But, if he is in a situation where he is relying
on its speech, he's probably already annoyed (because
something else is "not working").  Encountering something
like "411 Length Required" probably won't find him very
willing to understand what was *intended* by that terse
message.

However, "numbers" seem to really benefit from context.
Often, a tiny bit of context is sufficient to enhance
the pronunciation (and, thus, comprehension).  But,
other times, you really need to understand what is
being said to know how best to speak the "number(s)".
E.g., "The 2300 block of State Street".

And, (from surveying users) there appear to be cultural
differences in how things (like numbers) are spoken.
E.g., "oh" vs. "zero"  (which even seems to vary *within*
a speaker's ruleset!), how/when numbers are read off as
strings of digits, use of "and" as a connective in numeric
values ("three hundred and ten" vs. "three hundred ten"),
and the value represented by 1,000,000,000.

The cop-out approach is just to recite strings of digits,
*always*.  But, try listening to "data" presented in this
form for even a few moments and you'll see how silly that
approach is!

"Your IP address is 10.0.1.223"
"Volume level 18 of 24"
"MAC address 23:C0:11:00:14:89"
"Signal strengths 23.1, 18.6, 8.5 and 33.0"
"Scheduled server maintenance at 03:00"
"Battery time remaining 3:12"
"Contact Dr Smith at (888) 555-1212 x3-1022"

I *don't* want to put any (other) signaling/control
information "in band" relying, instead, on *limited*
context to resolve these issues.

For example, requiring times of day to be indicated
with AM/PM (so "03:12" is NOT a time of day) and time
intervals *without* ("three hours and 12 minutes").
At the same time, I don't want to unnaturally burden
the algorithms that *create* (emit) these text strings
E.g., requiring all numeric values to be in scientific
notation or, to embed separators every three digits
(imagine how tedious it would be to have to process
numbers as *strings* in order to properly place commas
to separate thousands, millions, etc.).
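To make that burden concrete: even the "simple" chore of comma-grouping forces the emitter to treat values as strings, right to left. A sketch (hypothetical helper, not code from this project):

```c
#include <string.h>

/* Insert commas every three digits, right to left: "8921600002"
   becomes "8,921,600,002".  'out' must be large enough to hold the
   digits plus one comma per group.  This is the string shuffling the
   text *sources* would be saddled with, were grouping a requirement. */
void group_digits(const char *in, char *out)
{
    size_t n = strlen(in);
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        out[o++] = in[i];
        size_t left = n - 1 - i;        /* digits still to emit */
        if (left > 0 && left % 3 == 0)
            out[o++] = ',';
    }
    out[o] = '\0';
}
```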

Finally, I don't want to force unnatural presentations
that a user employing a *different* output modality
(e.g., a video display) would find tedious.  Imagine
requiring the text source to pass input of the form:
"Your Eye Pee address is ten dot zero dot one dot two
two three".

My questions:
- what other "number presentations" are likely encountered
   in an electronic device (e.g., IP, MAC, time-of-day,
   durations, phone numbers ARE; but bible references AREN'T)
- how do users *colloquially* pronounce numbers (e.g.,
   "0.1203", "101.05", "4005", "8921600002")
- other suggestions to make this easier on the user?
- pitfalls that other developers are likely to stumble on?

I hope I haven't missed anything (*obvious*)  :<  I am
amazed at how many different forms numbers take and how
much is "encoded" in our contextual awareness of them!

Time for bed...

Thx!
--don
On Tue, 03 Jun 2014 09:09:19 -0700, Don Y wrote:

> I have a "fallback" speech synthesizer that is used when the primary
> speech synthesizer is unavailable.  [...]
>
> I hope I haven't missed anything (*obvious*)  :<  I am amazed at how
> many different forms numbers take and how much is "encoded" in our
> contextual awareness of them!
If you had total control over the text, then a possible right answer
would be to give it text with embedded clues, or just phonemes.  I'm not
sure that isn't the right answer anyway, and just let it have problems
with the "alien" text.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
On Tue, 03 Jun 2014 18:09:19 +0200, Don Y <this@is.not.me.com> wrote:
> I have a "fallback" speech synthesizer that is used > when the primary speech synthesizer is unavailable. > Depending on the user's abilities, this may be the > *only* output modality available to him/her (i.e., > he/she may not be able to perceive other available > modalities) so *everything* communicated to the > user must (potentially) pass through this channel. > > As the synthesizer is only intended to be used when > the system is operating in a degraded mode, it doesn't > have to resolve a limitless vocabulary.
Why is there a degraded mode? Which parts of the system are intended to be operational in this mode?
> _For_the_most_part_, I have complete control over what it
> will be required to speak.  So, I can pass text to it that I
> know to be devoid of characteristics that it would be unable
> to handle "properly".
Text, or as Tim said, phonemes.
> [...]
>
> But, there are other (external) text sources that I
> can't as easily constrain.
What is the purpose of attempting to synthesize these?
> [...]
--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/
Hi Tim,
On 6/3/2014 9:39 AM, Tim Wescott wrote:
> On Tue, 03 Jun 2014 09:09:19 -0700, Don Y wrote:
>> I have a "fallback" speech synthesizer that is used when the primary >> speech synthesizer is unavailable. Depending on the user's abilities, >> this may be the *only* output modality available to him/her (i.e., >> he/she may not be able to perceive other available modalities) so >> *everything* communicated to the user must (potentially) pass through >> this channel. >>
>> However, "numbers" seem to really benefit from context. Often, a tiny >> bit of context is sufficient to enhance the pronunciation (and, thus, >> comprehension). But, >> other times, you really need to understand what is being said to know >> how best to speak the "number(s)". E.g., "The 2300 block of State >> Street". >> >> And, (from surveying users) there appear to be cultural differences in >> how things (like numbers) are spoken. E.g., "oh" vs. "zero" (which even >> seems to vary *within* a speaker's ruleset!), how/when numbers are read >> off as srings of digits, use of "and" as a connective in numeric values >> ("three hundred and ten" vs. "three hundred ten"), and the value >> represented by 1,000,000,000.
>> My questions: >> - what other "number presentations" are likely encountered >> in an electronic device (e.g., IP, MAC, time-of-day, durations, phone >> numbers, ARE but bible references AREN'T) >> - how do users *colloquially* pronounce numbers (e.g., >> "0.1203", "101.05", "4005", "8921600002") >> - other suggestions to make this easier on the user? >> - pitfalls that other developers are likely to stumble on? >> >> I hope I haven't missed anything (*obvious*) :< I am amazed at how >> many different forms numbers take and how much is "encoded" in our >> contextual awareness of them!
> If you had total control over the text, then a possible right answer
> would be to give it text with embedded clues, or just phonemes.  I'm
> not sure that isn't the right answer anyway, and just let it have
> problems with the "alien" text.
Yes, I had a similar "epiphany", originally.  But, like most siren
songs, it proved to be misleading.

Initially, I looked into *canning* all of the speech: "Hey, I know
everything it will *ever* say (nope!), so why not just *record* it?
LPC encode every utterance and omit *all* the run-time message
processing!"  I.e., just treat this as an "audio player".  By contrast,
a device designed for *visual* output need not be concerned with this
(or, messages could equivalently be large *bitmaps* painted on the
display, thereby eliminating the need for a font generator, etc.!)

[Of course, this fragments your code base as you now need vastly
different user I/O handlers for each type of device "at the abstraction
level" -- not just "reification".  You're now dealing with entire
messages as "objects".]

But, that leaves you with little control over the actual "voice"
(e.g., male/female/child/etc.) as well as the more specific
characteristics of the voice.  In addition to picking a voice that
suits their personality, users find that different characteristics of
a voice may help/hinder intelligibility (pitch, breathiness, etc.)

And, it fails miserably when tasked with "variable data"... it turns
into a giant "unit selection" problem as you try to piece together
words, numbers, etc., each potentially "recorded" with different pitch,
timing, prosody, etc.  (i.e., "I need to choose from among the several
recorded utterances of 'fifteen' the one that has the following stress
characteristics..." -- contrast how you pronounce "There are 15
children" vs. "It is 9:15").  [This is hard to simulate using your own
voice; but, very obvious when piecing together *recorded* voice
samples.]

The next (false) "epiphany" came when realizing that I could just
*encode* the speech "off-line" and store (stress-marked) phonemes.
As with the LPC approach, this eliminates lots of code to do the
text-to-sound conversion, stress assignment, prosody, etc.  Do
everything "at compile time".
Concentrate on the "voice" instead of the *content*.  This allows you
to exert some control over the characteristics of the voice (pitch,
breathiness, rate, etc.) that the LPC approach couldn't.  E.g., I can
choose how to pronounce a particular set of phonemes instead of relying
on an LPC encoded *recording* of those phonemes (sounds).  And, *how*
to pronounce them in a given context/utterance.

Words tend to have fewer phonemes than letters, so you could,
conceivably, encode a second of speech into ~6-8 bytes (speech is about
a 50bps channel) and still be able to tailor the "sound" to the user's
needs.

Removing the text-to-sound, stress, and prosody processing also
eliminates the buffers needed for "rewrite" rules (e.g., when you might
otherwise have to change an "interpretation" based on context; or, are
tasked with converting "3741" into "three thousand, seven hundred and
forty one" on-the-fly so that it can then be converted as any other
"words").  It also ensures every utterance is "proofed" before
deployment (they all have to go through the *offline*
converter/synthesizer before making their way into the code base).
No surprises after release!

But, this approach proved to be tedious for the developer.  You can't
just write:

    ASSERT(0 != charge);
    printf("Battery charge remaining: %d:%02d", charge/60, charge%60);

Instead, you have to piece together:

    ASSERT(0 != charge);
    speak(PHONEMIZATION("Battery charge remaining"));   // const
    speak(PHONEMIZATION(HALF_STOP));                    // const
    if (0 != charge/60) {
        speak(PHONEMIZATION(NUMBER[charge/60]));
        if (1 != charge/60) {
            speak(PHONEMIZATION("hours"));              // const
        } else {
            speak(PHONEMIZATION("hour"));               // const
        }
        speak(PHONEMIZATION("and"));                    // const
    }
    speak(PHONEMIZATION(NUMBER[charge%60]));
    if (1 != charge%60) {
        speak(PHONEMIZATION("minutes"));                // const
    } else {
        speak(PHONEMIZATION("minute"));                 // const
    }

[Also note that you can't unilaterally pluralize a noun by adding an
"s" sound to the end.
Consider:

    speak(PHONEMIZATION("hour"));     // const
    speak(PHONEMIZATION("s"));        // const -- voiced!

vs.

    speak(PHONEMIZATION("hours"));    // const

contrasted with

    speak(PHONEMIZATION("minute"));   // const
    speak(PHONEMIZATION("s"));        // const -- not voiced!

vs.

    speak(PHONEMIZATION("minutes"));  // const

I.e., *say* these to yourself.]

[You also have to hope the prosody/stress assignment of each of those
"const's" is appropriate for this presentation instance.  E.g., how you
pronounce "There is no battery time remaining" differs from how you
pronounce "Battery time remaining" in each of the above examples!]

So, it doesn't *really* save you anything -- you still have to map
between abstract numeric representations and *concrete* vocalizations.
I.e., you're back with the same set of questions that I posed.  And,
it's done NOTHING to deal with "text" coming from external ("alien")
sources.

  "What seems to be the problem, Ms. Cornali?"
  "My Gizmolator3000 isn't working!"
  "What is it *not* doing?"
  "Working!"
  "I mean, what is it telling you?"
  "Oh.  'Cannot connect to server'"
  "Yes, but *why* can't it connect to the server?"
  "That's what I'm calling YOU for!"
  "I mean, is there any other information provided?"
  "No.  It just keeps repeating that message every time I try!"

In reality, the server may have been reporting any of:

  "Scheduled maintenance.  Try again at 04:00"
  "Unauthorized MAC addr."
  "Error 938"
  "716"
  "^D"

(i.e., anything other than "Connected") but the device can't relay that
information to the user unless it already *knows* how to speak each of
those messages and, encountering a reply, blindly compares the reply to
its repertoire of stored pronunciations.  Instead, it maps all "FAIL"
responses from the server into "Cannot connect to server".

Of course, it can report on issues that it *knows* about "at design
time".  But, only for services existing at that time and in that
(known) *form*.
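The voicing trap above can be made concrete: English picks the plural suffix from the noun's final sound -- /s/ after a voiceless consonant ("minutes"), /z/ after a voiced sound ("hours"), and a whole extra syllable /ɪz/ after a sibilant ("buses"). A sketch, using ad-hoc phoneme names of my own rather than any real phoneme set:

```c
#include <string.h>
#include <stddef.h>

/* Pick the plural-suffix phoneme from the noun's final phoneme.
   Phoneme names here are ad-hoc strings, purely for illustration. */
const char *plural_suffix(const char *last_phoneme)
{
    /* sibilants take an extra syllable: "bus" -> "buses" */
    static const char *sibilant[]  = { "s", "z", "sh", "zh", "ch", "j" };
    /* remaining voiceless consonants take /s/: "minute" -> "minutes" */
    static const char *voiceless[] = { "p", "t", "k", "f", "th" };

    for (size_t i = 0; i < sizeof sibilant / sizeof sibilant[0]; i++)
        if (strcmp(last_phoneme, sibilant[i]) == 0)
            return "ih z";              /* /ɪz/ */
    for (size_t i = 0; i < sizeof voiceless / sizeof voiceless[0]; i++)
        if (strcmp(last_phoneme, voiceless[i]) == 0)
            return "s";                 /* /s/  */
    return "z";                         /* /z/ after voiced sounds: "hours" */
}
```

Even this only covers the regular cases; irregular plurals ("thief" / "thieves") still need per-word entries.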
That, of course, means it has to be burdened with all that knowledge
*and* the pronunciations of each (or, "user friendly" alternatives
tagged with additional information that could help "Support" resolve
the issue by making the *real* message available to them in this
"coded" form).

[Imagine if your native tongue was Swahili and every (error) message
presented to you was in Arabic; how would you converse with a Support
technician to resolve your problem -- even if he was fluent in
Spanish??]

Thankfully, many of the "text" issues can be resolved/avoided... or,
understood despite mispronunciation caused by ignorance of context.
E.g., if "Polish" was pronounced as "(furniture) polish" (or vice
versa) you might initially be puzzled by the example I previously
mentioned... *but*, you'd quickly sort it out and understand the
intended meaning.

The "numbers" are the real pisser.  :<  There is a lot of speech
encoded in numeric presentations.  E.g., when you encounter "5:00" you
*think* "five O'CLOCK".  Or, if the context suggests it represents a
time interval, you think "five MINUTES" (or, "five HOURS").  When you
encounter 6/15, you think "June fifteenth", not "June fifteen" or "six
slash fifteen" (or, perhaps, "six of fifteen").  You probably speak
"(800) 555-1212" differently than "(708) 555-1212" -- and both
differently than "800, 555, 1212".

I've been building a lexicon of "number formats" and their associated
"speaking formulae" iteratively -- throwing more and more "sample
input" at the algorithms to see which forms I handle suboptimally;
then, crafting recognizers for each to try to reduce the stilted
nature of that aspect of the speech.  I keep getting surprised by
forms that I haven't (yet) covered but that are surprisingly common!
Hence the first of the questions I posed ("what other 'number
presentations' are likely encountered in an electronic device").

The second question was an informal survey of speaking *patterns* for
numeric quantities.
For example, "0.1203" could be "zero point one two zero three", "zero point one two OH three", "point one two oh three", etc. Similarly, "101.05" could be conveyed as "one hundred one point zero/oh five", "one hundred AND one point zero/oh file", "one hundred one and five one-hundredths", "a hundred one point oh five", etc. *I* would pronounce "8921600002" as "eight nine two, one six zero, zero zero, zero two" (note the last four digits are treated as two *doublets* instead of a *triplet* followed by a singleton as might have been expected from the grouping of the preceeding digits!) Run samples by your friends and neighbors and see how each has a particular (usually inconsistent) set of rules for how they pronounce these! Third question looked for other ideas to help the user. E.g., I had initially included a "spell mode" (numbers *and* letters). But, this gets really tedious! And, brought me back to wondering, "if they have to resort to spellings of the messages, then the quality of the synthesis -- or, nature of the messages -- must be total crap!" So, fix the *real* problem! OTOH, it made sense to allow the user to change the voice to something better suited to his/her hearing and comprehension. Likewise, speaking rate. I am currently toying with the idea of a "word-at-a-time" mode so the user can "step" through words individually (instead of having to hear the entire message replayed) The last question anticipates what problems others writing code for this environment are likely to trip over. E.g., the example of "pluralizing" a noun is something an eager developer is likely to get wrong ("I can just store the singular forms of each noun and pluralize them by adding an 's'!") -- much the same way someone might naively pluralize "thief" as "thiefs". Or, thinking himself exceedingly clever and pronouncing large values "the way you were taught in school" ("eight billion, nine hundred twenty one million, six hundred thousand, two" for the aforementioned example). 
Never thinking about how much information that likely conveys to the
user IN THIS APPLICATION/SITUATION (i.e., it is unlikely that such a
value is intended to be interpreted as an ordinal in that sense; more
likely a "string of digits" is appropriate).

(sigh)  English is such a bastard language!  I'm sure there are other
languages that are far more *regular*!  :<

--don
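The doublet/triplet grouping described for "8921600002" can be captured by a small rule: read triplets, except that a final group of four splits into two doublets. A sketch (hypothetical `speak_groups()`, and only one of many plausible rule sets -- the survey answers suggest speakers are not even self-consistent):

```c
#include <string.h>
#include <stddef.h>

/* Chunk a digit string for reading aloud: triplets, except that a
   final group of four splits into two doublets, so "8921600002"
   groups as 892 / 160 / 00 / 02 ("eight nine two, one six zero,
   zero zero, zero two").  One plausible rule among many. */
void speak_groups(const char *digits, char *out)
{
    size_t rem = strlen(digits);
    size_t i = 0, o = 0;
    while (rem > 0) {
        size_t take = (rem == 4) ? 2 : (rem > 3 ? 3 : rem);
        if (o > 0)
            out[o++] = ' ';             /* pause between spoken groups */
        memcpy(out + o, digits + i, take);
        o += take; i += take; rem -= take;
    }
    out[o] = '\0';
}
```

Note the same rule reads "4005" as two doublets ("forty, oh five"), which matches one common colloquial pattern but certainly not all of them.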
Hi Boudewijn,

On 6/4/2014 5:16 AM, Boudewijn Dijkstra wrote:
> On Tue, 03 Jun 2014 18:09:19 +0200, Don Y <this@is.not.me.com> wrote:
>> [...]
>> As the synthesizer is only intended to be used when
>> the system is operating in a degraded mode, it doesn't
>> have to resolve a limitless vocabulary.
>
> Why is there a degraded mode?  Which parts of the system are intended
> to be operational in this mode?
The device is a "terminal", of sorts.  Normally, the (real) speech
synthesizer is located remotely and passes "audio" to the device.
However, if that synthesizer is "unavailable" (down, improper
authorization, comms failure, etc.) you still need to be able to tell
the user these things!

"Hmmm... I'm not getting any sound.  Have I got the volume turned down
too much?  Is the battery dead?  Why isn't the server responding?"

Think of an X terminal as a conceptual model.  You may have thousands
of fonts available on your font server.  But, the X terminal has to
have AT LEAST ONE built in to be able to talk with the user (e.g.,
during configuration/setup) *before* the terminal has access to the
font server!
>> _For_the_most_part_, I have complete control over what it
>> will be required to speak.  So, I can pass text to it that I
>> know to be devoid of characteristics that it would be unable
>> to handle "properly".
>
> Text, or as Tim said, phonemes.
See my reply to Tim.  Briefly, canned speech ("recorded" or phoneme
based) falls down on any messages with variable content: "Your IP
address is %d.%d.%d.%d", "The time is %d:%02d", "Volume level is
%d%%".  You still need something to "evaluate" numerics in a
particular context.

And, "alien" text (Tim's term) leaves you helpless.  How do I tell the
user that the server is refusing to accept his credentials?  Or, that
the server will be down for maintenance until 4:00PM and that the
server at A.B.C.D should be used in the interim?

I.e., I would have to be able to constrain *everything* that will
eventually talk to the device and fold all of the accommodations for
these external devices into the design of *this* device.  It's
undesirable to have those devices emit "phonemes" as they must now
accommodate every potential output modality that some (future) remote
device requires/supports.  E.g., should they also output their
messages in level two Braille for remote Braille displays?  (or, does
that Braille display have to pre-store all text that it could
potentially encounter and associated Braille equivalents?)
>> [...]
>>
>> But, there are other (external) text sources that I
>> can't as easily constrain.
>
> What is the purpose of attempting to synthesize these?
See above (and other reply).  The device can only speak things that it
knows about.  E.g., I can tell the user that the battery is low, signal
strength is insufficient for contact with server (move closer), error
rate is too high for the connection (local noise sources?), the
server's response time is too high (too many clients?  pick another
server??), etc.

But, I can't tell the user about issues that the (remote) server wants
to communicate -- unless I also constrain *it*!  (and never let it
evolve without requiring software updates of all potential clients).
So, if the guy maintaining the server brings it down for maintenance
and specifies:

  "Server down for maintnance [sic].  Contact Boudewijn Dijkstra for
  assistance at 813567 after 5:00PM"

What do I report to the user (other than "connection failed")?  He's
misspelled maintenance so any attempt (by me) to find a prestored
pronunciation will fail.

[I can require a preceding numeric "reply code" to assist the
"terminal" in determining the *intent* of the message (e.g., "925
Server Maintenance").  But, the balance of the message is unavailable
to the user.  So, when he/she goes looking for help to resolve his/her
problem (or, attempts to resolve it directly), he has little more than
this "message code" to go by.]

Time to make some ice cream!  Butter Pecan.  Mmmmm... fat city!

--don
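The bracketed "reply code" idea is the familiar SMTP/FTP convention: a leading three-digit code the machine can act on, followed by free text for humans. The client side might look like this sketch (hypothetical helper, not code from the thread):

```c
#include <stdlib.h>

/* Parse a leading 3-digit reply code ("925 Server Maintenance" -> 925),
   leaving *rest pointing at the human-readable remainder.  Returns -1
   when no well-formed code prefix is present. */
int reply_code(const char *msg, const char **rest)
{
    char *end;
    long code = strtol(msg, &end, 10);
    if (end != msg + 3 || code < 100 || code > 999)
        return -1;                      /* not a 3-digit code prefix */
    while (*end == ' ')
        end++;
    *rest = end;
    return (int)code;
}
```

The code survives typos in the free text ("maintnance") and gives the user *something* citable -- at the cost Don notes: the rest of the message is still unspeakable.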
On Wed, 04 Jun 2014 20:15:23 +0200, Don Y <this@is.not.me.com> wrote:
> On 6/4/2014 5:16 AM, Boudewijn Dijkstra wrote:
> [...]
>>> But, there are other (external) text sources that I
>>> can't as easily constrain.
>>
>> What is the purpose of attempting to synthesize these?
>
> See above (and other reply).  The device can only speak
> things that it knows about.  E.g., I can tell the user that
> the battery is low, signal strength is insufficient for
> contact with server (move closer), error rate is too high
> for the connection (local noise sources?), the server's
> response time is too high (too many clients?  pick another
> server??), etc.
These are issues that are directly helpful to the user, i.e. things that the user may be able to do something about.
> But, I can't tell the user about issues that the (remote)
> server wants to communicate -- unless I also constrain *it*!
> (and never let it evolve without requiring software updates
> of all potential clients).
These are issues that are most likely not directly helpful to the
user.  At this point the device might output perfectly synthesized
text, DTMF tones, a fax message; it doesn't really matter, as the user
cannot directly employ the information to make things work again.  In
other words, this kind of information is generally best relayed to a
helpdesk of some sort.  So, speech synthesis should not be an absolute
requirement.
Hi Boudewijn,

On 6/5/2014 1:20 AM, Boudewijn Dijkstra wrote:
> On Wed, 04 Jun 2014 20:15:23 +0200, Don Y <this@is.not.me.com> wrote:
>> [...]
>> But, I can't tell the user about issues that the (remote)
>> server wants to communicate -- unless I also constrain *it*!
>> (and never let it evolve without requiring software updates
>> of all potential clients).
>
> These are issues that are most likely not directly helpful to the
> user.  At this point the device might output perfectly synthesized
> text, DTMF tones, a fax message, it doesn't really matter as the user
> cannot directly employ the information to make things work again.  In
> other words, this kind of information is generally best relayed to a
> helpdesk of some sort.  So, speech synthesis should not be an
> absolute requirement.
How does the user *know* what the device is wanting to tell him in
order to relay that information to the help desk?  I.e., you have to
get the information *to* the user before he can relay it to "Support".

Perhaps have the device store the error message in FLASH and have the
user snail mail the device to Support?  :)
On Thu, 05 Jun 2014 10:56:20 +0200, Don Y <this@is.not.me.com> wrote:
> On 6/5/2014 1:20 AM, Boudewijn Dijkstra wrote:
>> [...]
>> These are issues that are most likely not directly helpful to the
>> user.  At this point the device might output perfectly synthesized
>> text, DTMF tones, a fax message, it doesn't really matter as the
>> user cannot directly employ the information to make things work
>> again.  In other words, this kind of information is generally best
>> relayed to a helpdesk of some sort.  So, speech synthesis should
>> not be an absolute requirement.
>
> How does the user *know* what the device is wanting to tell him
> in order to relay that information to the help desk?  I.e., you
> have to get the information *to* the user before he can relay it
> to "Support".
Assuming that the device is not subdermally implanted, the user doesn't
need to hear or understand the information at all! The device could say:
"Please hook me up to a phone line, I wish to send a problem report" or
something similar. Then the user could listen in and wait for the
exchange to finish.

--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/
Hi Boudewijn,

On 6/5/2014 6:10 AM, Boudewijn Dijkstra wrote:

[attrs elided]

>>>> But, I can't tell the user about issues that the (remote)
>>>> server wants to communicate -- unless I also constrain *it*!
>>>> (and never let it evolve without requiring software updates
>>>> of all potential clients).
>>>
>>> These are issues that are most likely not directly helpful to the user.
>>> At this point the device might output perfectly synthesized text, DTMF
>>> tones, a fax message, it doesn't really matter as the user cannot
>>> directly employ the information to make things work again. In other
>>> words, this kind of information is generally best relayed to a helpdesk
>>> of some sort. So, speech synthesis should not be an absolute
>>> requirement.
>>
>> How does the user *know* what the device is wanting to tell him
>> in order to relay that information to the help desk? I.e., you
>> have to get the information *to* the user before he can relay it
>> to "Support".
>
> Assuming that the device is not subdermally implanted, the user doesn't
> need to hear or understand the information at all! The device could say:
> "Please hook me up to a phone line, I wish to send a problem report" or
> something similar. Then the user could listen in and wait for the
> exchange to finish.
So now the device needs to be able to connect to a phone line
(acoustically or otherwise) *and* there needs to be a phone line
*handy* (as well as accessible to the device's dialing capabilities
-- e.g., not behind a PBX). And, to know how to report
dialing/connection problems there, as well. All this just to avoid
being able to convey "alien" messages and pronounce numbers in
various formats in an intelligent manner?

One scheme I could adopt is to just have the server return a
"result code" for *every* condition. Those codes known to the
device at time of manufacture can be explained (*by* the device)
to the user. Those *unknown* can simply be conveyed to the user
as a "number". That puts the burden on the device to know how to
explain each error code. ("error codes"... welcome to the 60's! :<)

And, it also "emasculates" the communication medium used between
the device and the user. It would be like using a full graphic CRT
to display BLINK CODES (i.e., flash the screen!) to the user in
much the same way a diagnostic LED would, instead of taking
advantage of its inherent capability to display textual/graphic
messages!

It seems easier/safer to allow the server to explain itself and
report any condition that *it* deems significant -- knowing that
the device can relay that to the user effectively.

Having "capable" speech in the device allows intelligent agents to
be interposed between, e.g., the *real* speech synthesizer and the
client device -- without *also* giving them speech (and every other
output modality supported!) capabilities. An agent could, for
example, "notice" that the speech synthesizer is unusually busy and
report this fact to the client using a *text* channel (to the
fallback synthesizer in the device).

A colleague dropped a tarball of regex's in my mbox this morning
that *looks* like it could address more numeric (and text) formats
than I could ever make use of (e.g., SSN's, phone numbers in
dialing plans for different countries, postal codes, etc.).
Hopefully, I can disambiguate those that are of significant interest
(in my market). Then, work them into my front-end "parser" and have
my problem "solved"!

I should have considered that avenue before posting:
"patterns --> regex's"! D'uh... :<

Thx!
--don
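[For readers following along: the "patterns --> regex's" front-end idea
above can be sketched roughly as below. This is a minimal Python
illustration, *not* the colleague's actual tarball; the pattern set,
the tag names, and the `classify` helper are all hypothetical. The
idea is that the parser tags each numeric token with a format class,
and the prosody stage then picks a reading style (e.g., digit-by-digit
for SSNs and phone numbers, "eighteen twelve" style for years).]

```python
import re

# Illustrative patterns only -- a real front-end would carry many more
# formats (international dialing plans, dates, currencies, etc.).
PATTERNS = [
    ("ssn",      re.compile(r"^\d{3}-\d{2}-\d{4}$")),               # 123-45-6789
    ("us_phone", re.compile(r"^(?:\(\d{3}\)\s?|\d{3}-)\d{3}-\d{4}$")),  # (602) 555-1212
    ("zip",      re.compile(r"^\d{5}(?:-\d{4})?$")),                # 85001 or 85001-1234
    ("year",     re.compile(r"^(?:1[6-9]|20)\d{2}$")),              # 1600-2099
]

def classify(token: str) -> str:
    """Tag a numeric token with a format class for the prosody stage.

    Patterns are tried in order; the first match wins.  Anything
    unrecognized falls back to the generic "number" reading.
    """
    for tag, pattern in PATTERNS:
        if pattern.match(token):
            return tag
    return "number"
```

Order matters when patterns overlap (a 4-digit year is also a plain
number), so the list is checked most-specific first and the generic
"number" reading is the fallback -- matching the "unknown codes are
just spoken as a number" scheme described above.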
Hi Don,

On Tue, 03 Jun 2014 09:09:19 -0700, Don Y <this@is.not.me.com> wrote:

>[with a tiny bit of effort, you can imagine lots of
>similar constructs that require significant knowledge
>of grammar, PoS, and other context to "get right". Let
>alone oddities like Billerica, Worcester, Usa, etc.]
Let's just agree that Massachusetts is hopeless and write it off.
Even making allowances for pronunciation, poor diction and odd
colloquialisms, there are too many MA natives who badly misspeak
[including many who theoretically have been well educated].

And don't pick on poor Worcester: it's a nice city ... in England ...
that isn't responsible for what Massachusetts did to its name.
Historically it was pronounced the same as the English city - as were
Gloucester, Medford, Woburn, Salisbury, etc. The scratch-your-head
"huh?" pronunciations are all post Revolution (some post 1812).
>Thx!
>--don
George

[I can mock MA because I'm from MA: I was born there and I live there
currently. Thankfully, during my formative years I lived elsewhere.]