
Resolving typos in research pubs

Started by Don Y January 23, 2015
> _From Text to Speech: The MITalk System_ discusses Allen, Hunnicutt
> and Klatt's work -- in reasonable detail.
> Note that the Votrax VS6 predates KlattTalk (MITalk, DECtalk).
Allen was publishing from 1968 onwards; 1972 was roughly the time Gagnon filed his first patent. The MIT system is the work of several people over many years. Klatt did the transfer to a commercial product, and the success of Votrax did help him there. As far as I know, Votrax didn't do much text-to-speech research. NRL was using a VS6 because that was the only commercially available front-end at the time, so Votrax was lucky to get a TTS system back that way. Whatever the software was, I can't see the SC01A, or for that matter the VS6, matching up to the more expensive DECtalk. Since they had to be (or were intended to be) somewhat backward compatible, the SSI263 was limited too. That said, I have never heard both systems. But for general applications I am rather skeptical of anything claiming TTS and being small.
> Note that folks who listened to the KRM (Votrax) soon developed a
> much improved sense of comprehension.
That I do not doubt. But typical users were blind people, and that's a closed user group. Some of the old military communication systems (CVSD at 16 kbit/sec; early LPC10) did work too, within their limits, but would not be viable for general commercial application.
>> You are on that road too.
> You appear to have missed the comment in my initial post:
> "As I can operate them in semi-limited domains (since I *tend*
> ------------------------------^^^^^^^^^^^^^^^^^^^^
> to be the source of the text they utter),
For a typical embedded speech output application, the selling point of the SC01A (maybe plus TTS) was: the user can create unlimited speech rapidly. For TI/LPC and Mozer, the IC was a bit cheaper than the SC01A and the quality much better, but getting it done was slow and expensive. I have no doubt that one can technically create a hybrid version, but I am unclear where the commercial advantage would be. Regards, JRD
On 1/25/2015 10:26 AM, Rafael Deliano wrote:
>> CVSD was used in video games in the early 80's. Even clocking
>> the devices at their maximum rates left much to be desired in the
>> quality of the speech that resulted.
>
> CVSD at 16 kbit/sec, as the military used it, was poor.
> We used CVSD at 24 kbit/sec in answering machines sold by Philips and
> AEG for cars in 1985. These were the early analog cellular radios
> for the telephone network here in Germany. CVSD is noisy, but the
> analog radio link was noisy too, so the public didn't mind.
Likewise its use in video (arcade) games -- the consumer is more titillated than concerned by the poor quality of the speech. Similarly, visually impaired users could make *huge* allowances to gain "on demand" access to print material -- the alternative being to find someone to sit down and *read* it to them!
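As an aside, for anyone who hasn't met it: CVSD tracks the waveform with one bit per sample -- the bit says "estimate too low/too high" -- growing the step size on runs of identical bits (slope overload) and letting it decay otherwise. At 16 kbit/sec that's just one bit per 16 kHz sample, which is why the military-rate codec sounded so rough. A minimal sketch in Python; the step constants and the 3-bit run detection are illustrative assumptions, not the behavior of the Harris/CML parts mentioned above:

    # Toy CVSD codec. All constants are guesses chosen for clarity.
    def cvsd_encode(samples, min_step=10.0, max_step=1280.0, decay=0.96):
        estimate, step, bits, history = 0.0, min_step, [], []
        for x in samples:
            bit = 1 if x >= estimate else 0        # one bit per input sample
            bits.append(bit)
            history = (history + [bit])[-3:]       # remember the last 3 bits
            if len(history) == 3 and len(set(history)) == 1:
                step = min(step * 1.5, max_step)   # slope overload: grow step
            else:
                step = max(step * decay, min_step) # otherwise decay the step
            estimate += step if bit else -step     # integrator tracks input
        return bits

    def cvsd_decode(bits, min_step=10.0, max_step=1280.0, decay=0.96):
        # Mirrors the encoder's adaptation exactly, so only bits need be sent.
        estimate, step, out, history = 0.0, min_step, [], []
        for bit in bits:
            history = (history + [bit])[-3:]
            if len(history) == 3 and len(set(history)) == 1:
                step = min(step * 1.5, max_step)
            else:
                step = max(step * decay, min_step)
            estimate += step if bit else -step
            out.append(estimate)
        return out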
> Simple chips from Harris and CML, nothing complicated.
> This was a viable application with many thousand units sold.
> The company did earlier try to sell products based on the
> SC01A but there was no market for that sort of speech.
Wrong capabilities match. You didn't need the unlimited vocabulary that it provided.
>> "Dr. Reed had already read the Polish language book that he >> was reading when I drove up to his house on Reading Dr. and >> caught him polishing off a small brandy." > > I doubt that messages of that length are typical for commercial > applications.
The point of the example was to illustrate how high a bar is set if you want to accurately speak the sorts of sentences that we encounter on a daily basis:

Dr. (doctor) vs. Dr. (drive)
Polish (nationality) vs. polish ("to make shiny")
Read (past tense) vs. reading (present) vs. Reading (pronounced "redding")
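Even the "Dr." pair alone forces a front-end to look at context rather than isolated tokens. A toy sketch in Python -- the two rules are hypothetical and far cruder than what MITalk/DECtalk actually do, but they show why simple token lookup can't work:

    import re

    def expand_dr(text):
        # A capitalized word *after* "Dr." suggests a title ("Dr. Reed").
        text = re.sub(r"\bDr\.\s+(?=[A-Z][a-z])", "Doctor ", text)
        # A lowercase-ending word *before* "Dr." suggests a street name.
        text = re.sub(r"(?<=[a-z] )Dr\.", "Drive", text)
        return text

    print(expand_dr("Dr. Reed drove up Reading Dr. and stopped."))
    # -> Doctor Reed drove up Reading Drive and stopped.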
>> You have to know *every* utterance that you are likely to
>> encounter "at run time".
>
> The typical commercial speech output applications do not
> have an "at run time" requirement. The biggest vocabulary
> I can think of is speech output in GPS car navigation.
How would you read incoming email to a user? Or, even tell the user that the connection to the IMAP (eye-map vs I M A P) server has timed out -- or their credentials have been rejected (user name/password)? Or, that the server is down for repair and expected to be up again at 1:00PM DST (D S T? Daylight Savings Time?)? Or, that there are 23 minutes of estimated battery life remaining? Or, that the signal quality is poor due to obstructions in the RF signal path?

Notice how GPS units choke on many of the street names? Or, even names of towns (e.g., "Berlin" is pronounced in two different ways here -- US)? Other towns defy any logical pronunciation (Worcester -> "wooster"; Billerica -> "billrika"; Phoenix -> "feenigs").

I.e., you are thinking in terms of *fixed*, closed (limited domain) applications. I'm addressing a wider class of applications in a single "device" (in much the same way that the KRM had to address a wider range of *books*).
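The usual band-aid is an exception lexicon consulted *before* any letter-to-sound rules ever run -- a sketch in Python (entries drawn from the examples above; the rule-based fallback is only a stub here):

    # Two-stage lookup: exceptions first, LTS rules as the fallback.
    EXCEPTIONS = {
        "worcester": "wooster",
        "billerica": "billrika",
        "imap":      "eye-map",                 # speak as a word, not letters
        "dst":       "daylight saving time",
    }

    def pronounce(word):
        key = word.lower()
        if key in EXCEPTIONS:
            return EXCEPTIONS[key]
        return letter_to_sound(key)             # rule-based fallback

    def letter_to_sound(word):
        # Stub: a real implementation applies context-sensitive rewrite
        # rules (see the LTS sketch later in this thread).
        return word

The pain, of course, is that the exception list never stops growing -- every town name is a potential new entry.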
> Digitalker demo boards had the complete vocabulary in ROM, but
> that predates FLASH. For a small fixed-language application,
> the user has to have access to a database that contains "all" words.
> He selects what he needs and downloads them to his hardware.
Yes, and he is then *constrained* to speak only those words. If the application needs him to speak some *other* word, he can't. I.e., the application must be closed/limited domain.
> Of course there is the old problem of getting the words
> done in a recording studio. But (copyright niceties aside)
> I have found language-trainer CDs, where someone utters
> a word in German and then in English, reasonably good testing
> material. True, "the most common 2000 words of the English language"
> do not contain everything the application vocabulary requires. But
> then some butchering ...
Again, your butchering won't help you if you want to say "research" or "pubs" if those words don't already exist in your vocabulary. You end up (at best) trying to paste together phones from words that you have on hand (unit selection) -- without the benefit of (e.g.) a diphone representation (so, it sounds like you pieced together fragments of words).

Again, if you have a limited domain application, you can "can" all of your speech. E.g., our answering machine sounds very natural when it tells you that no one is home. OTOH, when it tries to tell me how many messages are waiting, it starts to sound artificial. And, I surely wouldn't expect it to be able to tell me *who* called (by looking up the CID and reciting the name of the caller to me).

As you move towards larger/unconstrained vocabularies, you *need* a mechanism to decide how to pronounce the text you are encountering. *Spelling* everything is unacceptable: "You received a call from R A F A E L,,, D E L I A N O today at 3:27PM"

Trying to handle *all* possible text *correctly* leads to an implementation that is bloated -- and *STILL* will only address a particular *style* of speech. E.g., how would you tell the user that his stored password is "puppy"? Or, "JhfD@f5%"?
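I.e., *something* upstream has to decide when speaking a token is hopeless and spelling is the lesser evil. A heuristic sketch in Python -- the thresholds are guesses for illustration, nothing more:

    import re

    def say_or_spell(token):
        letters = re.sub(r"[^A-Za-z]", "", token)
        plausible = (
            letters == token                    # no digits/punctuation inside
            and len(letters) >= 2
            # English-like vowel ratio
            and 0.2 <= sum(c in "aeiouyAEIOUY" for c in letters) / len(letters) <= 0.8
        )
        return token if plausible else " ".join(token)

    print(say_or_spell("puppy"))     # -> puppy              (speak it)
    print(say_or_spell("JhfD@f5%"))  # -> J h f D @ f 5 %    (spell it)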
On 1/25/2015 10:28 AM, Rafael Deliano wrote:
>> _From Text to Speech: The MITalk System_ discusses Allen, Hunnicutt
>> and Klatt's work -- in reasonable detail.
>
>> Note that the Votrax VS6 predates KlattTalk (MITalk, DECtalk).
>
> Allen was publishing from 1968 onwards; 1972 was roughly
> the time Gagnon filed his first patent. The MIT system is the
> work of several people over many years. Klatt did the transfer
> to a commercial product, and the success of Votrax did help him there.
> As far as I know, Votrax didn't do much text-to-speech research.
> NRL was using a VS6 because that was the only commercially available
> front-end at the time, so Votrax was lucky to get a TTS system back
> that way. Whatever the software was, I can't see the SC01A, or for
> that matter the VS6, matching up to the more expensive DECtalk. Since
DECtalk is more expensive, physically larger and (probably) draws more power (I'd have to check the details, for sure). E.g., the VS6 fit in a box about 60% of the size of the DTC-01. The Votrax packaging was undoubtedly more robust (though a "custom" DECtalk could probably have been created to be more durable and more readily *integrated* into a product).

The DECtalk Express is *much* smaller than the original DECtalk -- probably a factor of 5 volumetrically (60% of a "carton" of cigarettes) -- with greatly reduced power consumption (i.e., battery powered). But, comparing solutions of one timeframe to those of another is never fair.

Gagnon was building hardware filters, so his solution would ALWAYS be constrained to that which could be reified AS a hardware filter. Whether it was a bunch of op-amps and discretes in a potted case -- or a more "integrated" approach -- it was still a genuine filter with *only* the type of tuning and transitioning present that he could realize *in* hardware. There are no "smarts" in the design.
> they had to be (or were intended to be) somewhat backward compatible,
> the SSI263 was limited too.
> That said, I have never heard both systems.
All formant-based synthesizers sound largely the same. You'd be hard pressed to tell the difference between a Votrax and a DECtalk if operating in the same "voice" (i.e., basic formants). However, DECtalk embodies a set of LTS rules whereas Votrax relies on "something external" to push phoneme codes at it. So, the speaking *pattern* of a Votrax is largely defined by the LTS algorithm that drives it. E.g., McIlroy's rules "sound different" from NRL's when you explore a greater variety of input samples. Push phoneme codes (from your favorite LTS front-end) at the DECtalk and you'd wonder if it wasn't the Votrax!

By contrast, diphone synthesizers sound like real people (because the diphones were recorded from real speech). So, it's more a question of how good the unit selection algorithm is at piecing together "compatible" diphones -- without sounding like it is "piecing together diphones" :> But, you then need to be able to store a large diphone inventory! And, if the user doesn't *like* that "voice", there are limited remedies available to you (as building a new voice isn't just "tweaking a few settings").
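For the curious: LTS rule sets of that era (e.g., the NRL rules of Elovitz et al., 1976) are ordered, context-sensitive rewrites over the letter string, first match wins. A toy matcher in Python with four illustrative rules -- the real tables run to several hundred:

    # Each rule: (left context, fragment, right context, phoneme string).
    # "$" marks end of word; empty contexts match anywhere.
    RULES = [
        ("", "ch", "",  "CH"),
        ("", "ee", "",  "IY"),
        ("", "e",  "$", ""),      # final silent e
        ("", "e",  "",  "EH"),
    ]

    def lts(word):
        word, out, i = word.lower(), [], 0
        while i < len(word):
            for left, frag, right, phon in RULES:
                if not word.startswith(frag, i):
                    continue
                if left and not word[:i].endswith(left):
                    continue
                after = word[i + len(frag):]
                if right == "$" and after:
                    continue                    # must be at end of word
                if right not in ("", "$") and not after.startswith(right):
                    continue
                if phon:
                    out.append(phon)
                i += len(frag)
                break
            else:
                out.append(word[i].upper())     # no rule fired: pass through
                i += 1
        return " ".join(out)

    print(lts("cheese"))   # -> CH IY S   ('s' merely passed through)

Two different rule tables driving the *same* formant hardware will produce recognizably different "accents" -- which is the point above about McIlroy vs. NRL.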
> But for general applications I am rather skeptical of anything
> claiming TTS and being small.
Again, you're ignoring my "semi-limited domain" qualification!
>> Note that folks who listened to the KRM (Votrax) soon developed a
>> much improved sense of comprehension.
>
> That I do not doubt. But typical users were blind people, and that's
> a closed user group. Some of the old military communication systems
> (CVSD at 16 kbit/sec; early LPC10) did work too, within their limits,
> but would not be viable for general commercial application.
>
>>> You are on that road too.
>> You appear to have missed the comment in my initial post:
>> "As I can operate them in semi-limited domains (since I *tend*
>> ------------------------------^^^^^^^^^^^^^^^^^^^^
>> to be the source of the text they utter),
>
> For a typical embedded speech output application,
> the selling point of the SC01A (maybe plus TTS) was:
> the user can create unlimited speech rapidly.
> For TI/LPC and Mozer, the IC was a bit cheaper than the SC01A
> and the quality much better, but getting it done was slow and
> expensive.
> I have no doubt that one can technically create a hybrid version,
> but I am unclear where the commercial advantage would be.
Samples of the output I have to address (and the text strings that would be presented to the TTS subsystem for each of them):

    The current time is 12:34PM, MST.
    Today is Sunday, 25 Jan 2015.
    Volume level: 45%
    Balance: 30% left of center
    Voice selection: "casual2"
    Voice parameters: <yada, yada, yada>
    Battery life remaining: 4:23
    Estimated recharge time: 1:47
    Relative signal strengths: A=12.4; C=15.0; D=3.5
    Using beacon C.
    MAC address 12:34:56:78:9A:BC
    Servers available: "...", "...", "..."
    Attempting connection to server "..."
    Service unavailable.
    The server replied "..."
    Access denied. The server replied "..."
    User ID is "rafael_deliano".
    Current passphrase is "^gF4WxKK98".

Plus, of course, the dialogs (and "help") to adjust each of these settings...

It's *easy* to get "quality" speech for the "known" (a priori) text portions of these prompts. Even the numerical parameters are relatively easy to encode -- *if* you are conscious of how they will be spoken when you adopt them! (e.g., "45%" instead of "0.45")

But, once you accept unconstrained input (e.g., "The server replied..."), you have no way of accommodating that "unknown" -- short of resorting to spelling the contents of the message: "S E R V E R D O W N F O R R O U T I N E M A I N T E N A N C E. T R Y A G A I N L A T E R. F O R A S S I S T A N C E , C O N T A C T B I L L S L Y A T ( 7 0 8 ) 5 5 5 - 1 2 1 2 e x t 5 4 7"

Similarly, do you want to pass ASCII strings to the speech subsystem and have *it* know that it shouldn't even *try* to pronounce "^gF4WxKK98" because that will only result in the user requesting clarification? Or, do you want to have to invoke a different interface to the speech subsystem when the *content* you are passing should be treated differently? (e.g., "45%" instead of "forty-five percent")

This is why restricting yourself to a limited domain approach is tedious and ineffective. You can't just treat the speech subsystem as an output device but, rather, have to be aware of what you are likely to be saying in each invocation. Or, restrict yourself to simply not outputting things that it won't be able to say "well" ("Access denied. The server said something but I have no idea what it was.")
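E.g., the kind of normalization pass those prompts imply -- and note that "4:23" of battery life *must* be read as a duration while "12:34PM" is clock time, knowledge the application has and a dumb output device doesn't. A sketch in Python (word tables truncated to two-digit numbers; the duration reading is assumed throughout):

    import re

    ONES = ["zero","one","two","three","four","five","six","seven","eight",
            "nine","ten","eleven","twelve","thirteen","fourteen","fifteen",
            "sixteen","seventeen","eighteen","nineteen"]
    TENS = ["","","twenty","thirty","forty","fifty","sixty","seventy",
            "eighty","ninety"]

    def number_words(n):                        # 0..99 covers these prompts
        if n < 20:
            return ONES[n]
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")

    def normalize(text):
        # "45%" -> "forty-five percent"
        text = re.sub(r"\b(\d{1,2})%",
                      lambda m: number_words(int(m.group(1))) + " percent",
                      text)
        # "4:23" -> "four hours twenty-three minutes" (duration assumed!)
        text = re.sub(r"\b(\d{1,2}):(\d{2})\b",
                      lambda m: number_words(int(m.group(1))) + " hours "
                              + number_words(int(m.group(2))) + " minutes",
                      text)
        return text

    print(normalize("Volume level: 45%"))
    # -> Volume level: forty-five percent
    print(normalize("Battery life remaining: 4:23"))
    # -> Battery life remaining: four hours twenty-three minutes

Feed it "MAC address 12:34:56:78:9A:BC" and the same colon rule misfires -- which is exactly why the caller, not the speech subsystem, has to flag how each field should be treated.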