Hi,
Sorry for the poor choice of subject line. :(
I have a few different speech synthesizers that I've been refining
for a product. All are intended to be *very* lightweight (minimal
run-time resources). Voice quality, pronunciation, etc. aren't essential
(but I don't want to deliberately hamper performance).
As I can operate them in semi-limited domains (since I *tend* to be
the source of the text they utter), I've opted to adopt some of the
classic approaches to the text-to-phoneme portion of the algorithms
instead of trying to begin a study in linguistics, etc. (even
bloated synthesizers have problems with speech so why waste effort
trying to *approach* their performance levels with 1% of *their*
resources?!)
Most of the work I'm adopting is decades old. Current trends rely
on having *lots* of resources available (big dictionaries, MIPS, etc.)
so they've all gone off in a different direction. So, contacting
original authors is a dubious proposition ("Hey, do you remember that
work you did 30 years ago? I've got some niggly little detail that
I need help resolving. Off the top of your head...")
[I've had *some* success -- thx DM!]
Many of the documents are N-th generation photocopies, fiche, etc. So,
lots of artifacts in them ("Is that a speck of lint or a backslash?")
But, I've been able to resolve many of the "unintended additions" with
a bit of careful examination of the details involved. The worst
cases are the long lists of (hundreds of) rules -- any of which might
be corrupted by a speck of paper lint, a crease in the original when
it was photocopied, etc.
But, there are some things that simply can't be attributed to copying
errors. I.e., cases where glyphs are obviously *missing*. And, others
where something is present in a legend -- yet never occurs elsewhere!
Still other ambiguities exist (Is this instance of "YL" to be interpreted
as the legend symbol "YL"? Or, as the legend symbol "Y" followed by the
legend symbol "L"? And, what is the *effective* difference??)
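FWIW, the mechanical way I've been probing the two readings is a greedy
longest-match scan over the legend symbols: if "YL" is tried before "Y"
and "L", the two interpretations tokenize differently, and running the
ruleset both ways exposes the *effective* difference. A minimal sketch
in C (the symbol table is made up, purely for illustration):

#include <stdio.h>
#include <string.h>

/* Hypothetical legend, listed longest-first so "YL" wins over "Y"+"L".
   Reverse the order (or drop "YL") to test the other reading. */
static const char *legend[] = { "YL", "Y", "L", NULL };

static void tokenize(const char *s)
{
    while (*s) {
        int i, matched = 0;
        for (i = 0; legend[i] != NULL; i++) {
            size_t n = strlen(legend[i]);
            if (strncmp(s, legend[i], n) == 0) {
                printf("<%s>", legend[i]);
                s += n;
                matched = 1;
                break;
            }
        }
        if (!matched)
            printf("<?%c>", *s++);   /* unknown glyph: flag it */
    }
    putchar('\n');
}

int main(void)
{
    tokenize("YL");   /* prints <YL>; with "YL" removed: <Y><L> */
    return 0;
}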
[This sort of crap happens when people aren't careful preparing docs.
And, when they don't (or can't?) "cut and paste" from the ACTUAL
SOURCE CODE into the final documentation but, instead, try to transcribe
things manually: "Is that a lowercase L or a digit 1?"]
I *think* the only way I can *hope* (no guarantee) to resolve these sorts
of things is to throw lots of data at it and hope to see a pattern in the
failure(s) that result. Perhaps even instrumenting my code so that I can
flag each datum that tickles a "suspicious rule". Then, hope I can
fathom what they have in common and how to resolve the error.
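Concretely, the instrumentation amounts to a hit counter and a
"suspect" flag per rule, plus a log of every word that fires a suspect
rule. A sketch in C (the rule layout is a stand-in, *not* the actual
ruleset format):

#include <stdio.h>

struct rule {
    const char *pattern;    /* whatever the matcher consumes        */
    const char *phonemes;   /* what the rule emits                  */
    int         suspect;    /* 1 = dubious transcription in the doc */
    long        hits;       /* how often it fired                   */
};

static struct rule rules[] = {
    { "TION", "SH AH N", 0, 0 },
    { "YL",   "IH L",    1, 0 },    /* lint speck? backslash? */
};

static void rule_fired(struct rule *r, const char *word)
{
    r->hits++;
    if (r->suspect)
        fprintf(stderr, "SUSPECT %s fired on \"%s\"\n",
                r->pattern, word);
}

int main(void)
{
    rule_fired(&rules[1], "acetyl");   /* stand-in for the real matcher */
    return 0;
}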
This is complicated by the fact that the algorithms aren't "perfect"
to begin with. So, the idea of comparing computed pronunciations
against a *dictionary* of pronunciations would be ineffective as it
would flag all of the "semi-acceptable" pronunciations as "errors".
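If I were forced to score against a dictionary anyway, the comparison
would have to be *tolerant* -- e.g., phoneme-string edit distance with
a threshold instead of exact match -- so "semi-acceptable" outputs
don't get counted as hard failures. A rough sketch in C (the symbols
are ARPAbet-ish and the threshold is pulled out of thin air):

#include <stdio.h>
#include <string.h>

#define MAXPH 64

/* Levenshtein distance over phoneme tokens (not characters). */
static int edit_dist(const char *a[], int na, const char *b[], int nb)
{
    int d[MAXPH + 1][MAXPH + 1], i, j;

    for (i = 0; i <= na; i++) d[i][0] = i;
    for (j = 0; j <= nb; j++) d[0][j] = j;
    for (i = 1; i <= na; i++)
        for (j = 1; j <= nb; j++) {
            int sub = d[i-1][j-1] + (strcmp(a[i-1], b[j-1]) != 0);
            int del = d[i-1][j] + 1;
            int ins = d[i][j-1] + 1;
            int m   = sub < del ? sub : del;
            d[i][j] = m < ins ? m : ins;
        }
    return d[na][nb];
}

int main(void)
{
    /* "salmon", once with the (spurious) /L/, once per the dictionary */
    const char *computed[] = { "S", "AE", "L", "M", "AH", "N" };
    const char *dict[]     = { "S", "AE", "M", "AH", "N" };
    int dist = edit_dist(computed, 6, dict, 5);

    printf("%s (distance %d)\n",
           dist <= 1 ? "semi-acceptable" : "flag it", dist);
    return 0;
}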
I don't have ready access to the original data from which the rules were
derived (nor the "private notes" by which they decided to trade off
performance of one rule vs. another in certain instances).
[It's actually fascinating to look at word spellings in detail and the
big differences in their pronunciations! E.g., water/pater/later;
valentine/aborigine/clandestine; etc.]
Does this approach seem to make sense? I.e., tag each input that
tickles a suspicious rule and try to resolve the problems by
"staring at them"? Any other suggestions that might be more
productive? Esp. given that we each view words as having specific
pronunciations and, without religiously consulting a "reference",
can easily dismiss what *appears* to be a problem as a NON problem
(e.g., most folks seem to mispronounce "salmon" so wouldn't notice
if the algorithm ALSO mispronounced it!)
[N.B. when I refer to examining the "flagged output", I don't mean
*audio* output but, rather, phonemic transcriptions of the input]
Thx!
PS: I didn't bother with the *.speech.* groups as they all appear to
be moribund.
Resolving typos in research pubs
Started by Don Y, January 23, 2015

Reply by glen herrmannsfeldt, January 23, 2015
Don Y <this@is.not.me.com> wrote:

> Sorry for the poor choice of subject line. :(
> I have a few different speech synthesizers that I've been refining
> for a product. All are intended to be *very* lightweight (minimal
> run-time resources). Voice quality, pronunciation, etc. aren't essential
> (but I don't want to deliberately hamper performance).

(snip)

> Most of the work I'm adopting is decades old. Current trends rely
> on having *lots* of resources available (big dictionaries, MIPS, etc.)
> so they've all gone off in a different direction. So, contacting
> original authors is a dubious proposition ("Hey, do you remember that
> work you did 30 years ago? I've got some niggly little detail that
> I need help resolving. Off the top of your head...")

The first, and close to last, time I worked on this problem was summer
1977, what would now be called a summer intern, but we didn't call them
that at the time. The person I was working for had an actual Altair
8800 with non-Altair 64K DRAM. He bought a voice synthesizer S-100 card
for it, and we were trying it out.

It is long enough now that I don't remember if we typed in phonemes or
words. Maybe there was a BASIC program that converted words to phonemes
and an assembly program (which we had to fix up, as we used a different
assembler than most) to run the hardware.

-- glen
Reply by Don Y, January 24, 2015
Hi Glen,

On 1/23/2015 4:15 PM, glen herrmannsfeldt wrote:
> Don Y <this@is.not.me.com> wrote:
>> Most of the work I'm adopting is decades old. Current trends rely
>> on having *lots* of resources available (big dictionaries, MIPS, etc.)
>> so they've all gone off in a different direction. So, contacting
>> original authors is a dubious proposition ("Hey, do you remember that
>> work you did 30 years ago? I've got some niggly little detail that
>> I need help resolving. Off the top of your head...")
>
> The first, and close to last, time I worked on this problem was
> summer 1977, what would now be called a summer intern, but we didn't
> call them that at the time. The person I was working for had an
> actual Altair 8800 with non-Altair 64K DRAM. He bought a voice
> synthesizer S-100 card for it, and we were trying it out.

Yes, most of the "classic" work dates from the late 60's to the early
80's (i.e., when "computers" were common enough to be accessible for
this sort of thing yet still "hog-tied" in terms of real resources).
The card you reference was probably a "discrete" formant synthesizer
similar to that which Gagnon produced (the Votrax "board set" which
later became the Artic/SSI SC-01/2 "chips").

[I'd love to get my hands on a VS6.3 board set but it's not worth the
time to chase one down -- "just to reminisce"...]

IIRC, pure software (formant) synthesizers weren't really practical
until closer to 1980 (Klatt et al.).

> It is long enough now that I don't remember if we typed in phonemes
> or words. Maybe there was a BASIC program that converted words to
> phonemes and an assembly program (which we had to fix up, as we used
> a different assembler than most) to run the hardware.

The most common "public" text-to-phoneme algorithm of that era had to
be the NRL ruleset. You could cram the entire ruleset into about 2.5KB
and the algorithm to drive it was relatively simple/straightforward.
And, crude as all hell! (no inflection control, prosody, etc.)

I posed my problem last night ("Boys night out") and got a couple of
suggestions that I will follow up on as time permits. One was
particularly interesting and, from the notes on the cocktail napkins I
fished out of my pocket this morning, looks like it should give me more
than I need with very little work! Always interesting to see how other
minds approach problems! :>

But, I think I will keep the "throw lots of data at it" approach on
hand and, instead of using it to help resolve the
ambiguities/omissions in the published documents, will use it to help
*evaluate* the different algorithms that I've implemented. The trick,
then, will be to come up with a "scoring" criterion to allow for a
"fair" comparison of performance. Maybe *literally* compare results to
some authoritative PRONOUNCING DICTIONARY!

It will also be a good way to quantify run-time performance (resource
utilization).
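For anyone playing along at home: the NRL rules have the shape
left-context [ fragment ] right-context = phonemes. You scan the word
left to right, take the *first* rule that matches at the current
position, emit its phonemes, and advance past the fragment. A toy
sketch of the driver in C -- from memory, so verify against the report;
the rules here are made up, and only literal contexts are handled,
whereas the real ruleset also has class symbols like '#' (vowels) and
'^' (consonant):

#include <stdio.h>
#include <string.h>

struct nrl_rule {
    const char *left;      /* literal left context ("" = any)  */
    const char *frag;      /* the bracketed text fragment      */
    const char *right;     /* literal right context ("" = any) */
    const char *phonemes;
};

/* A toy excerpt -- NOT the published rules. Order matters. */
static struct nrl_rule rules[] = {
    { "", "AR", "", "AA R" },
    { "", "A",  "", "AE"   },
    { "", "R",  "", "R"    },
    { "", "T",  "", "T"    },
};
#define NRULES (sizeof rules / sizeof rules[0])

static int left_matches(const char *word, int pos, const char *ctx)
{
    size_t n = strlen(ctx);
    return (size_t)pos >= n && strncmp(word + pos - n, ctx, n) == 0;
}

static void transcribe(const char *word)
{
    int pos = 0, len = (int)strlen(word);

    while (pos < len) {
        size_t i;
        for (i = 0; i < NRULES; i++) {
            size_t fn = strlen(rules[i].frag);
            if (strncmp(word + pos, rules[i].frag, fn) != 0)
                continue;
            if (!left_matches(word, pos, rules[i].left))
                continue;
            if (strncmp(word + pos + fn, rules[i].right,
                        strlen(rules[i].right)) != 0)
                continue;
            printf("%s ", rules[i].phonemes);
            pos += (int)fn;            /* advance past the fragment */
            break;
        }
        if (i == NRULES)
            pos++;                     /* no rule: skip the character */
    }
    putchar('\n');
}

int main(void)
{
    transcribe("TART");   /* -> T AA R T with the toy rules above */
    return 0;
}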
Reply by glen herrmannsfeldt, January 24, 2015
Don Y <this@is.not.me.com> wrote:
>> Don Y <this@is.not.me.com> wrote:
>>> Most of the work I'm adopting is decades old. Current trends rely
>>> on having *lots* of resources available (big dictionaries, MIPS, etc.)
>>> so they've all gone off in a different direction. So, contacting
>>> original authors is a dubious proposition ("Hey, do you remember that
>>> work you did 30 years ago? I've got some niggly little detail that
>>> I need help resolving. Off the top of your head...")

(snip, then I wrote)

>> The first, and close to last, time I worked on this problem was
>> summer 1977, what would now be called a summer intern, but we didn't
>> call them that at the time. The person I was working for had an
>> actual Altair 8800 with non-Altair 64K DRAM. He bought a voice
>> synthesizer S-100 card for it, and we were trying it out.

> Yes, most of the "classic" work dates from the late 60's to the early 80's
> (i.e., when "computers" were common enough to be accessible for this sort
> of thing yet still "hog-tied" in terms of real resources). The card
> you reference was probably a "discrete" formant synthesizer similar to
> that which Gagnon produced (the Votrax "board set" which later became the
> Artic/SSI SC-01/2 "chips").

Now that you say it, Votrax does sound right. It would have been a
single S100 board.

> [I'd love to get my hands on a VS6.3 board set but it's not
> worth the time to chase one down -- "just to reminisce"...]

> IIRC, pure software (formant) synthesizers weren't really
> practical until closer to 1980 (Klatt et al.).

I suppose so, but as well as I remember, it wasn't all that much
hardware.

(snip on not remembering)

> The most common "public" text-to-phoneme algorithm of that era had to be
> the NRL ruleset. You could cram the entire ruleset into about 2.5KB and
> the algorithm to drive it was relatively simple/straightforward. And,
> crude as all hell! (no inflection control, prosody, etc.)

That sounds right.

(snip)

> But, I think I will keep the "throw lots of data at it" approach on hand
> and, instead of using it to help resolve the ambiguities/omissions in the
> published documents, will use it to help *evaluate* the different
> algorithms that I've implemented. The trick, then, will be to come up
> with a "scoring" criterion to allow for a "fair" comparison of
> performance. Maybe *literally* compare results to some authoritative
> PRONOUNCING DICTIONARY!

-- glen
Reply by Don Y, January 24, 2015
On 1/24/2015 12:43 PM, glen herrmannsfeldt wrote:
>> Yes, most of the "classic" work dates from the late 60's to the early 80's
>> (i.e., when "computers" were common enough to be accessible for this sort
>> of thing yet still "hog-tied" in terms of real resources). The card
>> you reference was probably a "discrete" formant synthesizer similar to
>> that which Gagnon produced (the Votrax "board set" which later became the
>> Artic/SSI SC-01/2 "chips").
>
> Now that you say it, Votrax does sound right. It would have been
> a single S100 board.

The Votrax synthesizers were pretty large. E.g., the VS6 was four
(potted) boards -- each about 3"x8" -- in a "chassis" with a power
supply, etc. Chances are, the board you were using was a simpler analog
synthesizer: a couple of noise sources feeding a set of tuned filters
(resonators) that tried to approximate the resonances of the vocal
tract. There were lots of "low resource usage" approaches to speech
synthesis in that time period. Most had pretty dreadful "output" (I
used to joke that the Votrax was the only thing capable of penetrating
concrete walls!) IIRC, Digitalker is also of that approximate vintage.

>> [I'd love to get my hands on a VS6.3 board set but it's not
>> worth the time to chase one down -- "just to reminisce"...]
>
>> IIRC, pure software (formant) synthesizers weren't really
>> practical until closer to 1980 (Klatt et al.).
>
> I suppose so, but as well as I remember, it wasn't all that
> much hardware.

Yes, see above. Even the software-based synthesizers (e.g., Klatt) that
followed weren't "all that much (modeled) hardware". The advantages
they had (besides cost) were more effective control of the
transitioning between sounds (i.e., dynamically retuning the resonators
with knowledge of their *intended* target frequencies and bandwidths).
A bit easier to approach more "natural speech" when you have more
dynamic control.

>> The most common "public" text-to-phoneme algorithm of that era had to be
>> the NRL ruleset. You could cram the entire ruleset into about 2.5KB and
>> the algorithm to drive it was relatively simple/straightforward. And,
>> crude as all hell! (no inflection control, prosody, etc.)
>
> That sounds right.

Other (earlier and later) rulesets were of comparable complexity (in
terms of numbers of rules) but tended to have a less rigid algorithm by
which they were applied. E.g., the NRL ruleset only looked at source
text, so a one-pass design was possible (this is actually one of the
challenges in trying to come up with *truly* low resource
implementations... you don't want to have to buffer entire sentences,
phrases, etc. -- because they can be of nondeterministic length!)
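To make the "one-pass" point concrete: because such rules only ever
examine a bounded character context, a tiny sliding window *is* the
text buffer -- nothing of nondeterministic length ever needs to be
stored. A sketch (the window size is arbitrary; think "max left + right
context the rules need"):

#include <stdio.h>
#include <string.h>

#define WIN 8   /* max left + right context the rules ever examine */

static char win[WIN + 1];   /* +1 keeps it NUL-terminated */
static int  fill;

static void process_head(const char *w)
{
    (void)w;   /* rule lookup on w[0], with w[1..] as lookahead */
}

void feed(int c)             /* called once per input character */
{
    if (fill == WIN) {
        process_head(win);              /* consume the oldest char   */
        memmove(win, win + 1, WIN);     /* slide; copies the NUL too */
        fill--;
    }
    win[fill++] = (char)c;
}

int main(void)
{
    const char *s = "NONDETERMINISTIC LENGTH IS FINE";
    while (*s)
        feed(*s++);
    return 0;
}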
Reply by Rafael Deliano, January 25, 2015
> IIRC, pure software (formant) synthesizers weren't really practical until
> closer to 1980 (Klatt et al.).

To quote from Hertz, "SRS Text-To-Phoneme Rules: A Three Level Rule
Strategy," ICASSP 1981:

"A number of English text-to-phoneme systems exist. MITalk-79, for
example, uses a large morph dictionary and a small set of rules. The
system is very accurate, but it is too large for many applications. The
Naval Research Laboratory (NRL) system, on the other hand, uses only a
small set of rules with no dictionary. These rules require little
storage space, but do not perform with the kind of accuracy that is
desirable for most applications. Other text-to-phoneme strategies
generally lie somewhere between these two extremes."

Klatt was MITalk-79
http://www.embeddedFORTH.de/temp/klatt.pdf (text in German)
The front end was on a TMS320 DSP and much more elaborate than the
Votrax I/II. Notice the large bank of EPROMs.

The source for NRL
http://www.embeddedFORTH.de/temp/elovitz.pdf (text in German)
was originally published in SNOBOL but ported to other languages
http://www.embeddedFORTH.de/temp/morris.pdf

The General Instruments SP0256-AL2 "Votrax clone" had a controller, the
CTS256-AL2 (often found on eBay), that implemented a version of it.

> And, crude as all hell!

Remember the chess computers? There were the brute force machines and a
few "chess knowledge" machines that were not successful. Speech is
rather unstructured. The speech recognition systems in the '70s that
were rule-based "artificial intelligence" failed too.

I would say MIT started with something like NRL in the '60s, patched in
exception after exception, and reworked it into a dictionary/table
system to clean up the mess. You are on that road too.

Regards, JRD
Reply by Rafael Deliano, January 25, 2015
> IIRC, Digitalker is also of that approximate vintage.

There wasn't a text-to-speech system based on it as far as I know.

http://www.embeddedFORTH.de/temp/mozer.pdf (text in German)
http://www.embeddedforth.de/temp/Demo.pdf (text in German)

Inflection control and prosody are much harder in a time-domain front
end. Commercially, Digitalker was more usable than the Votrax or even
LPC. Speech quality was fine for female voices like the one used in the
Audi Quattro car. That was usually not true for LPC.

A text-to-speech system based on LPC looks easier, and TI seems to have
worked on it:
http://www.embeddedFORTH.de/temp/Articles on TI Speech Synthesis.pdf
(page 13)
Perhaps there was an implementation on their home computer.

--------

As for small (8 bit) embedded controllers, my view is that
text-to-speech is less practical than prerecorded words with flat
intonation like Digitalker. If the sentence is short -- "channel -
four - is - on" -- then flat robotic intonation is OK.

The standard application vocabulary has been the talking clock:
http://www.embeddedforth.de/temp/clock.pdf (text in German)
That goes back to Edison.

While one can use PCM or ADPCM, I would say that CVSD is much more
appropriate, because the bitrate can be switched more easily.

Regards, JRD
Reply by Don Y, January 25, 2015
On 1/25/2015 1:48 AM, Rafael Deliano wrote:
>> IIRC, pure software (formant) synthesizers weren't really practical until
>> closer to 1980 (Klatt et al.).
>
> To quote from Hertz, "SRS Text-To-Phoneme Rules: A Three Level Rule
> Strategy," ICASSP 1981:
> "A number of English text-to-phoneme systems exist. MITalk-79, for
> example, uses a large morph dictionary and a small set of rules. The
> system is very accurate, but it is too large for many applications.
> The Naval Research Laboratory (NRL) system, on the other hand, uses
> only a small set of rules with no dictionary. These rules require
> little storage space, but do not perform with the kind of accuracy
> that is desirable for most applications. Other text-to-phoneme
> strategies generally lie somewhere between these two extremes."

Yes, no news here...

> Klatt was MITalk-79
> http://www.embeddedFORTH.de/temp/klatt.pdf (text in German)
> The front end was on a TMS320 DSP and much more elaborate than the
> Votrax I/II. Notice the large bank of EPROMs.

_From Text to Speech: The MITalk System_ discusses Allen, Hunnicutt and
Klatt's work -- in reasonable detail. Other papers are available with
clues as to more of what's under the hood. E.g., it was in Hunnicutt's
ruleset that I was fighting typos. Note that the Votrax VS6 predates
KlattTalk (MITalk, DECtalk).

[I have a DTC-1, DECtalk Express, Type 'N' Talk, IntexTalker, PSS,
SC01A, SP0256/CTS256, Digitalker, etc. -- and that doesn't count the
*software*-based synthesizers! I've been at this for a few decades...]

> The source for NRL
> http://www.embeddedFORTH.de/temp/elovitz.pdf (text in German)
> was originally published in SNOBOL but ported to other languages
> http://www.embeddedFORTH.de/temp/morris.pdf

Many of the ports failed to understand the subtleties inherent in
SNOBOL and, as such, have bugs in their pattern matching algorithms.
Likewise, many of the ports of Klatt's synthesizer failed to understand
what was *really* happening under the hood and just blindly tried to
reimplement his (FORTRAN) implementation.

> The General Instruments SP0256-AL2 "Votrax clone" had a controller,
> the CTS256-AL2 (often found on eBay), that implemented a version of it.
>
>> And, crude as all hell!
>
> Remember the chess computers? There were the brute force machines
> and a few "chess knowledge" machines that were not successful.
> Speech is rather unstructured. The speech recognition systems in the
> '70s that were rule-based "artificial intelligence" failed too.

*General* speech (and, with that, only talking about *English* -- other
languages are far more "well behaved") is unstructured. But, that
doesn't apply to *all* speech.

> I would say MIT started with something like NRL in the '60s, patched
> in exception after exception, and reworked it into a dictionary/table
> system to clean up the mess.

Lee and Allen both pursued speech (with similar goals -- a reading
machine for the blind). Note that Kurzweil opted to *use* the VS6 in
The Reading Machine (though there was work on trying to enhance a
Digitalker to bring the design "in-house" -- Votrax board sets were
expensive, crude, and tended to fizzle out for unknown reasons; being
potted meant you had no option but to return them to the factory for a
replacement). Interesting that he did *not* pursue MITalk/KlattTalk
given that they were half a dozen city blocks away (perhaps DEC was
already involved in that venture). OTOH, the "Personal Reader" has a
custom DECtalk implementation within.

Note that folks who listened to the KRM (Votrax) soon developed a much
improved sense of comprehension. As its speech patterns were
methodical, you could learn them and exploit that to enhance your
understanding (experienced users would complain that the machine wasn't
*fast* enough -- even when set to speak at its maximum speed! IIRC,
about 300 WPM?)

> You are on that road too.

You appear to have missed the comment in my initial post:

   "As I can operate them in semi-limited domains (since I *tend*
   --------------------------^^^^^^^^^^^^^^^^^^^^
   to be the source of the text they utter), I've opted to adopt
   some of the classic approaches to the text-to-phoneme portion
   of the algorithms"

so, I am clearly *not* on THAT road! :> As I said later in that same
post:

   "even bloated synthesizers have problems with speech so why
   waste effort trying to *approach* their performance levels
   with 1% of *their* resources?!"
Reply by Don Y, January 25, 2015
On 1/25/2015 2:45 AM, Rafael Deliano wrote:
>> IIRC, Digitalker is also of that approximate vintage.
>
> There wasn't a text-to-speech system based on it as far as I know.
>
> http://www.embeddedFORTH.de/temp/mozer.pdf (text in German)
> http://www.embeddedforth.de/temp/Demo.pdf (text in German)
>
> Inflection control and prosody are much harder in a time-domain
> front end. Commercially, Digitalker was more usable than the Votrax
> or even LPC. Speech quality was fine for female voices like the one
> used in the Audi Quattro car. That was usually not true for LPC.

Digitalker and Votrax addressed different markets. Unconstrained text
with a Digitalker would be impractical. OTOH, the Votrax could at least
make a *stab* at it (subject to how well you could map glyphs to
phonemes and control the inflection with your algorithm). E.g., the
VS6.G could also *sing*!

> A text-to-speech system based on LPC looks easier, and TI seems to
> have worked on it:
> http://www.embeddedFORTH.de/temp/Articles on TI Speech Synthesis.pdf
> (page 13)
> Perhaps there was an implementation on their home computer.

The 9900 had a synthesizer. LPC is also for limited domain
applications. I.e., great for "The cow goes 'Moooo'!" (Speak n Spell)

> As for small (8 bit) embedded controllers, my view is that
> text-to-speech is less practical than prerecorded words with flat
> intonation like Digitalker. If the sentence is short -- "channel -
> four - is - on" -- then flat robotic intonation is OK.

Recording speech is the most limited domain approach possible. You have
to know *every* utterance that you are likely to encounter "at run
time". As you say below, it really only makes sense for small, limited
vocabularies (like clocks!)

> The standard application vocabulary has been the talking clock:
> http://www.embeddedforth.de/temp/clock.pdf (text in German)
> That goes back to Edison.
>
> While one can use PCM or ADPCM, I would say that CVSD is much more
> appropriate, because the bitrate can be switched more easily.

CVSD was used in video games in the early 80's. Even clocking the
devices at their maximum rates left much to be desired in the quality
of the speech that resulted.

A formant-based synthesizer is very practical with current technology.
And, simple text-to-phoneme algorithms can give a lot of coverage for
"nominal" utterances. The two, combined, seem to be the most effective
way to get speech in a small footprint. Even simple algorithms for
imposing prosodic structure on complex phrases are relatively easy and
effective.

But, all this presupposes "nominal" text. E.g., reading source code
listings would end up sounding like Qbert! This is where having
"semi-limited" domains can leverage the basic performance of "cheap"
synthesis without stumbling over the complexities in more general
speech:

   "Dr. Reed had already read the Polish language book that he
   was reading when I drove up to his house on Reading Dr. and
   caught him polishing off a small brandy."

Fit the solution to the problem.
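By "simple algorithms" for prosody I mean things on this order: start
the pitch high, let it decline linearly across the phrase, bump it on
stressed syllables, fall at a period / rise at a question mark. A
sketch with arbitrary numbers (the shape is the point, not the values):

#include <stdio.h>

static void assign_f0(int n, const int *stressed, char punct, float *f0)
{
    float top = 130.0f, bottom = 95.0f;        /* Hz; "male-ish" voice */
    int i, span = (n > 1) ? n - 1 : 1;

    for (i = 0; i < n; i++) {
        f0[i] = top - (top - bottom) * i / span;   /* declination */
        if (stressed[i])
            f0[i] += 15.0f;                        /* stress bump */
    }
    if (punct == '?')
        f0[n - 1] += 25.0f;                        /* terminal rise */
    else
        f0[n - 1] = bottom - 10.0f;                /* terminal fall */
}

int main(void)
{
    int   stress[4] = { 1, 0, 1, 0 };
    float f0[4];
    int   i;

    assign_f0(4, stress, '.', f0);
    for (i = 0; i < 4; i++)
        printf("syllable %d: %.0f Hz\n", i, f0[i]);
    return 0;
}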
Reply by Rafael Deliano, January 25, 2015
> CVSD was used in video games in the early 80's. Even clocking
> the devices at their maximum rates left much to be desired in the
> quality of the speech that resulted.

CVSD at 16 kbit/sec, as the military used it, was poor. We used CVSD at
24 kbit/sec in answering machines sold by Philips and AEG for cars in
1985. These were the early analog cellular radios for the telephone
network here in Germany. CVSD is noisy, but the analog radio link was
noisy too, so the public didn't mind. Simple chips from Harris and CML,
nothing complicated. This was a viable application with many thousands
of units sold. The company had earlier tried to sell products based on
the SC01A, but there was no market for that sort of speech.

> "Dr. Reed had already read the Polish language book that he
> was reading when I drove up to his house on Reading Dr. and
> caught him polishing off a small brandy."

I doubt that messages of that length are typical for commercial
applications.

> You have to know *every* utterance that you are likely to
> encounter "at run time".

The typical commercial speech output applications do not have an "at
run time" requirement. The biggest vocabulary I can think of is speech
output in GPS car navigation. Digitalker demo boards had the complete
vocabulary in ROM, but that predates FLASH. For a small fixed-language
application, the user has to have access to a database that contains
"all" words. He selects what he needs and downloads them to his
hardware.

Of course, there is the old problem of getting the words done in a
recording studio. But (copyright niceties aside) I have found language
trainer CDs, where someone utters a word in German and then in English,
to be reasonably good testing material. True, "the most common 2000
words of the English language" do not contain everything the
application vocabulary requires. But then some butchering...

Regards, JRD
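PS: The charm of CVSD here is that the decoder is a dozen lines and the
bitrate is just the sample clock. A sketch in C -- the constants are
ballpark values of mine, not taken from any particular chip's
datasheet:

#include <stdio.h>

static int   history;          /* last 3 received bits   */
static float slope = 10.0f;    /* current step size      */
static float out;              /* integrator state       */

/* Feed one bit, get one sample back. */
float cvsd_decode(int bit)
{
    history = ((history << 1) | (bit & 1)) & 7;

    if (history == 0 || history == 7)           /* 3 equal bits...  */
        slope += 40.0f;                         /* ...steepen       */
    else
        slope = slope * 0.98f + 10.0f * 0.02f;  /* decay to minimum */

    out = out * 0.99f + (bit ? slope : -slope); /* leaky integrator */
    return out;
}

int main(void)
{
    int i;
    for (i = 0; i < 8; i++)        /* alternating bits ~= silence */
        printf("%.1f\n", cvsd_decode(i & 1));
    return 0;
}

Run the same loop at 16 or 24 kbit/sec and nothing else changes; that
is the "easily switched" bitrate.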