Robert Wessel <robertwessel2@yahoo.com> wrote:

> On Sat, 09 Sep 2017 12:37:46 GMT, raltbos@xs4all.nl (Richard Bos)
> >Robert Wessel <robertwessel2@yahoo.com> wrote:
> >
> >> A problem is when a change needs to be made to only some translations
> >> of a message.  Let's say the English one was awkwardly worded, and
> >> thus modified, but (some of) the other translations don't need to be
> >> changed (although they should probably be reviewed).
> >
> >I'd like to point out that none of that is a technical problem. It's a
> >management (and sometimes political) problem, and therefore needs a
> >solution at a management level, not primarily a technical one. There may
> >be technical _aids_ to the managerial solution, but not a technical
> >solution /per se/.
> 
> Sure, but you can say the same about source code and configuration
> management too.

Yes, and I do. If you can't do a desk check, but require a colour-coding
IDE to read your code, you're not a programmer, you're a hack. Liking
syntax colouring is fine, it's needing it which is the problem: the IDE
should aid your code comprehension, not create it.

Similarly, if you can't work with two people on one problem without
needing to go through Github, you don't have enough social skills to be
a team player. Two thousand is another matter, but there, too, the basic
problem is organising the teams, _not_ setting up the Git repository.

Richard

On Sat, 09 Sep 2017 12:37:46 GMT, raltbos@xs4all.nl (Richard Bos)
wrote:

>Robert Wessel <robertwessel2@yahoo.com> wrote:
>
>> On Tue, 05 Sep 2017 15:17:56 -0700, Keith Thompson <kst-u@mib.org>
>> 
>> >Robert Wessel <robertwessel2@yahoo.com> writes:
>
>> >> Well yes, (and ours are text based), but I've yet to see a SCM that
>> >> can tell you that someone updated the English version of message#14,
>> >> but hasn't validated or updated the Italian one yet.  You can
>> >> certainly get a diff and manually see where changes have been made,
>> >> but that still leaves you with a manual comparison to the
>> >> translations, and no way of tracking that the validations have been
>> >> done.
>> >
>> >The "git blame" command tells you, for each line in a file, when that
>> >line was most recently modified.  Other SCMs have similar tools.
>> >I imagine you could build some tools on top of that that could
>> >warn you, for example, that the English version of message #14 was
>> >updated yesterday but the Italian version hasn't been changed in
>> >the last year.
>> 
>> A problem is when a change needs to be made to only some translations
>> of a message.  Let's say the English one was awkwardly worded, and
>> thus modified, but (some of) the other translations don't need to be
>> changed (although they should probably be reviewed).
>
>I'd like to point out that none of that is a technical problem. It's a
>management (and sometimes political) problem, and therefore needs a
>solution at a management level, not primarily a technical one. There may
>be technical _aids_ to the managerial solution, but not a technical
>solution /per se/.


Sure, but you can say the same about source code and configuration
management too.

Robert Wessel <robertwessel2@yahoo.com> wrote:

> On Tue, 05 Sep 2017 15:17:56 -0700, Keith Thompson <kst-u@mib.org>
> 
> >Robert Wessel <robertwessel2@yahoo.com> writes:

> >> Well yes, (and ours are text based), but I've yet to see a SCM that
> >> can tell you that someone updated the English version of message#14,
> >> but hasn't validated or updated the Italian one yet.  You can
> >> certainly get a diff and manually see where changes have been made,
> >> but that still leaves you with a manual comparison to the
> >> translations, and no way of tracking that the validations have been
> >> done.
> >
> >The "git blame" command tells you, for each line in a file, when that
> >line was most recently modified.  Other SCMs have similar tools.
> >I imagine you could build some tools on top of that that could
> >warn you, for example, that the English version of message #14 was
> >updated yesterday but the Italian version hasn't been changed in
> >the last year.
> 
> A problem is when a change needs to be made to only some translations
> of a message.  Let's say the English one was awkwardly worded, and
> thus modified, but (some of) the other translations don't need to be
> changed (although they should probably be reviewed).

I'd like to point out that none of that is a technical problem. It's a
management (and sometimes political) problem, and therefore needs a
solution at a management level, not primarily a technical one. There may
be technical _aids_ to the managerial solution, but not a technical
solution /per se/.

Richard

On Fri, 08 Sep 2017 10:43:48 -0400, Richard Damon wrote:
> On 9/5/17 3:30 AM, pozz wrote:

>> The _() function will search for the translated string based on the
>> current language. If it can't find one, it could return the first
>> translation (English).
>> 
>> This approach has some disadvantages. It's difficult to exclude one
>> language from the build. If there are more than a couple of languages,
>> the strings will be very long. The order of the translations (first
>> English, then Italian, ...) is important and you have to remember it
>> for every string.
>> 
>> 
>> What approach do you use?
> 
> If you really want to look up by string, then your _ function just needs
> to search the translation table for that value and return the desired
> translation, instead of being passed an index. The lookup will
> take a bit of time, but not much given your number of strings, and
> if you sort the strings by the base translation, you could use a binary
> search to find it.

This may have been said already (who can tell, here), but the gettext
convention

    _("Some string that has to be translated")

can make it easy for a preprocessor of your own to pick out these
strings, manage a database of translations, and substitute the
translations into individual builds for each language.  The '_' can just
be a syntactic marker -- it doesn't need to be a callable function at all.
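
A minimal sketch of that "marker only" variant, assuming a build step of your own (not shown) that extracts the marked strings and substitutes the translated text for each language. The names menu_items and menu_item are made up for illustration and are not part of gettext:

```c
/* Marker only: expands to its argument, so it costs nothing at run time.
   An extraction tool of your own (assumed, not shown) scans for _("..."). */
#define _(text) (text)

/* The marker works in static initializers as well as in expressions. */
static const char *const menu_items[] = { _("Open"), _("Close") };

/* Return the menu text for index, or a placeholder if index is out of range. */
const char *menu_item(unsigned index)
{
    return (index < sizeof menu_items / sizeof menu_items[0])
               ? menu_items[index]
               : _("?");
}
```

Because the macro leaves the string literal in place, the source reads exactly as it did before and no lookup code ends up in the image.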

On 08/09/17 15:48, bartc wrote:
> On 08/09/2017 12:37, David Brown wrote:
>> On 08/09/17 12:32, bartc wrote:
> 
>>> The method is basically this, assuming you have tables of messages for
>>> all languages in memory:
>>>
>>> * Take an English message M
>>> * Look it up in the English table, to get index N (with 100 messages,
>>>    a linear search will do)
>>> * If N is in range, return the string from table[N] for language L
>>> * If M wasn't found, just return M, the English version.
>>>
>>> So, probably a 10 or 20 line function.
>>
>> That can be okay for a starting point, but it has a /big/ problem - you
>> only get one entry for each original English language message.  When you
>> are translating messages, it is not uncommon to encounter different
>> messages with the same text in the original language but different texts
>> in the translations.  In gettext, this is handled by attaching a message
>> context (msgctxt, via pgettext()) to the lookup key.
> 
> The OP said there are 50-100 messages. Then any clashes (of the same
> English text with different meanings) can be handled manually.
> 
> But my scheme with reference numbers can fix that. It can be extended
> to annotate messages, giving a general method of disambiguating messages
> with multiple meanings.

Indeed it will handle it.  But it means you have to have numbers in the
code, and match it up with numbers in the translation files.  Once you
start having that sort of thing, you lose the benefits of having a
simple direct text in the code.  So you might as well cut out that text
in the code and put it in the messages file.  And then you might as well
use an enumerated type - then instead of arbitrary numbers with no
connection to indexing and manual checking for collisions, you have a
header file with the enumerated type defined, symbols with useful names
(like "str_hello_world"), checking by the compiler for errors, automatic
completion from within your IDE, and fast and simple lookup in the
actual table.
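
One way to get the enumerated type and the table from a single list is an X-macro. This is a swapped-in technique, not something proposed in the thread, and every name below (MESSAGE_LIST, msg, set_language, the enum names) is an assumption made for the sketch. It needs C99 designated initializers:

```c
/* Hypothetical message list: one line per message, one column per language. */
#define MESSAGE_LIST(X) \
    X(str_hello_world, "Hello world!", "Ciao mondo!") \
    X(str_how_are_you, "How are you?", "Come stai?")

enum lang { LANG_ENGLISH, LANG_ITALIAN, LANG_COUNT };

/* Expand the list once into the enumerators... */
#define AS_ENUM(name, en, it) name,
enum str_id { MESSAGE_LIST(AS_ENUM) STR_COUNT };

/* ...and once into the table rows, so the two cannot drift apart. */
#define AS_ROW(name, en, it) [name] = { en, it },
static const char *const messages[STR_COUNT][LANG_COUNT] = {
    MESSAGE_LIST(AS_ROW)
};

static enum lang current_lang = LANG_ENGLISH;

/* Select the active language; out-of-range values are ignored. */
void set_language(enum lang l)
{
    if ((unsigned)l < LANG_COUNT)
        current_lang = l;
}

/* Direct table index: no searching at run time. */
const char *msg(enum str_id id)
{
    return ((unsigned)id < STR_COUNT) ? messages[id][current_lang] : "";
}
```

A misspelled symbol such as msg(str_helo_world) is then a compile error rather than a silent fallback.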

> 
>>>
>>> This would require two copies of each English message, one in the
>>> source, and one in a searchable table. And that needs maintenance.
>>>
>>> You might be able to get around that by embedding a serial number in
>>> each English message:
>>>
>>>     puts(_("Please enter filename: !078"));
>>>
>>> Here the "!078" is the number and does not appear in the output (or you
>>> can use {78} etc, any scheme will do).
>>
>> Maintenance of the string numbers here is a hassle.
> 
> No, the numbers can be anything, including any text, or can be
> annotations. 

Yes, it is a hassle - you have to be sure there are no conflicts, and
you have to match them up in your translation file.  That's easy for a
small program, but scales poorly and cannot be checked by the compiler.

> But in this scheme, every message must have an annotation,
> and that will change the appearance of the message within the source.
> 
> 
>>> Another reason to forget using anyone else's library.
>>
>> No, it is another reason to look at the licensing before using other
>> libraries.  People write libraries with the intention of letting other
>> people use them - you just need to make sure the licensing is suitable.
> 
> How does that library deal with the issues of extracting the messages in
> a format that can be submitted to a translator (who might be in a
> different country), and what format are they sent back in, or submitted
> to the program?

Do you mean gettext in particular, or some arbitrary library in general?
 The licensing issue was a general point.

gettext comes with tools to aid translating and maintaining the
translation files.  I'll let you look up the details - there is little
point in having me copy-and-paste stuff off the web.

> 
> What about when the program is revised, and messages are deleted, added
> or modified?
> 
> How does it deal with multiple instances of the same message that differ
> only in leading or trailing punctuation or capitalisation? Do multiple
> messages have to be provided?
> 
> What about the problem raised above of the same English words having a
> different meaning depending on context?

It is all handled by gettext.  Whether you like or dislike the way it is
handled, is up to you.  Again, look up the details if you want.

> 
> I looked at the docs for gettext and they run to 275 pages in PDF format;
> 378 pages in Word. How many messages did the OP want to deal with again?
> 

Since gettext is quite big, and has a license unsuitable for most
embedded software, it is unlikely to be the answer for the OP.  It is a
/heavyweight/ solution.  It might help inspire ideas for the OP, but it
is not a practical choice for him.  It is, however, an excellent choice
for many other projects and programs.  And there are several other
gettext-like libraries around, which might be a workable choice
depending on what the OP likes.

> (The scheme I outlined in my first post in the thread dealt with all
> this. And it totalled a few hundred lines of code. Actually I don't
> think I needed the translations at all; they could be loaded locally
> from a file, with an existing version of the application, so no
> intervention was needed.)
> 

The scheme you outlined is a possibility, and I'm sure the OP will
consider it.  It is not the way /I/ would do it (I mentioned that in my
first reply in the thread), but it could work.

On 9/5/17 3:30 AM, pozz wrote:
> [This message is posted to comp.arch.embedded and comp.lang.c]
> 
> Just for reference, the target is an embedded platform based on an MCU
> with integrated Flash, for example a Cortex-Mx device.
> Here I consider only Western languages (left-to-right, European
> characters: English, French, German, Spanish and so on).
> 
> The main problem is the translation of strings, maybe 10-100 strings.
> 
> I know something about the gettext package, which can't be used on those
> embedded platforms. However, I like the approach of gettext.
> 
>    print_to_display(x, y, "Hello world!");
> 
> is simply changed to:
> 
>    #include <gettext.h>
>    ...
>    print_to_display(x, y, _("Hello world!"));
> 
> In this way, the code stays as readable as it was before introducing
> multi-language support. If a member of a structure needs a string, it is
> a char * as usual.
> 
> The solution I found for embedded platforms is to use an array of arrays
> of strings: one index for the string and one index for the language.
> 
>    enum lang_t { ENGLISH, ITALIAN, LANG_N };
>    enum string_t { STR_HELLO_WORLD, STR_HOW_ARE_YOU, STRING_N };
> 
>    const char *strings[STRING_N][LANG_N] = {
>        /* STR_HELLO_WORLD */
>        { "Hello world!", "Ciao mondo" },
>        /* STR_HOW_ARE_YOU */
>        { "How are you?", "Come stai?" },
>    };
> 
>    static enum lang_t lang = ENGLISH;
> 
>    const char *_(int string_idx) {
>        return strings[string_idx][lang];
>    }
> 
>    void set_language(enum lang_t new_language) {
>        lang = new_language;
>    }
> 
> 
> I don't like this approach much, for two reasons.
> The first is that the line:
> 
>    print_to_display(x, y, _(STR_HELLO_WORLD));
> 
> is much less readable than
> 
>    print_to_display(x, y, _("Hello world!"));
> 
> 
> The second is that I need to change the type of some members/variables
> from char * to int:
> 
>    struct mystruct {
>        int title;  // Instead of char *title
>        ...
>    };
> 
> 
> Another approach I'm thinking of is to embed all the translations in the
> string, using a separator character that can't appear in normal strings.
> 
>    print_to_display(x, y, _("Hello world!|Ciao Mondo!"));
> 
> The _() function will search for the translated string based on the
> current language. If it can't find one, it could return the first
> translation (English).
> 
> This approach has some disadvantages. It's difficult to exclude one
> language from the build. If there are more than a couple of languages,
> the strings will be very long. The order of the translations (first
> English, then Italian, ...) is important and you have to remember it for
> every string.
> 
> 
> What approach do you use?

If you really want to look up by string, then your _ function just needs
to search the translation table for that value and return the desired
translation, instead of being passed an index. The lookup will
take a bit of time, but not much given your number of strings, and
if you sort the strings by the base translation, you could use a binary
search to find it.
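
A minimal sketch of that lookup using the standard bsearch(), assuming a two-language table kept sorted by the English text. The names translation, table, tr and tr_set_italian are made up for the example, and the Italian strings are only placeholders:

```c
#include <stdlib.h>
#include <string.h>

struct translation {
    const char *english;   /* lookup key: the string written in the source */
    const char *italian;
};

/* Must stay sorted by the English text (strcmp order) for bsearch(). */
static const struct translation table[] = {
    { "Hello world!",            "Ciao mondo!" },
    { "How are you?",            "Come stai?" },
    { "Please enter filename: ", "Inserire il nome del file: " },
};

static int use_italian = 0;

void tr_set_italian(int on) { use_italian = on; }

/* bsearch() passes the key first, then a pointer to a table element. */
static int compare_english(const void *key, const void *elem)
{
    const struct translation *t = elem;
    return strcmp((const char *)key, t->english);
}

/* Return the translation of english, or english itself if none is found. */
const char *tr(const char *english)
{
    const struct translation *hit;

    if (!use_italian)
        return english;
    hit = bsearch(english, table, sizeof table / sizeof table[0],
                  sizeof table[0], compare_english);
    return hit ? hit->italian : english;
}
```

With 10-100 strings a linear scan would do just as well; the sorted table mainly matters if the list grows.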

On 08/09/2017 12:37, David Brown wrote:
> On 08/09/17 12:32, bartc wrote:

>> The method is basically this, assuming you have tables of messages for
>> all languages in memory:
>>
>> * Take an English message M
>> * Look it up in the English table, to get index N (with 100 messages,
>>    a linear search will do)
>> * If N is in range, return the string from table[N] for language L
>> * If M wasn't found, just return M, the English version.
>>
>> So, probably a 10 or 20 line function.
> 
> That can be okay for a starting point, but it has a /big/ problem - you
> only get one entry for each original English language message.  When you
> are translating messages, it is not uncommon to encounter different
> messages with the same text in the original language but different texts
> in the translations.  In gettext, this is handled by attaching a message
> context (msgctxt, via pgettext()) to the lookup key.

The OP said there are 50-100 messages. Then any clashes (of the same 
English text with different meanings) can be handled manually.

But my scheme with reference numbers can fix that. It can be extended
to annotate messages, giving a general method of disambiguating messages
with multiple meanings.

>>
>> This would require two copies of each English message, one in the
>> source, and one in a searchable table. And that needs maintenance.
>>
>> You might be able to get around that by embedding a serial number in
>> each English message:
>>
>>     puts(_("Please enter filename: !078"));
>>
>> Here the "!078" is the number and does not appear in the output (or you
>> can use {78} etc, any scheme will do).
> 
> Maintenance of the string numbers here is a hassle.

No, the numbers can be anything, including any text, or can be
annotations. But in this scheme, every message must have an annotation,
and that will change the appearance of the message within the source.

>> Another reason to forget using anyone else's library.
> 
> No, it is another reason to look at the licensing before using other
> libraries.  People write libraries with the intention of letting other
> people use them - you just need to make sure the licensing is suitable.

How does that library deal with the issues of extracting the messages in 
a format that can be submitted to a translator (who might be in a 
different country), and what format are they sent back in, or submitted 
to the program?

What about when the program is revised, and messages are deleted, added 
or modified?

How does it deal with multiple instances of the same message that differ 
only in leading or trailing punctuation or capitalisation? Do multiple 
messages have to be provided?

What about the problem raised above of the same English words having a 
different meaning depending on context?

I looked at the docs for gettext and they run to 275 pages in PDF format;
378 pages in Word. How many messages did the OP want to deal with again?

(The scheme I outlined in my first post in the thread dealt with all 
this. And it totalled a few hundred lines of code. Actually I don't 
think I needed the translations at all; they could be loaded locally 
from a file, with an existing version of the application, so no 
intervention was needed.)

-- 
bartc

On 08/09/17 12:32, bartc wrote:
> On 08/09/2017 10:33, pozz wrote:
>> Il 07/09/2017 19:55, Keith Thompson ha scritto:
> 
>>> In a very quick look at the gettext sources, I see that the
>>> gettext-runtime/src subdirectory contains about 1300 lines of C code.
>>> If that's all that needs to run on the target system, you might even be
>>> able to use it without modification.
>>
>> This is a good suggestion. I'm not an expert on gettext, but I
>> remember that it loads and searches for the right strings (based on the
>> current language) at runtime, reading a binary file (.mo extension).
>>
>> In my embedded platform I don't have a real filesystem so I can't
>> access "files" at runtime.
>>
>> Maybe I could add the mo files in the output binary file (the image of
>> the Flash memory of the MCU) at exact locations and change gettext
>> code to look at those fixed addresses instead of accessing files.
>>
>> Anyway thanks for the suggestions.
> 
> gettext looks like a very heavy-duty approach. (A lot of these third
> party solutions are. My experience was that a third party library that
> took care of 5% of the functionality of my application, would be several
> times bigger than my entire app.)

gettext /is/ a heavy-duty approach.  It is designed to separate the
program code and the translation texts, so that they can be written by
different people, compiled separately, distributed separately, and (if
desired) updated separately - because the binary and the translation
files are all separate files.  It is a very useful approach for many
kinds of program - but too big and complex for what the OP wants, I believe.

> 
> The method is basically this, assuming you have tables of messages for
> all languages in memory:
> 
> * Take an English message M
> * Look it up in the English table, to get index N (with 100 messages,
>   a linear search will do)
> * If N is in range, return the string from table[N] for language L
> * If M wasn't found, just return M, the English version.
> 
> So, probably a 10 or 20 line function.

That can be okay for a starting point, but it has a /big/ problem - you
only get one entry for each original English language message.  When you
are translating messages, it is not uncommon to encounter different
messages with the same text in the original language but different texts
in the translations.  In gettext, this is handled by attaching a message
context (msgctxt, via pgettext()) to the lookup key.
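
A small sketch of how the clash can be avoided in a home-grown table: key each entry on a context tag plus the English text. The struct, the tags "menu" and "status", and the function tr_ctx are all assumptions for illustration, not gettext's API:

```c
#include <stddef.h>
#include <string.h>

struct ctx_entry {
    const char *context;   /* hypothetical tag chosen by the programmer */
    const char *english;
    const char *italian;
};

/* Same English text, two meanings, two translations. */
static const struct ctx_entry ctx_table[] = {
    { "menu",   "Open", "Apri"   },  /* verb: open a file */
    { "status", "Open", "Aperto" },  /* adjective: the valve is open */
};

/* Linear search on (context, english); fall back to the English text. */
const char *tr_ctx(const char *context, const char *english)
{
    for (size_t i = 0; i < sizeof ctx_table / sizeof ctx_table[0]; ++i) {
        if (strcmp(ctx_table[i].context, context) == 0 &&
            strcmp(ctx_table[i].english, english) == 0)
            return ctx_table[i].italian;
    }
    return english;
}
```

So tr_ctx("menu", "Open") and tr_ctx("status", "Open") can return different strings even though the source text is identical.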

> 
> This would require two copies of each English message, one in the
> source, and one in a searchable table. And that needs maintenance.
> 
> You might be able to get around that by embedding a serial number in
> each English message:
> 
>    puts(_("Please enter filename: !078"));
> 
> Here the "!078" is the number and does not appear in the output (or you
> can use {78} etc, any scheme will do).

Maintenance of the string numbers here is a hassle.

> 
> Now you just have to search the table for language L for a message with
> the same number. (You don't need to convert to an integer, just compare
> the last few characters.)
> 
> Of course, you need to return a string without the !078 etc in it. For
> that purpose, it might be better to put this number at the start. Then
> you return a pointer to just past the number.
> 
> (See example below using such a scheme. This might give some ideas.)
> 
> You can use the number as an actual index, but the maintenance becomes
> harder.
> 
> There is still the problem of producing a list of English messages for
> translators to work from. But the format and ordering of that is not
> critical.
> 
>>> (Any licensing issues are left as an exercise.)
>>
>> And this is another good point to study.
> 
> Another reason to forget using anyone else's library.

No, it is another reason to look at the licensing before using other
libraries.  People write libraries with the intention of letting other
people use them - you just need to make sure the licensing is suitable.

On 08/09/2017 10:33, pozz wrote:
> Il 07/09/2017 19:55, Keith Thompson ha scritto:

>> In a very quick look at the gettext sources, I see that the
>> gettext-runtime/src subdirectory contains about 1300 lines of C code.
>> If that's all that needs to run on the target system, you might even be
>> able to use it without modification.
> 
> This is a good suggestion. I'm not an expert on gettext, but I
> remember that it loads and searches for the right strings (based on the
> current language) at runtime, reading a binary file (.mo extension).
> 
> In my embedded platform I don't have a real filesystem so I can't access 
> "files" at runtime.
> 
> Maybe I could add the mo files in the output binary file (the image of 
> the Flash memory of the MCU) at exact locations and change gettext code 
> to look at those fixed addresses instead of accessing files.
> 
> Anyway thanks for the suggestions.

gettext looks like a very heavy-duty approach. (A lot of these third 
party solutions are. My experience was that a third party library that 
took care of 5% of the functionality of my application, would be several 
times bigger than my entire app.)

The method is basically this, assuming you have tables of messages for 
all languages in memory:

* Take an English message M
* Look it up in the English table, to get index N (with 100 messages,
   a linear search will do)
* If N is in range, return the string from table[N] for language L
* If M wasn't found, just return M, the English version.

So, probably a 10 or 20 line function.

This would require two copies of each English message, one in the 
source, and one in a searchable table. And that needs maintenance.

You might be able to get around that by embedding a serial number in 
each English message:

    puts(_("Please enter filename: !078"));

Here the "!078" is the number and does not appear in the output (or you
can use {78} etc, any scheme will do).

Now you just have to search the table for language L for a message with 
the same number. (You don't need to convert to an integer, just compare 
the last few characters.)

Of course, you need to return a string without the !078 etc in it. For
that purpose, it might be better to put this number at the start. Then
you return a pointer to just past the number.

(See example below using such a scheme. This might give some ideas.)

You can use the number as an actual index, but the maintenance becomes 
harder.

There is still the problem of producing a list of English messages for 
translators to work from. But the format and ordering of that is not 
critical.

>> (Any licensing issues are left as an exercise.)
> 
> And this is another good point to study.

Another reason to forget using anyone else's library.

--------------------------------------------------------------

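#include <stdio.h>
#include <string.h>

// Each entry is "<number>!<text>"; the number is the lookup key.
const char *italian[] = {
     "1!uno",
     "2!due",
     "3!tre",
     "4!quattro",
     "5!fine",
};
const char *spanish[] = {
     "2!dos",
     "3!tres",
     "1!uno",
     "4!cuatro",
     "5!fin"
};
const char *german[] = {
     "4!vier",         // ordering doesn't matter
     "5!Ende",
     "1!eins",
     "2!zwei",
     "3!drei"
};

//const char **currlang = italian;
//const char **currlang = german;
const char **currlang = spanish;
//const char **currlang = NULL;        // NULL selects English

// All tables are assumed to have the same number of entries.
int nmessages=sizeof(italian)/sizeof(italian[0]);

// Return a pointer just past the "<number>!" prefix, or M if there is none.
const char* skipprefix(const char* M){
     const char *s=M;
     while (*s!='!' && *s!=0) ++s;
     if (*s==0) return M;
     return s+1;
}

// Return the current-language text whose prefix matches that of M,
// or the English text from M if there is no match.
const char* lookup(const char* M){
     const char *s;
     size_t len;
     int i;

     s=skipprefix(M);
     len=(size_t)(s-M);                          // prefix length, including '!'
     if (currlang==NULL || len==0) return s;     // English

     for (i=0; i<nmessages; ++i)
         if (strncmp(currlang[i],M,len)==0)      // stops at a short entry's NUL
             return skipprefix(currlang[i]);
     return s;                                   // not found
}

int main(void) {
     puts(lookup("1!one"));
     puts(lookup("2!two"));
     puts(lookup("3!three"));
     puts(lookup("5!end"));
     return 0;
}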

-- 
bartc

Il 07/09/2017 19:55, Keith Thompson ha scritto:
> pozz <pozzugno@gmail.com> writes:
>> [This message is posted to comp.arch.embedded and comp.lang.c]
>>
>> Just for reference, an embedded platform based on a MCU with
>> integrated Flash, for example a Cortex-Mx device.
>> Here I consider only western languages (left-to-right and european
>> chars, english, french, german, spanish and so on).
>>
>> The main problem is the translation of strings, maybe 10-100 strings.
>>
>> I know something about gettext package that can't be used in those
>> embeded platforms. However I like the approach of gettext.
>>
>>    print_to_display(x, y, "Hello world!");
>>
>> is simply changed in:
>>
>>    #include <gettext.h>
>>    ...
>>    print_to_display(x, y, _("Hello world!"));
> [snip]
> 
> GNU gettext is free software, licensed under GPLv3.  I wonder if you
> could grab a copy of it, remove any functionality you don't need, and
> end up with something small enough to work on your embedded system.
> 
> In a very quick look at the gettext sources, I see that the
> gettext-runtime/src subdirectory contains about 1300 lines of C code.
> If that's all that needs to run on the target system, you might even be
> able to use it without modification.

This is a good suggestion. I'm not an expert on gettext, but I
remember that it loads and searches for the right strings (based on the
current language) at runtime, reading a binary file (.mo extension).

In my embedded platform I don't have a real filesystem so I can't access 
"files" at runtime.

Maybe I could add the mo files in the output binary file (the image of 
the Flash memory of the MCU) at exact locations and change gettext code 
to look at those fixed addresses instead of accessing files.
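
A rough sketch of what such a lookup could look like without using the gettext code at all, assuming the GNU .mo layout described in the gettext manual (magic number 0x950412de, string count at byte 8, offsets of the original and translation tables at bytes 12 and 16, each table entry a length/offset pair). The function names are made up, the image is assumed to be in the MCU's native byte order and trusted (all strings NUL-terminated), and the optional hash table is ignored:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MO_MAGIC 0x950412deu

/* Read a 32-bit word in native byte order without alignment assumptions. */
static uint32_t mo_u32(const unsigned char *image, size_t offset)
{
    uint32_t v;
    memcpy(&v, image + offset, sizeof v);
    return v;
}

/* Look msgid up in a .mo image held in memory (e.g. linked into Flash).
   Returns the translation, or msgid itself if the image is unusable
   or the string is not present. */
const char *mo_lookup(const unsigned char *image, size_t size,
                      const char *msgid)
{
    uint32_t count, orig_tab, trans_tab;

    if (image == NULL || size < 28 || mo_u32(image, 0) != MO_MAGIC)
        return msgid;

    count     = mo_u32(image, 8);    /* number of strings */
    orig_tab  = mo_u32(image, 12);   /* table of original strings */
    trans_tab = mo_u32(image, 16);   /* table of translations */

    /* Both tables (8 bytes per entry) must lie inside the image. */
    if (orig_tab > size || trans_tab > size ||
        count > (size - orig_tab) / 8 || count > (size - trans_tab) / 8)
        return msgid;

    for (uint32_t i = 0; i < count; ++i) {
        /* Each entry is { length, offset }; only the offset is needed. */
        uint32_t orig_off  = mo_u32(image, orig_tab + 8u * i + 4);
        uint32_t trans_off = mo_u32(image, trans_tab + 8u * i + 4);

        if (orig_off >= size || trans_off >= size)
            continue;
        if (strcmp((const char *)image + orig_off, msgid) == 0)
            return (const char *)image + trans_off;
    }
    return msgid;
}
```

How the image gets to a known address (linker script, an array generated from the file, and so on) depends on the toolchain and is left out here.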

Anyway thanks for the suggestions.

> (Any licensing issues are left as an exercise.)

And this is another good point to study.