Reply by Richard Bos September 15, 2017
Robert Wessel <robertwessel2@yahoo.com> wrote:

> On Sat, 09 Sep 2017 12:37:46 GMT, raltbos@xs4all.nl (Richard Bos)
> >Robert Wessel <robertwessel2@yahoo.com> wrote:
> >
> >> A problem is when a change needs to be made to only some translations
> >> of a message. Let's say the English one was awkwardly worded, and
> >> thus modified, but (some of) the other translations don't need to be
> >> changed (although they should probably be reviewed).
> >
> >I'd like to point out that none of that is a technical problem. It's a
> >management (and sometimes political) problem, and therefore needs a
> >solution at a management level, not primarily a technical one. There may
> >be technical _aids_ to the managerial solution, but not a technical
> >solution /per se/.
>
> Sure, but you can say the same about source code and configuration
> management too.

Yes, and I do. If you can't do a desk check, but require a colour-coding IDE to read your code, you're not a programmer, you're a hack. Liking syntax colouring is fine; it's needing it which is the problem: the IDE should aid your code comprehension, not create it.

Similarly, if you can't work with two people on one problem without needing to go through GitHub, you don't have enough social skills to be a team player. Two thousand is another matter, but there, too, the basic problem is organising the teams, _not_ setting up the Git repository.

Richard
Reply by Robert Wessel September 9, 2017
On Sat, 09 Sep 2017 12:37:46 GMT, raltbos@xs4all.nl (Richard Bos)
wrote:

>Robert Wessel <robertwessel2@yahoo.com> wrote: > >> On Tue, 05 Sep 2017 15:17:56 -0700, Keith Thompson <kst-u@mib.org> >> >> >Robert Wessel <robertwessel2@yahoo.com> writes: > >> >> Well yes, (and ours are text based), but I've yet to see a SCM that >> >> can tell you that someone updated the English version of message#14, >> >> but hasn't validated or updated the Italian one yet. You can >> >> certainly get a diff and manually see where changes have been made, >> >> but that still leaves you with a manual comparison to the >> >> translations, and no way of tracking that the validations have been >> >> done. >> > >> >The "git blame" command tells you, for each line in a file, when that >> >line was most recently modified. Other SCMs have similar tools. >> >I imagine you could build some tools on top of that that could >> >warn you, for example, that the English version of message #14 was >> >updated yesterday but the Italian version hasn't been changed in >> >the last year. >> >> A problem is when a change needs to be made to only some translations >> of a message. Let's say the English one was awkwardly worded, and >> thus modified, but (some of) the other translations don't need to be >> changed (although they should probably be reviewed). > >I'd like to point out that none of that is a technical problem. It's a >management (and sometimes political) problem, and therefore needs a >solution at a management level, not primarily a technical one. There may >be technical _aids_ to the managerial solution, but not a technical >solution /per se/.
Sure, but you can say the same about source code and configuration management too.
Reply by Richard Bos September 9, 2017
Robert Wessel <robertwessel2@yahoo.com> wrote:

> On Tue, 05 Sep 2017 15:17:56 -0700, Keith Thompson <kst-u@mib.org>
>
> >Robert Wessel <robertwessel2@yahoo.com> writes:
> >> Well yes, (and ours are text based), but I've yet to see a SCM that
> >> can tell you that someone updated the English version of message#14,
> >> but hasn't validated or updated the Italian one yet. You can
> >> certainly get a diff and manually see where changes have been made,
> >> but that still leaves you with a manual comparison to the
> >> translations, and no way of tracking that the validations have been
> >> done.
> >
> >The "git blame" command tells you, for each line in a file, when that
> >line was most recently modified. Other SCMs have similar tools.
> >I imagine you could build some tools on top of that that could
> >warn you, for example, that the English version of message #14 was
> >updated yesterday but the Italian version hasn't been changed in
> >the last year.
>
> A problem is when a change needs to be made to only some translations
> of a message. Let's say the English one was awkwardly worded, and
> thus modified, but (some of) the other translations don't need to be
> changed (although they should probably be reviewed).

I'd like to point out that none of that is a technical problem. It's a management (and sometimes political) problem, and therefore needs a solution at a management level, not primarily a technical one. There may be technical _aids_ to the managerial solution, but not a technical solution /per se/.

Richard
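
As an illustration of what such a technical aid might look like (a sketch only, not something proposed in the thread): each message could carry a revision counter for the English text, and each translation could record which English revision it was last validated against. The names `msg_entry`, `english_rev`, `validated_rev`, `is_stale` and `count_stale` are all hypothetical. Deciding who reviews a stale entry, and when, remains the management problem described above.

```c
#include <stddef.h>

/* Hypothetical language set; English is the reference text. */
enum { LANG_ITALIAN, LANG_GERMAN, LANG_COUNT };

struct msg_entry {
    unsigned english_rev;                /* bumped whenever the English text changes */
    unsigned validated_rev[LANG_COUNT];  /* English revision each translation was last checked against */
};

/* A translation is stale if it was validated against an older English revision. */
int is_stale(const struct msg_entry *m, int lang)
{
    return m->validated_rev[lang] < m->english_rev;
}

/* Count the messages whose translation into `lang` still needs a review. */
size_t count_stale(const struct msg_entry *msgs, size_t n, int lang)
{
    size_t stale = 0;
    for (size_t i = 0; i < n; i++)
        if (is_stale(&msgs[i], lang))
            stale++;
    return stale;
}
```

A reviewer who decides that a translation needs no change would simply set `validated_rev[lang]` to the current `english_rev`, which covers the "reviewed but unchanged" case raised earlier.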
Reply by Mel Wilson September 8, 2017
On Fri, 08 Sep 2017 10:43:48 -0400, Richard Damon wrote:
> On 9/5/17 3:30 AM, pozz wrote:
>> The _() function will search the translated string based on the current
>> language. If he can't find, it could return the first translation
>> (english).
>>
>> This approach has some disadvantages. It's difficult to exclude one
>> language from the build. If the languages are more than a couple, the
>> strings will be very long. The order of the translations (first
>> english, than italian, ...) is important and you have to remember it
>> for every string.
>>
>>
>> What approach do you use?
>
> If you really want to look by string, then your _ function just needs to
> search the translation table for that value, and then return the string
> desired translation instead of passing in the index. The lookup will
> take a bit of time, but not that long given your number of strings, and
> if you sort the strings by the base translation, you could binary search
> to find it.

This may have been said already (who can tell, here), but the gettext convention _("Some string that has to be translated") can make it easy for some preprocessors of your own to pick out these strings, manage a database of translations, and substitute the translations into individual builds for each language. The '_' can just be a syntactic marker -- doesn't need to be a callable function at all.
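
A minimal sketch of that idea, under stated assumptions: in the firmware the marker is a macro that expands to its argument, and a separate host-side tool scans the source text for `_("...")` to collect the strings. The function name `next_marked_string` is hypothetical. The scanner is deliberately naive: it does not understand comments, and it will also match identifiers that merely end in `_`.

```c
#include <stddef.h>
#include <string.h>

/* In the firmware source the marker can expand to nothing at all. */
#define _(s) s

/*
 * Find the next _("...") marker at or after *pos in `src`, copy the raw
 * contents of the literal (escape sequences are left as written) into `out`,
 * and advance *pos. Returns 1 if a string was found, 0 when no marker is left.
 */
int next_marked_string(const char *src, size_t *pos, char *out, size_t outsz)
{
    const char *p = strstr(src + *pos, "_(\"");
    if (p == NULL || outsz == 0)
        return 0;
    p += 3;                                   /* skip past _(" */
    size_t n = 0;
    while (*p != '\0' && *p != '"') {
        if (*p == '\\' && p[1] != '\0') {     /* keep escaped characters, including \" */
            if (n + 1 < outsz)
                out[n++] = *p;
            p++;
        }
        if (n + 1 < outsz)                    /* silently truncate over-long strings */
            out[n++] = *p;
        p++;
    }
    out[n] = '\0';
    *pos = (size_t)(p - src);
    return 1;
}
```

Calling it in a loop over each source file yields the list of strings to hand to a translator; a per-language build step could then substitute the translated text back in place of each marker.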
Reply by David Brown September 8, 2017
On 08/09/17 15:48, bartc wrote:
> On 08/09/2017 12:37, David Brown wrote: >> On 08/09/17 12:32, bartc wrote: > >>> The method is basically this, assuming you have tables of messages for >>> all languages in memory: >>> >>> * Take an English message M >>> * Look it up in the English table, to get index N (with 100 messages, >>> a linear search will do) >>> * If N is in range, return the string from table[N] for language L >>> * If M wasn't found, just return M, the English version. >>> >>> So, probably a 10 or 20 line function. >> >> That can be okay for a starting point, but it has a /big/ problem - you >> only get one entry for each original English language message. When you >> are translating messages, it is not uncommon to encounter different >> messages with the same text in the original language but different texts >> in the translations. In gettext, this is done by including __FILE__ and >> __LINE__ in the lookup. > > The OP said there are 50-100 messages. Then any clashes (of the same > English text with different meanings) can be handled manually. > > But my scheme with references numbers can fix that. That can be extended > to annotate messages give a general method of disambiguating messages > with multiple meanings.
Indeed it will handle it. But it means you have to have numbers in the code, and match it up with numbers in the translation files. Once you start having that sort of thing, you lose the benefits of having a simple direct text in the code. So you might as well cut out that text in the code and put it in the messages file. And then you might as well use an enumerated type - then instead of arbitrary numbers with no connection to indexing and manual checking for collisions, you have a header file with the enumerated type defined, symbols with useful names (like "str_hello_world"), checking by the compiler for errors, automatic completion from within your IDE, and fast and simple lookup in the actual table.
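
A sketch of the enumerated-type layout described above, assuming two languages and two messages. The identifiers `str_hello_world`, `STR_COUNT`, `tr` and `set_language` are illustrative names, not taken from any library. Designated initializers tie each row to its enumerator, so reordering the enum cannot silently shift the table.

```c
#include <stddef.h>

enum lang { LANG_ENGLISH, LANG_ITALIAN, LANG_COUNT };

enum string_id {
    str_hello_world,
    str_how_are_you,
    STR_COUNT               /* always last: number of messages */
};

/* One row per message, one column per language. */
static const char *const strings[STR_COUNT][LANG_COUNT] = {
    [str_hello_world] = { "Hello world!", "Ciao mondo!" },
    [str_how_are_you] = { "How are you?", "Come stai?"  },
};

static enum lang current_lang = LANG_ENGLISH;

void set_language(enum lang l)
{
    if ((unsigned)l < LANG_COUNT)
        current_lang = l;
}

/* Direct table lookup; a missing translation falls back to English. */
const char *tr(enum string_id id)
{
    if ((unsigned)id >= STR_COUNT)
        return "";
    const char *s = strings[id][current_lang];
    return s != NULL ? s : strings[id][LANG_ENGLISH];
}
```

In this layout a call site reads `print_to_display(x, y, tr(str_hello_world));`, which trades the literal text in the source for compiler-checked names.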
> >>> >>> This would require two copies of each English message, one in the >>> source, and one in a searchable table. And that needs maintenance. >>> >>> You might be able to get around that by embedding a serial number in >>> each English message: >>> >>> puts(_("Please enter filename: !078")); >>> >>> Here the '!078" is the number and does not appear (or you can use {78} >>> etc, any scheme will do). >> >> Maintenance of the string numbers here is a hassle. > > No, the numbers can be anything, including any text, or can be > annotations.
Yes, it is a hassle - you have to be sure there are no conflicts, and you have to match them up in your translation file. That's easy for a small program, but scales poorly and cannot be checked by the compiler.
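
The compiler cannot see inside the string tags, but a unit test or start-up self-check can at least catch missing or duplicated ones. The following is a sketch that assumes the "number, then `!`, then text" layout used elsewhere in this thread (for example "1!uno"); `tag_length` and `find_bad_tag` are hypothetical helper names.

```c
#include <stddef.h>
#include <string.h>

/* Length of the tag before '!', or 0 if the message has no tag. */
static size_t tag_length(const char *msg)
{
    const char *bang = strchr(msg, '!');
    return bang != NULL ? (size_t)(bang - msg) : 0;
}

/*
 * Returns the index of the first message whose tag is missing or repeats an
 * earlier tag, or -1 if every tag is present and unique.
 */
int find_bad_tag(const char *const *table, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t len = tag_length(table[i]);
        if (len == 0)
            return (int)i;
        for (size_t j = 0; j < i; j++)
            if (tag_length(table[j]) == len &&
                memcmp(table[i], table[j], len) == 0)
                return (int)i;
    }
    return -1;
}
```

This is quadratic in the number of messages, which is harmless for a table of 100 entries, and it only checks one table at a time; matching tags across the per-language tables would need a similar pass.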
> But in this scheme, every message must have an annotation, > and that will can the appearance of the message within the source. > > >>> Another reason to forget using anyone else's library. >> >> No, it is another reason to look at the licensing before using other >> libraries. People write libraries with the intention of letting other >> people use them - you just need to make sure the licensing is suitable. > > How does that library deal with the issues of extracting the messages in > a format that can be submitted to a translator (who might be in a > different country), and what format are they sent back in, or submitted > to the program?
Do you mean gettext in particular, or some arbitrary library in general? The licensing issue was a general point. gettext comes with tools to aid translating and maintaining the translation files. I'll let you look up the details - there is little point in having me copy-and-paste stuff off the web.
> > What about when the program is revised, and messages are deleted, added > or modified? > > How does it deal with multiple instances of the same message that differ > only in leading or trailing punctuation or capitalisation? Do multiple > messages have to be provided? > > What about the problem raised above of the same English words having a > different meaning depending on context?
It is all handled by gettext. Whether you like or dislike the way it is handled, is up to you. Again, look up the details if you want.
> > I looked at docs for gettext and it's a 275 pages in PDF format; 378 > pages in Word. How many messages did the OP want to deal with again? >
Since gettext is quite big, and has a license unsuitable for most embedded software, it is unlikely to be the answer for the OP. It is a /heavyweight/ solution. It might help inspire ideas for the OP, but it is not a practical choice for him. It is, however, an excellent choice for many other projects and programs. And there are several other gettext-like libraries around, which might be a workable choice depending on what the OP likes.
> (The scheme I outlined in my first post in the thread dealt with all > this. And it totalled a few hundred lines of code. Actually I don't > think I needed the translations at all; they could be loaded locally > from a file, with an existing version of the application, so no > intervention was needed.) >
The scheme you outlined is a possibility, and I'm sure the OP will consider it. It is not the way /I/ would do it (I mentioned that in my first reply in the thread), but it could work.
Reply by Richard Damon September 8, 2017
On 9/5/17 3:30 AM, pozz wrote:
> [This message is posted to comp.arch.embedded and comp.lang.c]
>
> Just for reference, an embedded platform based on a MCU with integrated
> Flash, for example a Cortex-Mx device.
> Here I consider only western languages (left-to-right and european
> chars, english, french, german, spanish and so on).
>
> The main problem is the translation of strings, maybe 10-100 strings.
>
> I know something about gettext package that can't be used in those
> embeded platforms. However I like the approach of gettext.
>
>   print_to_display(x, y, "Hello world!");
>
> is simply changed in:
>
>   #include <gettext.h>
>   ...
>   print_to_display(x, y, _("Hello world!"));
>
> In this way, the code stays highly readble as before introducing the
> multi-language support. If a member of a structure needs a string, it is
> a char * as usual.
>
> The solution I found in embedded platforms is to use an array of array
> of strings: one index for the string and one index for the language.
>
>   enum lang_t { ENGLISH, ITALIAN, LANG_N };
>   enum string_t { STR_HELLO_WORLD, STR_HOW_ARE_YOU };
>
>   const char *strings[STRING_N][LANG_N] = {
>     {  // STR_HELLO_WORLD
>       { "Hello world!", "Ciao mondo" }
>     },
>     {  // STR_HOW_ARE_YOU
>       { "How are you?", "Come stai?" }
>     },
>   };
>
>   static enum lang_t lang = ENGLISH;
>
>   const char *_(int string_idx) {
>     return strings[string_idx][lang];
>   }
>
>   void set_language(enum lang_t new_language) {
>     lang = new_language;
>   }
>
>
> I don't like too much this approach for two reasons.
> The first, the line:
>
>   print_to_display(x, y, _(STR_HELLO_WORLD));
>
> is much less readable than
>
>   print_to_display(x, y, _("Hello world!"));
>
>
> The second, I need to change the type of some members/variables from
> char * to int:
>
>   struct mystruct {
>     int title;  // Instead of char *title
>     ...
>   };
>
>
> Another approach I'm thinking is to embed all the translations in the
> string, using a separator character that can't be used in normal strings.
>
>   print_to_display(x, y, _("Hello world!|Ciao Mondo!"));
>
> The _() function will search the translated string based on the current
> language. If he can't find, it could return the first translation
> (english).
>
> This approach has some disadvantages. It's difficult to exclude one
> language from the build. If the languages are more than a couple, the
> strings will be very long. The order of the translations (first english,
> than italian, ...) is important and you have to remember it for every
> string.
>
>
> What approach do you use?

If you really want to look up by string, then your _ function just needs to search the translation table for that value and return the desired translation, instead of taking an index. The lookup will take a bit of time, but not that long given your number of strings, and if you sort the strings by the base translation, you can binary search to find it.
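
A sketch of that lookup using the standard `bsearch()`, assuming a table kept sorted by the English text in `strcmp()` order. The struct layout, the sample strings and the `translate` function are illustrative only, with a single hypothetical second language.

```c
#include <stdlib.h>
#include <string.h>

struct translation {
    const char *english;    /* lookup key */
    const char *italian;    /* hypothetical second language */
};

/* Must stay sorted by `english` in strcmp() order for bsearch() to work. */
static const struct translation table[] = {
    { "Hello world!",            "Ciao mondo!" },
    { "How are you?",            "Come stai?" },
    { "Please enter filename: ", "Inserire il nome del file: " },
};

/* bsearch() passes the key first and the table element second. */
static int compare_key(const void *key, const void *elem)
{
    const struct translation *t = elem;
    return strcmp((const char *)key, t->english);
}

/* Returns the translation, or the English text itself if none is found. */
const char *translate(const char *english, int use_italian)
{
    if (!use_italian)
        return english;
    const struct translation *t =
        bsearch(english, table, sizeof table / sizeof table[0],
                sizeof table[0], compare_key);
    return t != NULL ? t->italian : english;
}
```

The sort order is easy to break by hand, so a build script or a start-up check that verifies it may be worth having; for around 100 strings a linear search avoids the issue at little cost.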
Reply by bartc September 8, 2017
On 08/09/2017 12:37, David Brown wrote:
> On 08/09/17 12:32, bartc wrote:
>> The method is basically this, assuming you have tables of messages for >> all languages in memory: >> >> * Take an English message M >> * Look it up in the English table, to get index N (with 100 messages, >> a linear search will do) >> * If N is in range, return the string from table[N] for language L >> * If M wasn't found, just return M, the English version. >> >> So, probably a 10 or 20 line function. > > That can be okay for a starting point, but it has a /big/ problem - you > only get one entry for each original English language message. When you > are translating messages, it is not uncommon to encounter different > messages with the same text in the original language but different texts > in the translations. In gettext, this is done by including __FILE__ and > __LINE__ in the lookup.
The OP said there are 50-100 messages. Then any clashes (of the same English text with different meanings) can be handled manually.

But my scheme with reference numbers can fix that. That can be extended to annotate messages, giving a general method of disambiguating messages with multiple meanings.
>> >> This would require two copies of each English message, one in the >> source, and one in a searchable table. And that needs maintenance. >> >> You might be able to get around that by embedding a serial number in >> each English message: >> >> puts(_("Please enter filename: !078")); >> >> Here the '!078" is the number and does not appear (or you can use {78} >> etc, any scheme will do). > > Maintenance of the string numbers here is a hassle.
No, the numbers can be anything, including any text, or can be annotations. But in this scheme, every message must have an annotation, and that will change the appearance of the message within the source.
>> Another reason to forget using anyone else's library. > > No, it is another reason to look at the licensing before using other > libraries. People write libraries with the intention of letting other > people use them - you just need to make sure the licensing is suitable.
How does that library deal with the issues of extracting the messages in a format that can be submitted to a translator (who might be in a different country), and what format are they sent back in, or submitted to the program?

What about when the program is revised, and messages are deleted, added or modified?

How does it deal with multiple instances of the same message that differ only in leading or trailing punctuation or capitalisation? Do multiple messages have to be provided?

What about the problem raised above of the same English words having a different meaning depending on context?

I looked at the docs for gettext and they run to 275 pages in PDF format; 378 pages in Word. How many messages did the OP want to deal with again?

(The scheme I outlined in my first post in the thread dealt with all this. And it totalled a few hundred lines of code. Actually I don't think I needed the translations at all; they could be loaded locally from a file, with an existing version of the application, so no intervention was needed.)

--
bartc
Reply by David Brown September 8, 2017
On 08/09/17 12:32, bartc wrote:
> On 08/09/2017 10:33, pozz wrote: >> Il 07/09/2017 19:55, Keith Thompson ha scritto: > >>> In a very quick look at the gettext sources, I see that the >>> gettext-runtime/src subdirectory contains about 1300 lines of C code. >>> If that's all that needs to run on the target system, you might even be >>> able to use it without modification. >> >> This is a good suggestion. I'm not an expert of gettext, however I >> remember it loads/search for right strings (based on current language) >> at runtime, looking at the content of a binary file (mo extension). >> >> In my embedded platform I don't have a real filesystem so I can't >> access "files" at runtime. >> >> Maybe I could add the mo files in the output binary file (the image of >> the Flash memory of the MCU) at exact locations and change gettext >> code to look at those fixed addresses instead of accessing files. >> >> Anyway thanks for the suggestions. > > gettext looks like a very heavy-duty approach. (A lot of these third > party solutions are. My experience was that a third party library that > took care of 5% of the functionality of my application, would be several > times bigger than my entire app.)
gettext /is/ a heavy-duty approach. It is designed to separate the program code and the translation texts, so that they can be written by different people, compiled separately, distributed separately, and (if desired) updated separately - because the binary and the translation files are all separate files. It is a very useful approach for many kinds of program - but too big and complex for what the OP wants, I believe.
> > The method is basically this, assuming you have tables of messages for > all languages in memory: > > * Take an English message M > * Look it up in the English table, to get index N (with 100 messages, > a linear search will do) > * If N is in range, return the string from table[N] for language L > * If M wasn't found, just return M, the English version. > > So, probably a 10 or 20 line function.
That can be okay for a starting point, but it has a /big/ problem - you only get one entry for each original English language message. When you are translating messages, it is not uncommon to encounter different messages with the same text in the original language but different texts in the translations. In gettext, this is done by including __FILE__ and __LINE__ in the lookup.
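
For a small hand-rolled table, one way to handle identical English text with different translations is to key the lookup on a context string as well as the text. GNU gettext offers message contexts through `pgettext()` for the same purpose. The sketch below is not gettext's implementation; `ctx_entry` and `translate_ctx` are hypothetical names and the Italian strings are sample data.

```c
#include <stddef.h>
#include <string.h>

struct ctx_entry {
    const char *context;    /* where the text is used, e.g. "menu" or "status" */
    const char *english;
    const char *italian;
};

static const struct ctx_entry entries[] = {
    { "menu",   "Open", "Apri"   },   /* verb: open a file */
    { "status", "Open", "Aperto" },   /* adjective: the valve is open */
};

/* Match on both context and English text; fall back to the English text. */
const char *translate_ctx(const char *context, const char *english)
{
    for (size_t i = 0; i < sizeof entries / sizeof entries[0]; i++)
        if (strcmp(entries[i].context, context) == 0 &&
            strcmp(entries[i].english, english) == 0)
            return entries[i].italian;
    return english;
}
```

The call site stays readable, for example `translate_ctx("menu", "Open")`, and the context doubles as a hint for the translator.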
> > This would require two copies of each English message, one in the > source, and one in a searchable table. And that needs maintenance. > > You might be able to get around that by embedding a serial number in > each English message: > > puts(_("Please enter filename: !078")); > > Here the '!078" is the number and does not appear (or you can use {78} > etc, any scheme will do).
Maintenance of the string numbers here is a hassle.
> > Now you just have to search the table for language L for a message with > the same number. (You don't need to convert to an integer, just compare > the last few characters.) > > Of course, you need to return a string without the !078 etc in it. For > that purpose, it might be better to put this number at the start. Then > you return a string pointing to the just past the number. > > (See example below using such a scheme. This might give some ideas.) > > You can use the number as an actual index, but the maintenance becomes > harder. > > There is still the problem of producing a list of English messages for > translators to work from. But the format and ordering of that is not > critical. > >>> (Any licensing issues are left as an exercise.) >> >> And this is another good point to study. > > Another reason to forget using anyone else's library.
No, it is another reason to look at the licensing before using other libraries. People write libraries with the intention of letting other people use them - you just need to make sure the licensing is suitable.
Reply by bartc September 8, 2017
On 08/09/2017 10:33, pozz wrote:
> Il 07/09/2017 19:55, Keith Thompson ha scritto:
>> In a very quick look at the gettext sources, I see that the
>> gettext-runtime/src subdirectory contains about 1300 lines of C code.
>> If that's all that needs to run on the target system, you might even be
>> able to use it without modification.
>
> This is a good suggestion. I'm not an expert of gettext, however I
> remember it loads/search for right strings (based on current language)
> at runtime, looking at the content of a binary file (mo extension).
>
> In my embedded platform I don't have a real filesystem so I can't access
> "files" at runtime.
>
> Maybe I could add the mo files in the output binary file (the image of
> the Flash memory of the MCU) at exact locations and change gettext code
> to look at those fixed addresses instead of accessing files.
>
> Anyway thanks for the suggestions.

gettext looks like a very heavy-duty approach. (A lot of these third party solutions are. My experience was that a third party library that took care of 5% of the functionality of my application would be several times bigger than my entire app.)

The method is basically this, assuming you have tables of messages for all languages in memory:

* Take an English message M
* Look it up in the English table, to get index N (with 100 messages, a linear search will do)
* If N is in range, return the string from table[N] for language L
* If M wasn't found, just return M, the English version.

So, probably a 10 or 20 line function.

This would require two copies of each English message, one in the source, and one in a searchable table. And that needs maintenance.

You might be able to get around that by embedding a serial number in each English message:

  puts(_("Please enter filename: !078"));

Here the "!078" is the number and does not appear (or you can use {78} etc, any scheme will do).

Now you just have to search the table for language L for a message with the same number. (You don't need to convert to an integer, just compare the last few characters.)

Of course, you need to return a string without the !078 etc in it. For that purpose, it might be better to put this number at the start. Then you return a pointer to just past the number.

(See example below using such a scheme. This might give some ideas.)

You can use the number as an actual index, but the maintenance becomes harder.

There is still the problem of producing a list of English messages for translators to work from. But the format and ordering of that is not critical.
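
The first method, where the English text itself is the key, can be sketched as follows. This assumes parallel tables in which the same index means the same message; the table contents and the names `lookup_message` and `select_italian` are illustrative only.

```c
#include <stddef.h>
#include <string.h>

#define NMESSAGES 3

/* Parallel tables: index N refers to the same message in every language. */
static const char *const english[NMESSAGES] = { "one", "two", "end"  };
static const char *const italian[NMESSAGES] = { "uno", "due", "fine" };

/* NULL means no translation table is selected: stay in English. */
static const char *const *current = NULL;

void select_italian(int on)
{
    current = on ? italian : NULL;
}

/* Linear search of the English table; unknown messages come back unchanged. */
const char *lookup_message(const char *m)
{
    if (current == NULL)
        return m;
    for (size_t n = 0; n < NMESSAGES; n++)
        if (strcmp(english[n], m) == 0)
            return current[n];
    return m;
}
```

The duplicated English text (once at the call site, once in `english[]`) is the maintenance cost mentioned above, and it is what the embedded serial number is meant to avoid.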
>> (Any licensing issues are left as an exercise.) > > And this is another good point to study.
Another reason to forget using anyone else's library.

--------------------------------------------------------------

#include <stdio.h>
#include <string.h>

char *italian[] = {
    "1!uno",
    "2!due",
    "3!tre",
    "4!quattro",
    "5!fine",
};

char *spanish[] = {
    "2!dos",
    "3!tres",
    "1!uno",
    "4!cuatro",
    "5!fin"
};

char *german[] = {
    "4!vier",       // ordering doesn't matter
    "5!Ende",
    "1!eins",
    "2!zwei",
    "3!drei"
};

//char **currlang = italian;
//char **currlang = german;
char **currlang = spanish;
//char **currlang = NULL;

int nmessages=sizeof(italian)/sizeof(italian[0]);

char* skipprefix(char* M){
    char *s=M;
    while (*s!='!' && *s!=0) ++s;
    if (*s==0) return M;
    return s+1;
}

char* lookup(char* M){
    char *s;
    int i,len;

    s=skipprefix(M);
    len=s-M;
    if (currlang==NULL || len==0) return s;    // English

    for (i=0; i<nmessages; ++i)
        if (memcmp(currlang[i],M,len)==0)
            return skipprefix(currlang[i]);

    return s;                                  // not found
}

int main(void) {
    int i;

    puts(lookup("1!one"));
    puts(lookup("2!two"));
    puts(lookup("3!three"));
    puts(lookup("5!end"));
}

--
bartc
Reply by pozz September 8, 2017
Il 07/09/2017 19:55, Keith Thompson ha scritto:
> pozz <pozzugno@gmail.com> writes:
>> [This message is posted to comp.arch.embedded and comp.lang.c]
>>
>> Just for reference, an embedded platform based on a MCU with
>> integrated Flash, for example a Cortex-Mx device.
>> Here I consider only western languages (left-to-right and european
>> chars, english, french, german, spanish and so on).
>>
>> The main problem is the translation of strings, maybe 10-100 strings.
>>
>> I know something about gettext package that can't be used in those
>> embeded platforms. However I like the approach of gettext.
>>
>> print_to_display(x, y, "Hello world!");
>>
>> is simply changed in:
>>
>> #include <gettext.h>
>> ...
>> print_to_display(x, y, _("Hello world!"));
> [snip]
>
> GNU gettext is free software, licensed under GPLv3. I wonder if you
> could grab a copy of it, remove any functionality you don't need, and
> end up with something small enough to work on your embedded system.
>
> In a very quick look at the gettext sources, I see that the
> gettext-runtime/src subdirectory contains about 1300 lines of C code.
> If that's all that needs to run on the target system, you might even be
> able to use it without modification.

This is a good suggestion. I'm not a gettext expert, but as I remember it loads and searches for the right strings (based on the current language) at runtime, reading the content of a binary file (.mo extension).

On my embedded platform I don't have a real filesystem, so I can't access "files" at runtime.

Maybe I could add the .mo files to the output binary (the image of the MCU's Flash memory) at exact locations and change the gettext code to look at those fixed addresses instead of accessing files.

Anyway, thanks for the suggestions.
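
A rough sketch of what reading a .mo image straight from memory could look like, written independently of the gettext sources. It assumes the .mo layout documented in the GNU gettext manual (magic number 0x950412de, a seven-word header, then tables of length/offset pairs), an image in the same byte order as the CPU, and a linear scan instead of the optional hash table. `mo_lookup` is a hypothetical name, and no bounds checking is done on the image.

```c
#include <stdint.h>
#include <string.h>

#define MO_MAGIC 0x950412deu

/* Read a 32-bit word without assuming the image is aligned. */
static uint32_t rd32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/*
 * Look up `msgid` in a .mo image located at `img` (for example a fixed Flash
 * address). Returns the translation, or `msgid` itself if the image is not
 * recognised or the string is not present.
 */
const char *mo_lookup(const unsigned char *img, const char *msgid)
{
    if (rd32(img) != MO_MAGIC)
        return msgid;
    uint32_t n = rd32(img + 8);                         /* number of strings */
    const unsigned char *orig  = img + rd32(img + 12);  /* original-string table */
    const unsigned char *trans = img + rd32(img + 16);  /* translation table */
    for (uint32_t i = 0; i < n; i++) {
        /* Each table entry is { length, offset }; strings are NUL-terminated. */
        const char *o = (const char *)img + rd32(orig + 8 * i + 4);
        if (strcmp(o, msgid) == 0)
            return (const char *)img + rd32(trans + 8 * i + 4);
    }
    return msgid;
}
```

How the image ends up at a known address (linker script section, `incbin`-style directive, or a generated C array) depends on the toolchain and is left open here.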
> (Any licensing issues are left as an exercise.)
And this is another good point to study.