Multi-language support on embedded platforms

Started by pozz September 5, 2017
[This message is posted to comp.arch.embedded and comp.lang.c]

Just for reference: an embedded platform based on an MCU with integrated 
Flash, for example a Cortex-Mx device.
Here I consider only Western languages (left-to-right and European 
characters: English, French, German, Spanish and so on).

The main problem is the translation of strings, maybe 10-100 strings.

I know something about the gettext package, which can't be used on these 
embedded platforms. However, I like the gettext approach.

   print_to_display(x, y, "Hello world!");

is simply changed to:

   #include <gettext.h>
   ...
   print_to_display(x, y, _("Hello world!"));

In this way, the code stays as highly readable as it was before 
introducing multi-language support. If a member of a structure needs a 
string, it is a char * as usual.

The solution I found on embedded platforms is to use a two-dimensional 
array of strings: one index for the string and one for the language.

   enum lang_t { ENGLISH, ITALIAN, LANG_N };
   enum string_t { STR_HELLO_WORLD, STR_HOW_ARE_YOU, STRING_N };

   const char *strings[STRING_N][LANG_N] = {
     // STR_HELLO_WORLD
     { "Hello world!", "Ciao mondo!" },
     // STR_HOW_ARE_YOU
     { "How are you?", "Come stai?" },
   };

   static enum lang_t lang = ENGLISH;

   const char *_(enum string_t string_idx) {
     return strings[string_idx][lang];
   }

   void set_language(enum lang_t new_language) {
     lang = new_language;
   }


I don't like this approach too much, for two reasons.
First, the line:

   print_to_display(x, y, _(STR_HELLO_WORLD));

is much less readable than

   print_to_display(x, y, _("Hello world!"));


Second, I need to change the type of some members/variables from 
char * to an integer index:

   struct mystruct {
     int title;  // Instead of char *title
     ...
   };


Another approach I'm considering is to embed all the translations in the 
string itself, using a separator character that can't appear in normal 
strings.

   print_to_display(x, y, _("Hello world!|Ciao Mondo!"));

The _() function searches for the translation matching the current 
language. If it can't find one, it returns the first translation (English).

This approach has some disadvantages. It's difficult to exclude one 
language from the build. If there are more than a couple of languages, 
the strings become very long. And the order of the translations (first 
English, then Italian, ...) is significant, so you have to remember it 
for every string.
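
A minimal sketch of this separator approach (the function name, buffer 
size and fallback behaviour are my assumptions):

```c
#include <string.h>

#define MAX_TRANSLATION 64   /* longest single translation, assumed */

int cur_lang = 0;            /* 0 = first language in each string (English) */

/* Return the cur_lang-th '|'-separated segment of s.  If that segment
   doesn't exist (missing translation), fall back to the first one.
   Note: uses a static buffer, so the result is only valid until the
   next call. */
const char *tr(const char *s)
{
    static char buf[MAX_TRANSLATION];
    const char *start = s;

    for (int lang = cur_lang; lang > 0; lang--) {
        const char *sep = strchr(start, '|');
        if (sep == NULL) {   /* translation missing: fall back to first */
            start = s;
            break;
        }
        start = sep + 1;
    }

    const char *end = strchr(start, '|');
    size_t len = end ? (size_t)(end - start) : strlen(start);
    if (len >= sizeof buf)
        len = sizeof buf - 1;
    memcpy(buf, start, len);
    buf[len] = '\0';
    return buf;
}
```

With cur_lang set to a language index beyond the available translations, 
tr() falls back to the first (English) segment, matching the behaviour 
described above.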


What approach do you use?
On 05/09/17 09:30, pozz wrote:
> [This message is posted to comp.arch.embedded and comp.lang.c]
>
> Just for reference, an embedded platform based on a MCU with integrated
> Flash, for example a Cortex-Mx device.
[snip]
> What approach do you use?
I think if you are looking for a pure C approach, and you want to keep
it efficient, then using the enumerated type as an index is the best
choice.

But rather than writing all the strings directly in C, I would keep
track of them in a spreadsheet saved in tab-delimited format, and use a
little script to turn it into a C header file declaring the enum and a
C source file initialising the array.  It just makes it easier to keep
track of everything, and saves a great deal of effort when you need to
get someone else to make the translation strings.
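
The generated header and source David describes might look something 
like this (a sketch; the file names, enum values and use of C99 
designated initialisers are my assumptions, not something David 
specified):

```c
/* strings.h + strings.c, shown together -- would be generated from the
   tab-delimited spreadsheet by the script, never edited by hand */

enum lang_t   { LANG_EN, LANG_IT, LANG_N };
enum string_t { STR_HELLO_WORLD, STR_HOW_ARE_YOU, STRING_N };

extern const char *const strings[STRING_N][LANG_N];

const char *const strings[STRING_N][LANG_N] = {
    [STR_HELLO_WORLD] = { "Hello world!", "Ciao mondo!" },
    [STR_HOW_ARE_YOU] = { "How are you?", "Come stai?"  },
};
```

The spreadsheet then becomes the single source of truth: adding a row 
regenerates both the enum and the table, so they cannot get out of sync.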
In article <ooljr9$2kp$1@dont-email.me>, pozzugno@gmail.com says...
> [This message is posted to comp.arch.embedded and comp.lang.c]
[snip]
Whatever method you choose, there is one thing that has to be done
procedurally and will get overlooked when a time-constrained bug fix
occurs: part of the fix is to change a string to correct an error or
change a feature, and whoever is updating it 'forgets', or is
time-pressured for release, and fails to do ALL the other translations.
There is no easy solution for that, as it involves people.
> The solution I found in embedded platforms is to use an array of array
> of strings: one index for the string and one index for the language.
[snip]
The problem with having situations where the SAME string is in two
places is something I have seen fail on desktop applications.  If the
string in the translation table is NOT identical to the string in the
code section, it fails, e.g.

   Table contains "hello world"
   Code contains  "hello world\n"

This is also more likely to happen where the same string has been
copy/pasted into two different parts of the code that print the same
string.  Then a correction is required, so someone diligently corrects
the translation table and ONE place in the code where they see the
problem, but does not realise there are OTHER instances of the same
string.

Whilst having index keys to strings (as constants) may be less readable
(you can always add inline comments), it does save on storage space and
iterative long string compares, if execution speed is also a problem.
It also cuts down on typos and other accidental differences between
strings.

Whatever you do, you need procedures to ensure all strings are
translated into all languages for every change of any string.  The
human part is the weak link.
> What approach do you use?
-- 
Paul Carpenter | paul@pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/>            PC Services
<http://www.pcserviceselectronics.co.uk/LogicCell/>  Logic Gate Education
<http://www.pcserviceselectronics.co.uk/fonts/>      Timing Diagram Font
<http://www.badweb.org.uk/>                          For those web sites you hate
On Tue, 05 Sep 2017 10:43:24 +0200, David Brown
<david.brown@hesbynett.no> wrote:

>On 05/09/17 09:30, pozz wrote:
[snip]
>
>I think if you are looking for a pure C approach, and you want to keep
>it efficient, then using the enumerated type as an index is the best
>choice.
>
>But rather than writing all the strings directly in C, I would keep
>track of them in a spreadsheet saved in tab delimited format, and use a
>little script to turn it into a C header file declaring the enum, and a
>C source file initialising the array.  It just makes it easier to keep
>track of everything, and saves a great deal of effort when you need to
>get someone else to make the translation strings.
We use a preprocessor approach as well, although not from a
spreadsheet.  An advantage of a preprocessor is that it makes it easy
to slot default messages (IOW English) in for items which have missing
translations, and to build subsets of the supported languages and
messages to keep space requirements down.

Paul mentioned the difficulty of keeping the different translations in
sync; the preprocessor can help there too, if you put a version code
(a timestamp, in our case) on each version of each message.  Then the
preprocessor can warn if the English message was updated without the
timestamp being updated (hopefully after being reviewed!) on the
Italian message.
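
One pure-C flavour of such a preprocessor approach is the X-macro 
pattern (my sketch; Robert doesn't say exactly what his preprocessor 
does, and the macro names and layout here are my assumptions):

```c
/* One line per message; the whole table lives in a single macro, so
   the enum and the string array are always generated from the same
   list and cannot drift apart. */
#define MESSAGE_TABLE(X)                                \
    X(STR_HELLO_WORLD, "Hello world!", "Ciao mondo!")   \
    X(STR_HOW_ARE_YOU, "How are you?", "Come stai?")

enum string_t {
#define AS_ENUM(id, en, it) id,
    MESSAGE_TABLE(AS_ENUM)
#undef AS_ENUM
    STRING_N
};

enum lang_t { LANG_EN, LANG_IT, LANG_N };

const char *const strings[STRING_N][LANG_N] = {
#define AS_ROW(id, en, it) [id] = { en, it },
    MESSAGE_TABLE(AS_ROW)
#undef AS_ROW
};
```

Building a subset of the languages is then a matter of changing AS_ROW 
to emit only the wanted columns.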
On 09/ 6/17 02:56 AM, Robert Wessel wrote:
> On Tue, 05 Sep 2017 10:43:24 +0200, David Brown > <david.brown@hesbynett.no> wrote:
>> But rather than writing all the strings directly in C, I would keep
>> track of them in a spreadsheet saved in tab delimited format...
[snip]
> Paul mentioned the difficulty of keeping the different translations in
> sync; the preprocessor can help there too, if you put a version code
> (a timestamp, in our case) on each version of each message.
If you store your spreadsheet in plain text (CSV), your version control
system can keep track of changes for you!

-- 
Ian
On Wed, 6 Sep 2017 07:50:52 +1200, Ian Collins <ian-news@hotmail.com>
wrote:

>On 09/ 6/17 02:56 AM, Robert Wessel wrote:
[snip]
>>> Paul mentioned the difficulty of keeping the different translations in
>>> sync; the preprocessor can help there too, if you put a version code
>>> (a timestamp, in our case) on each version of each message.
>
>If you store your spreadsheet in plain text (CSV), your version control
>system can keep track of changes for you!
Well yes (and ours are text based), but I've yet to see an SCM that can
tell you that someone updated the English version of message #14, but
hasn't validated or updated the Italian one yet.  You can certainly get
a diff and manually see where changes have been made, but that still
leaves you with a manual comparison to the translations, and no way of
tracking that the validations have been done.
Robert Wessel <robertwessel2@yahoo.com> writes:
> On Wed, 6 Sep 2017 07:50:52 +1200, Ian Collins <ian-news@hotmail.com>
> wrote:
>>On 09/ 6/17 02:56 AM, Robert Wessel wrote:
[...]
>>If you store your spreadsheet in plain text (CSV), your version control
>>system can keep track of changes for you!
>
> Well yes (and ours are text based), but I've yet to see an SCM that
> can tell you that someone updated the English version of message #14,
> but hasn't validated or updated the Italian one yet.  You can
> certainly get a diff and manually see where changes have been made,
> but that still leaves you with a manual comparison to the
> translations, and no way of tracking that the validations have been
> done.
The "git blame" command tells you, for each line in a file, when that
line was most recently modified.  Other SCMs have similar tools.
I imagine you could build some tools on top of that that could warn
you, for example, that the English version of message #14 was updated
yesterday but the Italian version hasn't been changed in the last year.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
On Tue, 05 Sep 2017 15:17:56 -0700, Keith Thompson <kst-u@mib.org>
wrote:

[snip]
>The "git blame" command tells you, for each line in a file, when that
>line was most recently modified.  Other SCMs have similar tools.
>I imagine you could build some tools on top of that that could
>warn you, for example, that the English version of message #14 was
>updated yesterday but the Italian version hasn't been changed in
>the last year.
A problem is when a change needs to be made to only some translations
of a message.  Let's say the English one was awkwardly worded, and thus
modified, but (some of) the other translations don't need to be changed
(although they should probably be reviewed).

You need a way to track when a particular translation was last
validated against the intended meaning of the message (OK, let's be
blunt, the base English message), and against which version it was
validated.  So we have:

   { msg=FILENOTFOUND,v=3
     EN="File not found",m=05-09-2017,v=3
     IT="File non trovato",m=01-01-2015,v=3 }
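
The record Robert describes could be carried into C directly, e.g. (a 
sketch; the struct and field names are mine, and the stale Italian 
entry is invented for illustration):

```c
/* Each translation remembers which version of the base (English)
   message it was last validated against; a build-time check can then
   flag stale translations. */
struct translation {
    const char *text;
    int validated_against;    /* base-message version it was checked against */
};

struct message {
    int version;              /* bumped whenever the base meaning changes */
    struct translation tr[2]; /* [0] = EN, [1] = IT */
};

static const struct message msg_file_not_found = {
    .version = 3,
    .tr = {
        { "File not found",   3 },   /* EN: current */
        { "File non trovato", 2 },   /* IT: stale, needs review */
    },
};

/* Returns nonzero if the given translation is out of date. */
int translation_stale(const struct message *m, int lang)
{
    return m->tr[lang].validated_against != m->version;
}
```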
In comp.arch.embedded Paul <paul@pcserviceselectronics.co.uk> wrote:
> Whilst having index keys to strings (as Constants) may be less readable
> and could always have inline comments, it does save on storage space
> and iterative long string compares, if execution speed is also a
> problem.  Also cuts down on typos and other accidental differences
> between strings.
I wonder if you could do something using the strings themselves as
identifiers.  In other words,

   _("Hello world")

is a function called _() passed a const char *.  The first thing the
_() function does is look up that char * in a hash table to see if it's
something we've seen before.  If so, it returns a pointer to the
translated string.  If not, it matches the string against a list of
translations and inserts the pointer to the translation into the hash
table.

The tradeoff is that it's more work at runtime.  But essentially we
only have to walk the string once per run, and then all we have to do
is hash the pointer each time we use it.  That's not zero overhead, but
probably much less work than printf() is already doing (if you're using
that).  It's a bit more problematic if first-time walking the string
might be too costly on some code paths.
> Whatever you do, you need procedures to ensure all strings are
> translated into all languages for every change of any string.  The
> human part is the weak link.
gettext or another compiler technique could be used to scrape out the strings to build the translation table. You might be able to instrument that to raise an error at compile time when the extracted translations don't match the translations in the database. Theo
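
Theo's pointer-hash idea might look like this in C (a sketch; the cache 
size, hash function and linear-probing table are my assumptions, and 
the table is never evicted or resized):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SIZE 64        /* power of two, assumed > number of literals */

struct cache_entry {
    const char *key;         /* address of the untranslated literal */
    const char *translated;
};

static struct cache_entry cache[CACHE_SIZE];

/* The translation list searched on a cache miss (normally generated). */
static const struct { const char *en, *it; } translations[] = {
    { "Hello world!", "Ciao mondo!" },
    { "How are you?", "Come stai?"  },
};

static const char *slow_lookup(const char *s)
{
    for (size_t i = 0; i < sizeof translations / sizeof translations[0]; i++)
        if (strcmp(s, translations[i].en) == 0)
            return translations[i].it;
    return s;                /* no translation found: fall back to English */
}

/* Fast path: hash the *pointer*, so the string itself is only walked
   the first time a given literal is seen. */
const char *tr(const char *s)
{
    size_t h = ((uintptr_t)s >> 4) & (CACHE_SIZE - 1);
    while (cache[h].key != NULL && cache[h].key != s)
        h = (h + 1) & (CACHE_SIZE - 1);   /* linear probing */
    if (cache[h].key == NULL) {           /* first time: do the slow match */
        cache[h].key = s;
        cache[h].translated = slow_lookup(s);
    }
    return cache[h].translated;
}
```

Note the caveat in the thread: identical literals at different 
addresses (the copy/paste case Paul describes) each pay the slow path 
once and each occupy a cache slot.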
On Tue, 05 Sep 2017 17:11:40 -0500, Robert Wessel wrote:

[snip]
> Well yes (and ours are text based), but I've yet to see an SCM that can
> tell you that someone updated the English version of message #14, but
> hasn't validated or updated the Italian one yet.  You can certainly get
> a diff and manually see where changes have been made, but that still
> leaves you with a manual comparison to the translations, and no way of
> tracking that the validations have been done.
Some SCM tools allow the use of "hook scripts" - bits of code or
programs that you can hook into various SCM actions or states.  You
would need to write the scripts yourself, but it ought to be possible
to disallow a checkin on the English file if the other language files
haven't been updated.

I did a similar thing with Tortoise SVN once: for some reason we had
document source (e.g. Word) and PDF output both in the SCM system, and
a hook script would check if one was being checked in without the
other, and abort the checkin with a suitable message if one was
missing.

Regards,
Allan