
Attention: European C/C++/C#/Java Programmers - Call for Input

Started by Paul K. McKneely January 27, 2009
>> It wouldn't be much of a shock for programmers to suddenly be able
>> to use [Greek letter pi] instead of pi.
>
> As you can in Ada?
> <http://www.adaic.com/standards/05rm/html/RM-A-5.html>
>
> About the character set usable in Ada,
> see the following section in the Ada 2005 Reference Manual
> <http://www.adaic.com/standards/05rm/html/RM-2-1.html>
> and the corresponding section in the Ada 2005 Rationale
> <http://www.adaic.com/standards/05rat/html/Rat-7-5.html>.
>
> Hope this helps,
>
> Dirk
Yes, Dirk. That helps. Thanks.
> Did you miss the key point? *UNICODE*. They very specifically chose a
> *standard* for their encodings, not something incompatible and
> proprietary. In particular, it's very useful to be able to write
> comments and strings in Unicode - many modern languages allow it. If
> you had suggested using Unicode, or Latin-1, or listened to the idea
> when it was suggested, then you'd have got far more support - it's the
> idea of having a proprietary half-baked encoding that is incompatible
> with every other tool that is "incredibly stupid".
My fault for phrasing my original question badly. I should never have
mentioned the words "character set". Forget that there is an internal
encoding method used in the compiler tools for this new language, whose
codes will never be seen by its users. The programming language supports
only a subset of the complete UNICODE character set with respect to the
Western European alphabetics. The language recognizes a maximum of 254
alphanumerics (Basic Greek and Cyrillic are included) for variable names
etc., including the underscore, which is regarded as alphabetic but
ordinally precedes all others. If Western European programmers had to
choose a subset of these for language support, which ones would they be?

But I gather now that European programmers, for the most part, don't
care, because these localized characters wouldn't be used in their
programming anyway, given the interoperability problems that arise when
they are applied to source code. Since the programmers I speak of are
not interested in them, but space has been allocated for many of them,
I can take the huge tome of UNICODE characters and make the choices
myself, a naïve American :) But I will also consider other subsets (some
of which have been suggested by helpful posters) in the process of
making my final decision.

Thank you (really) for your input.

Paul
On Thu, 29 Jan 2009 09:28:09 -0600, "Paul K. McKneely"
<pkmckneely@sbcglobal.net> wrote:

>> Did you miss the key point? *UNICODE*. They very specifically chose a
>> *standard* for their encodings, not something incompatible and
>> proprietary. In particular, it's very useful to be able to write
>> comments and strings in Unicode - many modern languages allow it. If
>> you had suggested using Unicode, or Latin-1, or listened to the idea
>> when it was suggested, then you'd have got far more support - it's
>> the idea of having a proprietary half-baked encoding that is
>> incompatible with every other tool that is "incredibly stupid".
>
> My fault for phrasing my original question badly. I should never have
> mentioned the words "character set". Forget that there is an internal
> encoding method used in the compiler tools for this new language,
> whose codes will never be seen by its users. The programming language
> supports only a subset of the complete UNICODE character set with
> respect to the Western European alphabetics. The language recognizes a
> maximum of 254 alphanumerics (Basic Greek and Cyrillic are included)
> for variable names etc., including the underscore, which is regarded
> as alphabetic but ordinally precedes all others. If Western European
> programmers had to choose a subset of these for language support,
> which ones would they be?
I still do not understand why you want to use your own internal
representation instead of e.g. UTF-8. For any language using a Latin
script for identifiers, the effective string length is 1.0x, or in rare
cases 1.1x, the length of the identifier. For Cyrillic or Greek, the
ratio is 2.0. So the extra memory consumption, e.g. in compiler symbol
tables, is negligible.

Regarding linkers, UTF-8 global symbol names should not be a problem,
unless the object language uses the 8th bit for some kind of signaling
(such as end of string) or otherwise limits the valid bit combinations.

Of course the UTF-8 encoding may increase the identifier length, but at
least for a linker that usually examines only a specific number of
bytes, such as 32, the only risk is that two identifiers are not unique
within 32 bytes, i.e. 16 characters in Greek or Cyrillic or 10 graphs in
some East-Asian script.

Paul
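As a minimal sketch of the byte-length rules behind those ratios (the
helper name utf8_len is illustrative, not anything from the thread):

    #include <stdio.h>
    #include <stdint.h>

    /* Bytes a Unicode code point occupies in UTF-8: Latin letters
       (U+0000..U+007F) take 1 byte, Greek and Cyrillic letters take
       2, which is where the 1.0x and 2.0x ratios come from. */
    static int utf8_len(uint32_t cp)
    {
        if (cp < 0x80)    return 1;
        if (cp < 0x800)   return 2;
        if (cp < 0x10000) return 3;
        return 4;         /* up to U+10FFFF */
    }

    int main(void)
    {
        printf("U+0061 Latin 'a':   %d byte(s)\n", utf8_len(0x0061)); /* 1 */
        printf("U+03C0 Greek pi:    %d byte(s)\n", utf8_len(0x03C0)); /* 2 */
        printf("U+044F Cyrillic ya: %d byte(s)\n", utf8_len(0x044F)); /* 2 */
        return 0;
    }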
> I still do not understand why you want to use your own internal
> representation instead of e.g. UTF-8. For any language using a Latin
> script for identifiers, the effective string length is 1.0x, or in
> rare cases 1.1x, the length of the identifier. For Cyrillic or Greek,
> the ratio is 2.0.
Simply encoding a kazillion different characters is not the whole
picture. As Boudewijn Dijkstra pointed out, trying to alphabetize all of
the potential UNICODE variables is impossible. (Those are his words, not
mine, and the ramifications go far beyond just this issue.) So how do
you alphabetize, search and list on an unwieldy character set for many
purposes, such as showing a listing to the programmer in his tool chain?
That is not to mention that 21 bits (or 32 bits) are already used up in
just the character's code. The new programming language supports fonts,
color (foreground and background), attributes, size etc. Do you think it
is a good idea to have to expand these basic character codes to
64/96/128 or even 256 bits in width just to cram it all in? The web
people would want to encode it all in ASCII HTML-style tags, which I
think is a really bad idea.

The overwhelming consensus among responders to these threads is that
they are not going to use anything beyond ASCII anyway. And with all of
this text stuff, you haven't even begun to talk about how you are going
to achieve all of the very advanced (and very difficult) stuff in the
programming language (much of which hasn't ever been done before) while
carrying this huge load of excess baggage on your back.

I needed to define some additional characters that weren't in ASCII (and
aren't in UNICODE) for the purposes of the programming language (which
predates UNICODE and UTF-8, BTW). Citing the additional characters in
APL as the downfall of that language is not well founded, in light of
the fact that, when it came out, you had to put out a couple of thousand
dollars for a hard-wired specialized terminal just to program in that
language. That is besides the fact that it was not designed for the
kinds of things that I want to do (such as writing operating systems and
device drivers). Do you see my point(s)?

Simple, lean and mean, but more powerful than anything we have now. That
is what I am shooting for. When symbols need to be converted to whatever
format when object files are produced, that's where the necessary
conversions will be done. This will keep the core of the tools much
simpler (and smaller, and faster) so that the whole project won't
collapse when I try to do the really difficult things that were the
primary goals I started out to accomplish in the first place.
> So the extra memory consumption, e.g. in compiler symbol tables, is
> negligible.
>
> Regarding linkers, UTF-8 global symbol names should not be a problem,
> unless the object language uses the 8th bit for some kind of signaling
> (such as end of string) or otherwise limits the valid bit
> combinations.
>
> Of course the UTF-8 encoding may increase the identifier length, but
> at least for a linker that usually examines only a specific number of
> bytes, such as 32, the only risk is that two identifiers are not
> unique within 32 bytes, i.e. 16 characters in Greek or Cyrillic or 10
> graphs in some East-Asian script.
>
> Paul
I do want you to know that I very much appreciate your input. This issue
about object formats supporting UNICODE is going to be a real help when
it comes time to generate machine code.
Paul K. McKneely wrote:

> Simply encoding a kazillion different characters is not the whole
> picture. As Boudewijn Dijkstra pointed out, trying to alphabetize all
> of the potential UNICODE variables is impossible. (Those are his
> words, not mine, and the ramifications go far beyond just this issue.)
> So how do you alphabetize, search and list on an unwieldy character
> set for many purposes, such as showing a listing to the programmer in
> his tool chain? That is not to mention that 21 bits (or 32 bits) are
> already used up in just the character's code. The new programming
> language supports fonts, color (foreground and background),
> attributes, size etc. Do you think it is a good idea to have to expand
> these basic character codes to 64/96/128 or even 256 bits in width
> just to cram it all in?
If you want more colors, font sizes etc., one idea might be to use
something like TeX, e.g. as is possible with literate programming. An
example of how it looks:

http://www.literateprogramming.com/adventure.pdf

Simpler to type might be a more formal language, e.g. like Fortress:

http://projectfortress.sun.com/Projects/Community/wiki/MathSyntaxInFortress
> Simple, lean and mean, but more powerful than anything we have now.
> That is what I am shooting for. When symbols need to be converted to
> whatever format when object files are produced, that's where the
> necessary conversions will be done. This will keep the core of the
> tools much simpler (and smaller, and faster) so that the whole project
> won't collapse when I try to do the really difficult things that were
> the primary goals I started out to accomplish in the first place.
This sounds interesting. Can you say more about your ideas? It might be
nice for some programmers to be able to use all Unicode characters for
identifiers or comments, but the more important part is the architecture
of the language.

-- 
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Hi there,

Paul K. McKneely wrote:
> Simply encoding a kazillion different characters is not the whole
> picture. As Boudewijn Dijkstra pointed out, trying to alphabetize all
> of the potential UNICODE variables is impossible. (Those are his
> words, not mine, and the ramifications go far beyond just this issue.)
> So how do you alphabetize, search and list on an unwieldy character
> set for many purposes, such as showing a listing to the programmer in
> his tool chain?
Showing a listing: you just need a font that has all the characters.
When I need to look up something in Unicode, I start by opening
charmap.exe and selecting Lucida Sans Unicode.

Sorting? 100% correct sorting and case-folding is locale-dependent
anyway. So you sort the locale characters in a locale-dependent way, and
the others by their Unicode number. This usually gives a sensible
result. Alternatively, invent some kind of sort (such as "sort all
accented characters after their base characters"). And if you're only
sorting them for your internal symbol management, which users don't ever
get to see, use Unicode numbers.

Case-folding? Case-folding tables are very well compressible. I use a
table with about a dozen entries of the form

    struct {
        uint16_t FirstLowercaseCharacter;
        uint16_t FirstUppercaseCharacter;
        uint16_t NumberOfCharacters;
        uint16_t DistanceOfCharacters;
    }

to case-fold a repertoire of, I think, over a thousand characters.
Entries are things like { 0x61, 0x41, 26, 1 } for ASCII, and something
like { 0x100, 0x101, 33, 2 } for the first half of Latin Extended-A. In
total, much less data than a case-folding table for DOS codepage 437.

What'll make it really complicated is composing/decomposing characters
from their accents and the base character...
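To make the compression scheme concrete, here is a sketch of how such a
run table might be consulted. The function fold_to_upper, the run
lengths and the exact Latin Extended-A entry are my own illustration,
not Stefan's actual code:

    #include <stdint.h>

    /* Each entry describes an arithmetic run of lower/upper pairs. */
    struct FoldRange {
        uint16_t FirstLowercaseCharacter;
        uint16_t FirstUppercaseCharacter;
        uint16_t NumberOfCharacters;    /* length of the run             */
        uint16_t DistanceOfCharacters;  /* step between successive pairs */
    };

    /* Two illustrative entries; a real table has about a dozen. */
    static const struct FoldRange folds[] = {
        { 0x0061, 0x0041, 26, 1 },  /* ASCII a..z -> A..Z              */
        { 0x0101, 0x0100, 32, 2 },  /* Latin Extended-A: upper/lower
                                       pairs alternate, hence step 2   */
    };

    /* Fold a lowercase code point to uppercase; return the input
       unchanged if no table entry covers it. */
    static uint16_t fold_to_upper(uint16_t c)
    {
        for (unsigned i = 0; i < sizeof folds / sizeof folds[0]; i++) {
            const struct FoldRange *f = &folds[i];
            uint16_t span = (uint16_t)((f->NumberOfCharacters - 1)
                                       * f->DistanceOfCharacters);
            if (c >= f->FirstLowercaseCharacter &&
                c <= f->FirstLowercaseCharacter + span &&
                (c - f->FirstLowercaseCharacter)
                    % f->DistanceOfCharacters == 0)
                return (uint16_t)(c - f->FirstLowercaseCharacter
                                  + f->FirstUppercaseCharacter);
        }
        return c;
    }

For example, fold_to_upper(0x0103) ('a' with breve) lands on the second
table entry and returns 0x0102 ('A' with breve).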
> That is not to mention that 21 bits (or 32 bits) are already used up
> in just the character's code. The new programming language supports
> fonts, color (foreground and background), attributes, size etc. Do you
> think it is a good idea to have to expand these basic character codes
> to 64/96/128 or even 256 bits in width just to cram it all in?
Depends on what you want to achieve, and at what point you'd manipulate
these attributes. Using control characters ("escape sequences") would be
one sensible approach. Using extents (an additional data item added to
the string that says "characters 20 to 30 are red") is another. Both
also give the possibility to add parameters to your attributes, such as
font names or link targets. I use both approaches regularly.
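A sketch of what the extent approach might look like as data structures;
all names and the field layout here are illustrative, not Stefan's:

    #include <stdio.h>
    #include <stdint.h>

    /* The text stays a plain byte string; styling lives in a side
       table of ranges that style-unaware tools can simply ignore. */
    struct Extent {
        uint32_t start;     /* first character index covered       */
        uint32_t end;       /* one past the last character covered */
        uint32_t fg_rgb;    /* foreground color, 0xRRGGBB          */
        uint32_t bg_rgb;    /* background color, 0xRRGGBB          */
    };

    int main(void)
    {
        const char *text = "the text itself stays completely unmodified";
        /* "characters 20 to 30 are red" becomes one table entry: */
        struct Extent style[] = { { 20, 31, 0xFF0000, 0xFFFFFF } };

        printf("text: %s\n", text);
        printf("red run: chars %u..%u\n", style[0].start, style[0].end - 1);
        return 0;
    }

The attraction is that compilers, linkers and diff tools that only care
about the characters never have to see the style table at all.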
> I needed to define some additional characters that weren't in ASCII
> (and aren't in UNICODE)
Sure? The Unicode book is thick :-)

Stefan
On Thu, 29 Jan 2009 18:21:04 +0100, Paul K. McKneely
<pkmckneely@sbcglobal.net> wrote:
> So how do you alphabetize, search and list on an unwieldy character
> set for many purposes, such as showing a listing to the programmer in
> his tool chain? That is not to mention that 21 bits (or 32 bits) are
> already used up in just the character's code. The new programming
> language supports fonts, color (foreground and background),
> attributes, size etc. Do you think it is a good idea to have to expand
> these basic character codes to 64/96/128 or even 256 bits in width
> just to cram it all in?
If you are going to encode all this formatting information on a per-character basis, you are going to have a lot of redundant information, which would make compression a given. Then why not go all the way and encode a 32-bit Unicode character, 24-bit foreground and background, etc. in 128 or 256 bits?
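For concreteness, one plausible packing of the 128-bit cell described
above; the field split is my own guess, not a proposal from the thread
(and bitfield layout is implementation-defined):

    #include <stdio.h>
    #include <stdint.h>

    struct CharCell {
        uint32_t codepoint;       /* 32-bit Unicode scalar value */
        uint32_t fg        : 24;  /* foreground, 0xRRGGBB        */
        uint32_t bold      : 1;
        uint32_t italic    : 1;
        uint32_t underline : 1;
        uint32_t           : 5;   /* spare attribute bits        */
        uint32_t bg        : 24;  /* background, 0xRRGGBB        */
        uint32_t size      : 8;   /* point size                  */
        uint32_t font_id;         /* index into a font table     */
    };

    int main(void)
    {
        /* 4 x 32 bits = 128 bits per character cell on common ABIs */
        printf("%zu bytes per cell\n", sizeof(struct CharCell));
        return 0;
    }

Since long runs of text share identical attribute words, even a trivial
run-length compressor would collapse most of the redundancy, which is
the point made above.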
> The web people would want to encode it all in ASCII HTML-style tags,
> which I think is a really bad idea.
Why? Most office suites have decent HTML export functionality. Various
HTML editors are available. HTML and XML are popular well beyond
web-type applications.
> I needed to define some additional characters that weren't in ASCII
> (and aren't in UNICODE) for the purposes of the programming language
> (which predates UNICODE and UTF-8, BTW)
I am curious to know which language that was, and which characters they
are.

-- 
Created with Opera's revolutionary e-mail program: http://www.opera.com/mail/
On Thu, 29 Jan 2009 09:28:09 -0600, "Paul K. McKneely"
<pkmckneely@sbcglobal.net> wrote:

> My fault for phrasing my original question badly. I should never have
> mentioned the words "character set". Forget that there is an internal
> encoding method used in the compiler tools for this new language,
> whose codes will never be seen by its users. The programming language
> supports only a subset of the complete UNICODE character set with
> respect to the Western European alphabetics.
There's probably more to be gained in the long term by sticking with a
current standard of encoding. I say this because the real
internationalisation issues are not in the character set, but in
translation and display. Western Europe is the least of your problems,
without even considering right-to-left display.

When you internationalise an application, even an embedded one, a
standard process is to send your text to be translated from English (7
bit ASCII plus a few specials) to your dealer, who translates the
messages into his/her language and sends it back to you. Because you're
writing a program for humans to use, you include things like dates,
times, and currency. These all vary in format across the world. In
addition, parameter order will vary in different spoken languages.

On a PC, consider what happens when a program written in English by
South Africans (three languages in daily use in the office) is run in
Hong Kong on a PC with a Chinese operating system, but for use by a
Russian engineer who wants his package to display Cyrillic (several
encodings available). This scenario has been seen in the wild. One
customer of ours supports 17 different spoken languages in multiple
encodings.

For most embedded systems it's not as extreme as that, especially as
there's usually no operating system. Despite this, having to support
several languages, even within the same country, is normal. We still
have to support varying display orders in embedded systems.

Stephen
-- 
Stephen Pelc, stephenXXX@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads
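The point above about parameter order is the classic motivation for
positional format arguments. A sketch using the %n$ printf notation,
which is a POSIX extension rather than ISO C; the message strings are
invented examples:

    #include <stdio.h>

    int main(void)
    {
        /* A translator can reorder the parameters in the format
           string without any change to the calling code. */
        const char *english = "%1$s has %2$d new messages\n";
        const char *other   = "%2$d new messages for %1$s\n";
        printf(english, "Anna", 5);
        printf(other,   "Anna", 5);
        return 0;
    }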
Paul K. McKneely wrote:
>> I still do not understand why you want to use your own internal
>> representation instead of e.g. UTF-8. For any language using a Latin
>> script for identifiers, the effective string length is 1.0x, or in
>> rare cases 1.1x, the length of the identifier. For Cyrillic or Greek,
>> the ratio is 2.0.
>
I would suggest you start by giving up on all your thoughts of specific
character sets. Simply make a straight decision now - you will use
UTF-8. No other encodings - no Latin-1, no UTF-16, no home-made
character sets, no extra fonts. Take it as a fixed decision and work
with it for a few days to see how it fits your needs. Look at existing
tools and source code that support UTF-8, and see how it can make your
work easier and give a result that users might actually be able to
*use*.

If you really put in this effort and find that UTF-8 does not fit your
needs, what have you lost? A couple of days' work here is a drop in the
ocean compared to the man-years it will take to work with your home-made
encoding, and you will at least have the benefit of a better
understanding of your problem. You might even be able to explain it to
other people in a way that makes sense.
> Simply encoding a kazillion different characters is not the whole
> picture. As Boudewijn Dijkstra pointed out, trying to alphabetize all
> of the potential UNICODE variables is impossible. (Those are his
> words, not mine, and the ramifications go far beyond just this issue.)
> So how do you alphabetize, search and list on an unwieldy character
> set for many purposes, such as showing a listing to the programmer
If you need to alphabetize, there should be no shortage of existing
library routines for sorting in UTF-8. It's not easy - differences in
locales can cause endless troubles, so you might not get a perfect
solution. But you'll find something that does a reasonable job and
*will* work perfectly for most programmers who stick to ASCII
identifiers.

A related problem is if you are making identifiers case-insensitive -
it's hard to figure out cases for non-ASCII characters. So stick to
case-sensitive identifiers.
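The "reasonable job" option can be as simple as byte-wise comparison,
since UTF-8 was designed so that byte order matches code-point order. A
sketch (my own, not code from the thread):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Byte-wise strcmp on UTF-8 sorts identifiers in Unicode
       code-point order: not locale-aware, but deterministic, and
       identical to plain ASCII ordering for ASCII-only names. */
    static int cmp_utf8(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    int main(void)
    {
        const char *ids[] = { "zeta", "alpha", "\u03C0_radius" };
        qsort(ids, 3, sizeof ids[0], cmp_utf8);
        for (int i = 0; i < 3; i++)
            puts(ids[i]);   /* alpha, zeta, then the Greek name */
        return 0;
    }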
> in his tool chain? That is not to mention that 21 bits (or 32 bits)
> are already used up in just the character's code.
I have no clue as to what you are talking about here.
> The new programming language supports fonts, color (foreground and
> background), attributes, size etc. Do you think it is a good idea to
> have to expand these basic character codes to 64/96/128 or even 256
> bits in width just to cram it all in? The web people would want to
> encode it all in ASCII HTML-style tags, which I think is a really bad
> idea.
Are you suggesting that you are including font, colour, etc., directly in the source code? And here was me thinking that a proprietary character encoding was an "amazingly stupid idea".
> The overwhelming consensus among responders to these threads is that
> they are not going to use anything beyond ASCII anyway. And with all
> of this text stuff, you haven't even begun to talk about how you are
> going to achieve all of the very advanced (and very difficult) stuff
> in the programming language (much of which hasn't ever been done
> before) while carrying this huge load of excess baggage
Who is "you" who are going to achieve all this? Do you mean the developers of the tools (i.e., you and your colleagues), or do you mean your users? And if it is us potential users, what is this "very advanced stuff" you are talking about? If we knew the specific aims of your language - what it is that makes it better than existing alternatives - it would be easier to advise you.
> on your back. I needed to define some additional characters that
> weren't in ASCII (and aren't in UNICODE) for the purposes of the
> programming language (which predates UNICODE and UTF-8, BTW). Citing
> the additional
First off, you do *not* need to define additional characters. It's conceivable that your tools might *benefit* from additional characters (although, as I said, we know nothing about your tools). But they don't *need* them. Secondly, Unicode has openings for additional domain-specific characters - you can add them without losing all the other benefits of Unicode (of course, you'll have to provide a suitable font).
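For reference, the openings mentioned above are Unicode's Private Use
Areas (U+E000..U+F8FF in the Basic Multilingual Plane, plus two
supplementary planes). A sketch of assigning a hypothetical custom
character there and encoding it in standard UTF-8 (the code point
assignment is invented):

    #include <stdio.h>
    #include <stdint.h>

    #define MY_OPERATOR 0xE000u   /* illustrative PUA assignment */

    /* Encode one Unicode scalar value as UTF-8; returns byte count. */
    static int encode_utf8(uint32_t cp, unsigned char out[4])
    {
        if (cp < 0x80)    { out[0] = (unsigned char)cp; return 1; }
        if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                            out[1] = 0x80 | (cp & 0x3F); return 2; }
        if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                            out[1] = 0x80 | ((cp >> 6) & 0x3F);
                            out[2] = 0x80 | (cp & 0x3F); return 3; }
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }

    int main(void)
    {
        unsigned char buf[4];
        int n = encode_utf8(MY_OPERATOR, buf);
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);   /* prints: EE 80 80 */
        putchar('\n');
        return 0;
    }

Any UTF-8-aware tool will pass such a character through untouched; only
the display needs a font that covers the private code point.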
> characters in APL as the downfall of that language is not well
> founded, in light of the fact that, when it came out, you had to put
> out a couple of thousand dollars for a hard-wired specialized terminal
> just to program in that language. That is besides the fact that it was
> not designed for the kinds of things that I want to do (such as
> writing operating systems and device drivers). Do you see my point(s)?
>
No, I don't see your point at all. It reads as though you are saying
APL's lack of popularity was due not to its extra characters, but to its
need for an expensive specialised terminal (which was solely because of
its special characters).

The main reason for APL's lack of popularity *is* the special
characters. Even though you no longer need special hardware (you use a
specialised keyboard map and extra fonts), the characters make the
language impossible to read and understand for the non-expert, and
extremely slow to enter expressions in. It is *vastly* easier to write,
for example, "range(R)" than "⍳R", because you don't have to find the
special character. It is also *vastly* easier to read, pronounce and
understand "range(R)" than "⍳R", even if you have never used the
language in question (Python).

To take an example from wikipedia's APL page, here is an expression to
give a list of prime numbers up to R:

    (~R∊R∘.×R)/R←1↓⍳R

The direct Python translation would be:

    [p for p in range(2, R+1)
       if not p in [x*y for x in range(2, R+1)
                        for y in range(2, R+1)]]

The APL version is certainly shorter - but it is nevertheless slower and
harder to write. APL's power and conciseness come from the power of its
built-in functions, not from the fact that most of them have a single
weird symbol instead of a multi-character name.
> Simple, lean and mean, but more powerful than anything we have now.
> That is what I am shooting for. When symbols need to be converted to
> whatever format when object files are produced, that's where the
> necessary conversions will be done. This will keep the core of the
> tools much simpler (and smaller, and faster) so that the whole project
> won't collapse when I try to do the really difficult things that were
> the primary goals I started out to accomplish in the first place.
>
>> So the extra memory consumption, e.g. in compiler symbol tables, is
>> negligible.
>>
>> Regarding linkers, UTF-8 global symbol names should not be a problem,
>> unless the object language uses the 8th bit for some kind of
>> signaling (such as end of string) or otherwise limits the valid bit
>> combinations.
>>
>> Of course the UTF-8 encoding may increase the identifier length, but
>> at least for a linker that usually examines only a specific number of
>> bytes, such as 32, the only risk is that two identifiers are not
>> unique within 32 bytes, i.e. 16 characters in Greek or Cyrillic or 10
>> graphs in some East-Asian script.
>>
>> Paul
>
> I do want you to know that I very much appreciate your input. This
> issue about object formats supporting UNICODE is going to be a real
> help when it comes time to generate machine code.
"Stephen Pelc" <stephenXXX@mpeforth.com> wrote in message 
news:498200b7.620856320@192.168.0.50...
> On a PC, consider what happens when a program written in English by
> South Africans (three languages in daily use in the office)
Oh really? Where in RSA, and what languages? (English, Afrikaans &
isiXhosa/isiZulu/Setswana...?) My wife and I were there in
September-October for almost 3 weeks. We did a loop from Cape Town to
Calvinia, Beaufort West, down to Oudtshoorn, Knysna and back.