
Attention: European C/C++/C#/Java Programmers - Call for Input

Started by Paul K. McKneely January 27, 2009
>> It wouldn't be much of a shock for programmers to suddenly be able
>> to use [Greek letter pi] instead of pi.
>
> As you can in Ada?
> <http://www.adaic.com/standards/05rm/html/RM-A-5.html>
>
> About the character set usable in Ada,
> see the following section in the Ada 2005 Reference Manual
> <http://www.adaic.com/standards/05rm/html/RM-2-1.html>
> and the corresponding section in the Ada 2005 Rationale
> <http://www.adaic.com/standards/05rat/html/Rat-7-5.html>.
>
> Hope this helps,
>
> Dirk
Yes, Dirk. That helps. Thanks.
> Did you miss the key point? *UNICODE*. They very specifically chose a
> *standard* for their encodings, not something incompatible and
> proprietary. In particular, it's very useful to be able to write
> comments and strings in Unicode - many modern languages allow it. If
> you had suggested using Unicode, or Latin-1, or listened to the idea
> when it was suggested, then you'd have got far more support - it's the
> idea of having a proprietary half-baked encoding that is incompatible
> with every other tool that is "incredibly stupid".
My fault for phrasing my original question badly. I should never have
mentioned the words "character set". Forget that there is an internal
encoding method used in the compiler tools for this new language, whose
codes will never be seen by its users. The programming language supports
only a subset of the complete UNICODE character set with respect to the
Western European alphabetics. The language recognizes a maximum of 254
alphanumerics (Basic Greek and Cyrillic are included) for variable names
etc., including the underscore, which is regarded as alphabetic but
ordinally precedes all others. If Western European programmers had to
choose a subset of these for language support, which ones would they be?

But I gather now that European programmers, for the most part, don't
care, because these localized characters wouldn't be used in their
programming anyway, given the interoperability problems that arise when
they are applied to source code. Since the programmers I speak of are
not interested in them, but space has been allocated for many of them,
I can take the huge tome of UNICODE characters and make the choices
myself, a naïve American :) But I will also consider other subsets (some
of which have been suggested by helpful posters) in the process of
making my final decision.

Thank you (really) for your input.

Paul
On Thu, 29 Jan 2009 09:28:09 -0600, "Paul K. McKneely"
<pkmckneely@sbcglobal.net> wrote:

>> Did you miss the key point? *UNICODE*. They very specifically chose a
>> *standard* for their encodings, not something incompatible and
>> proprietary. In particular, it's very useful to be able to write
>> comments and strings in Unicode - many modern languages allow it. If
>> you had suggested using Unicode, or Latin-1, or listened to the idea
>> when it was suggested, then you'd have got far more support - it's
>> the idea of having a proprietary half-baked encoding that is
>> incompatible with every other tool that is "incredibly stupid".
>
> My fault for phrasing my original question badly. I should never have
> mentioned the words "character set". Forget that there is an internal
> encoding method used in the compiler tools for this new language,
> whose codes will never be seen by its users. The programming language
> supports only a subset of the complete UNICODE character set with
> respect to the Western European alphabetics. The language recognizes a
> maximum of 254 alphanumerics (Basic Greek and Cyrillic are included)
> for variable names etc., including the underscore, which is regarded
> as alphabetic but ordinally precedes all others. If Western European
> programmers had to choose a subset of these for language support,
> which ones would they be?
I still do not understand why you want to use your own internal
representation instead of e.g. UTF-8. For any language using a Latin
script for identifiers, the effective string length is 1.0x, or in rare
cases 1.1x, the length of the identifier. For Cyrillic or Greek, the
ratio is 2.0. So the extra memory consumption, e.g. in compiler symbol
tables, is negligible.

Regarding linkers, UTF-8 global symbol names should not be a problem,
unless the object language uses the 8th bit for some kind of signaling
(such as end of string) or otherwise limits the valid bit combinations.

Of course the UTF-8 encoding may increase the identifier length, but at
least for a linker that usually examines only a specific number of
bytes, such as 32, the only risk is that two identifiers are not unique
within 32 bytes, i.e. 16 characters in Greek or Cyrillic or 10 graphs in
some East-Asian script.

Paul
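As a minimal sketch of the byte-length rules behind those ratios (the
helper name utf8_len is illustrative, not anything from the thread):

    #include <stdio.h>
    #include <stdint.h>

    /* Bytes a Unicode code point occupies in UTF-8: Latin letters
       (U+0000..U+007F) take 1 byte, Greek and Cyrillic letters take
       2, which is where the 1.0x and 2.0x ratios come from. */
    static int utf8_len(uint32_t cp)
    {
        if (cp < 0x80)    return 1;
        if (cp < 0x800)   return 2;
        if (cp < 0x10000) return 3;
        return 4;         /* up to U+10FFFF */
    }

    int main(void)
    {
        printf("U+0061 Latin 'a':   %d byte(s)\n", utf8_len(0x0061)); /* 1 */
        printf("U+03C0 Greek pi:    %d byte(s)\n", utf8_len(0x03C0)); /* 2 */
        printf("U+044F Cyrillic ya: %d byte(s)\n", utf8_len(0x044F)); /* 2 */
        return 0;
    }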
> I still do not understand why you want to use your own internal
> representation instead of e.g. UTF-8. For any language using a Latin
> script for identifiers, the effective string length is 1.0x, or in
> rare cases 1.1x, the length of the identifier. For Cyrillic or Greek,
> the ratio is 2.0.
Simply encoding a kazillion different characters is not the whole
picture. As Boudewijn Dijkstra pointed out, trying to alphabetize all of
the potential UNICODE variables is impossible. (Those are his words, not
mine, and the ramifications go far beyond just this issue.) So how do
you alphabetize, search and list on an unwieldy character set for many
purposes, such as showing a listing to the programmer in his tool chain?
That is not to mention that 21 bits (or 32 bits) are already used up in
just the character's code. The new programming language supports fonts,
color (foreground and background), attributes, size etc. Do you think it
is a good idea to have to expand these basic character codes to
64/96/128 or even 256 bits in width just to cram it all in? The web
people would want to encode it all in ASCII HTML-style tags, which I
think is a really bad idea.

The overwhelming consensus among responders to these threads is that
they are not going to use anything beyond ASCII anyway. And with all of
this text stuff, you haven't even begun to talk about how you are going
to achieve all of the very advanced (and very difficult) stuff in the
programming language (much of which hasn't ever been done before) while
carrying this huge load of excess baggage on your back.

I needed to define some additional characters that weren't in ASCII (and
aren't in UNICODE) for the purposes of the programming language (which
predates UNICODE and UTF-8, BTW). Citing the additional characters in
APL as the downfall of that language is not well founded, in light of
the fact that, when it came out, you had to put out a couple of thousand
dollars for a hard-wired specialized terminal just to program in that
language. That is besides the fact that it was not designed for the
kinds of things that I want to do (such as writing operating systems and
device drivers). Do you see my point(s)?

Simple, lean and mean, but more powerful than anything we have now. That
is what I am shooting for. When symbols need to be converted to whatever
format when object files are produced, that's where the necessary
conversions will be done. This will keep the core of the tools much
simpler (and smaller, and faster) so that the whole project won't
collapse when I try to do the really difficult things that were the
primary goals I started out to accomplish in the first place.
> So the extra memory consumption, e.g. in compiler symbol tables, is
> negligible.
>
> Regarding linkers, UTF-8 global symbol names should not be a problem,
> unless the object language uses the 8th bit for some kind of signaling
> (such as end of string) or otherwise limits the valid bit
> combinations.
>
> Of course the UTF-8 encoding may increase the identifier length, but
> at least for a linker that usually examines only a specific number of
> bytes, such as 32, the only risk is that two identifiers are not
> unique within 32 bytes, i.e. 16 characters in Greek or Cyrillic or 10
> graphs in some East-Asian script.
>
> Paul
I do want you to know that I very much appreciate your input. This issue
about object formats supporting UNICODE is going to be a real help when
it comes time to generate machine code.
Paul K. McKneely wrote:

> Simply encoding a kazillion different characters is not the whole
> picture. As Boudewijn Dijkstra pointed out, trying to alphabetize all
> of the potential UNICODE variables is impossible. (Those are his
> words, not mine, and the ramifications go far beyond just this issue.)
> So how do you alphabetize, search and list on an unwieldy character
> set for many purposes, such as showing a listing to the programmer in
> his tool chain? That is not to mention that 21 bits (or 32 bits) are
> already used up in just the character's code. The new programming
> language supports fonts, color (foreground and background),
> attributes, size etc. Do you think it is a good idea to have to expand
> these basic character codes to 64/96/128 or even 256 bits in width
> just to cram it all in?
If you want more colors, font sizes etc., one idea might be to use
something like TeX, e.g. as is possible with literate programming. An
example of how it looks:

http://www.literateprogramming.com/adventure.pdf

Simpler to type might be a more formal language, e.g. like Fortress:

http://projectfortress.sun.com/Projects/Community/wiki/MathSyntaxInFortress
> Simple, lean and mean, but more powerful than anything we have now.
> That is what I am shooting for. When symbols need to be converted to
> whatever format when object files are produced, that's where the
> necessary conversions will be done. This will keep the core of the
> tools much simpler (and smaller, and faster) so that the whole project
> won't collapse when I try to do the really difficult things that were
> the primary goals I started out to accomplish in the first place.
This sounds interesting. Can you say more about your ideas? It might be
nice for some programmers to be able to use all Unicode characters for
identifiers or comments, but the more important part is the architecture
of the language.

-- 
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Hi there,

Paul K. McKneely wrote:
> Simply encoding a kazillion different characters is not the whole
> picture. As Boudewijn Dijkstra pointed out, trying to alphabetize all
> of the potential UNICODE variables is impossible. (Those are his
> words, not mine, and the ramifications go far beyond just this issue.)
> So how do you alphabetize, search and list on an unwieldy character
> set for many purposes, such as showing a listing to the programmer in
> his tool chain?
Showing a listing: you just need a font that has all the characters.
When I need to look up something in Unicode, I start by opening
charmap.exe and selecting Lucida Sans Unicode.

Sorting? 100% correct sorting and case-folding is locale-dependent
anyway. So you sort the locale characters in a locale-dependent way, and
the others by their Unicode number. This usually gives a sensible
result. Alternatively, invent some kind of sort (such as "sort all
accented characters after their base characters"). And if you're only
sorting them for your internal symbol management, which users don't ever
get to see, use Unicode numbers.

Case-folding? Case-folding tables are very well compressible. I use a
table with about a dozen entries of the form

    struct {
        uint16_t FirstLowercaseCharacter;
        uint16_t FirstUppercaseCharacter;
        uint16_t NumberOfCharacters;
        uint16_t DistanceOfCharacters;
    }

to case-fold a repertoire of, I think, over a thousand characters.
Entries are things like { 0x61, 0x41, 26, 1 } for ASCII, and something
like { 0x100, 0x101, 33, 2 } for the first half of Latin Extended-A. In
total, much less data than a case-folding table for DOS codepage 437.

What'll make it really complicated is composing/decomposing characters
from their accents and the base character...
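To make the compression scheme concrete, here is a sketch of how such a
run table might be consulted. The function fold_to_upper, the run
lengths and the exact Latin Extended-A entry are my own illustration,
not Stefan's actual code:

    #include <stdint.h>

    /* Each entry describes an arithmetic run of lower/upper pairs. */
    struct FoldRange {
        uint16_t FirstLowercaseCharacter;
        uint16_t FirstUppercaseCharacter;
        uint16_t NumberOfCharacters;    /* length of the run             */
        uint16_t DistanceOfCharacters;  /* step between successive pairs */
    };

    /* Two illustrative entries; a real table has about a dozen. */
    static const struct FoldRange folds[] = {
        { 0x0061, 0x0041, 26, 1 },  /* ASCII a..z -> A..Z              */
        { 0x0101, 0x0100, 32, 2 },  /* Latin Extended-A: upper/lower
                                       pairs alternate, hence step 2   */
    };

    /* Fold a lowercase code point to uppercase; return the input
       unchanged if no table entry covers it. */
    static uint16_t fold_to_upper(uint16_t c)
    {
        for (unsigned i = 0; i < sizeof folds / sizeof folds[0]; i++) {
            const struct FoldRange *f = &folds[i];
            uint16_t span = (uint16_t)((f->NumberOfCharacters - 1)
                                       * f->DistanceOfCharacters);
            if (c >= f->FirstLowercaseCharacter &&
                c <= f->FirstLowercaseCharacter + span &&
                (c - f->FirstLowercaseCharacter)
                    % f->DistanceOfCharacters == 0)
                return (uint16_t)(c - f->FirstLowercaseCharacter
                                  + f->FirstUppercaseCharacter);
        }
        return c;
    }

For example, fold_to_upper(0x0103) ('a' with breve) lands on the second
table entry and returns 0x0102 ('A' with breve).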
> That is not to mention that 21 bits (or 32 bits) are already used up
> in just the character's code. The new programming language supports
> fonts, color (foreground and background), attributes, size etc. Do you
> think it is a good idea to have to expand these basic character codes
> to 64/96/128 or even 256 bits in width just to cram it all in?
Depends on what you want to achieve, and at what point you'd manipulate
these attributes. Using control characters ("escape sequences") would be
one sensible approach. Using extents (an additional data item added to
the string that says "characters 20 to 30 are red") is another. Both
also give the possibility to add parameters to your attributes, such as
font names or link targets. I use both approaches regularly.
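A sketch of what the extent approach might look like as data structures;
all names and the field layout here are illustrative, not Stefan's:

    #include <stdio.h>
    #include <stdint.h>

    /* The text stays a plain byte string; styling lives in a side
       table of ranges that style-unaware tools can simply ignore. */
    struct Extent {
        uint32_t start;     /* first character index covered       */
        uint32_t end;       /* one past the last character covered */
        uint32_t fg_rgb;    /* foreground color, 0xRRGGBB          */
        uint32_t bg_rgb;    /* background color, 0xRRGGBB          */
    };

    int main(void)
    {
        const char *text = "the text itself stays completely unmodified";
        /* "characters 20 to 30 are red" becomes one table entry: */
        struct Extent style[] = { { 20, 31, 0xFF0000, 0xFFFFFF } };

        printf("text: %s\n", text);
        printf("red run: chars %u..%u\n", style[0].start, style[0].end - 1);
        return 0;
    }

The attraction is that compilers, linkers and diff tools that only care
about the characters never have to see the style table at all.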
> I needed to define some additional characters that weren't in ASCII
> (and aren't in UNICODE)
Sure? The Unicode book is thick :-)

Stefan
On Thu, 29 Jan 2009 18:21:04 +0100, Paul K. McKneely
<pkmckneely@sbcglobal.net> wrote:
> So how do you alphabetize, search and list on an unwieldy character
> set for many purposes, such as showing a listing to the programmer in
> his tool chain? That is not to mention that 21 bits (or 32 bits) are
> already used up in just the character's code. The new programming
> language supports fonts, color (foreground and background),
> attributes, size etc. Do you think it is a good idea to have to expand
> these basic character codes to 64/96/128 or even 256 bits in width
> just to cram it all in?
If you are going to encode all this formatting information on a per-character basis, you are going to have a lot of redundant information, which would make compression a given. Then why not go all the way and encode a 32-bit Unicode character, 24-bit foreground and background, etc. in 128 or 256 bits?
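For concreteness, one plausible packing of the 128-bit cell described
above; the field split is my own guess, not a proposal from the thread
(and bitfield layout is implementation-defined):

    #include <stdio.h>
    #include <stdint.h>

    struct CharCell {
        uint32_t codepoint;       /* 32-bit Unicode scalar value */
        uint32_t fg        : 24;  /* foreground, 0xRRGGBB        */
        uint32_t bold      : 1;
        uint32_t italic    : 1;
        uint32_t underline : 1;
        uint32_t           : 5;   /* spare attribute bits        */
        uint32_t bg        : 24;  /* background, 0xRRGGBB        */
        uint32_t size      : 8;   /* point size                  */
        uint32_t font_id;         /* index into a font table     */
    };

    int main(void)
    {
        /* 4 x 32 bits = 128 bits per character cell on common ABIs */
        printf("%zu bytes per cell\n", sizeof(struct CharCell));
        return 0;
    }

Since long runs of text share identical attribute words, even a trivial
run-length compressor would collapse most of the redundancy, which is
the point made above.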
> The web people would want to encode it all in ASCII HTML-style tags,
> which I think is a really bad idea.
Why? Most office suites have decent HTML export functionality. Various
HTML editors are available. HTML and XML are popular well beyond
web-type applications.
> I needed to define some additional characters that weren't in ASCII
> (and aren't in UNICODE) for the purposes of the programming language
> (which predates UNICODE and UTF-8, BTW)
I am curious to know which language that was, and which characters they
are.

-- 
Created with Opera's revolutionary e-mail program: http://www.opera.com/mail/
On Thu, 29 Jan 2009 09:28:09 -0600, "Paul K. McKneely"
<pkmckneely@sbcglobal.net> wrote:

> My fault for phrasing my original question badly. I should never have
> mentioned the words "character set". Forget that there is an internal
> encoding method used in the compiler tools for this new language,
> whose codes will never be seen by its users. The programming language
> supports only a subset of the complete UNICODE character set with
> respect to the Western European alphabetics.
There's probably more to be gained in the long term by sticking with a
current standard of encoding. I say this because the real
internationalisation issues are not in the character set, but in
translation and display. Western Europe is the least of your problems,
without even considering right-to-left display.

When you internationalise an application, even an embedded one, a
standard process is to send your text to be translated from English (7
bit ASCII plus a few specials) to your dealer, who translates the
messages into his/her language and sends it back to you. Because you're
writing a program for humans to use, you include things like dates,
times, and currency. These all vary in format across the world. In
addition, parameter order will vary in different spoken languages.

On a PC, consider what happens when a program written in English by
South Africans (three languages in daily use in the office) is run in
Hong Kong on a PC with a Chinese operating system, but for use by a
Russian engineer who wants his package to display Cyrillic (several
encodings available). This scenario has been seen in the wild. One
customer of ours supports 17 different spoken languages in multiple
encodings.

For most embedded systems it's not as extreme as that, especially as
there's usually no operating system. Despite this, having to support
several languages, even within the same country, is normal. We still
have to support varying display orders in embedded systems.

Stephen
-- 
Stephen Pelc, stephenXXX@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads
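The point above about parameter order is the classic motivation for
positional format arguments. A sketch using the %n$ printf notation,
which is a POSIX extension rather than ISO C; the message strings are
invented examples:

    #include <stdio.h>

    int main(void)
    {
        /* A translator can reorder the parameters in the format
           string without any change to the calling code. */
        const char *english = "%1$s has %2$d new messages\n";
        const char *other   = "%2$d new messages for %1$s\n";
        printf(english, "Anna", 5);
        printf(other,   "Anna", 5);
        return 0;
    }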
Paul K. McKneely wrote:
>> I still do not understand why you want to use your own internal
>> representation instead of e.g. UTF-8. For any language using a Latin
>> script for identifiers, the effective string length is 1.0x, or in
>> rare cases 1.1x, the length of the identifier. For Cyrillic or Greek,
>> the ratio is 2.0.
>
I would suggest you start by giving up on all your thoughts of specific
character sets. Simply make a straight decision now - you will use
UTF-8. No other encodings - no Latin-1, no UTF-16, no home-made
character sets, no extra fonts. Take it as a fixed decision and work
with it for a few days to see how it fits your needs. Look at existing
tools and source code that support UTF-8, and see how it can make your
work easier and give a result that users might actually be able to
*use*.

If you really put in this effort and find that UTF-8 does not fit your
needs, what have you lost? A couple of days' work here is a drop in the
ocean compared to the man-years it will take to work with your home-made
encoding, and you will at least have the benefit of a better
understanding of your problem. You might even be able to explain it to
other people in a way that makes sense.
> Simply encoding a kazillion different characters is not the whole
> picture. As Boudewijn Dijkstra pointed out, trying to alphabetize all
> of the potential UNICODE variables is impossible. (Those are his
> words, not mine, and the ramifications go far beyond just this issue.)
> So how do you alphabetize, search and list on an unwieldy character
> set for many purposes, such as showing a listing to the programmer
If you need to alphabetize, there should be no shortage of existing
library routines for sorting in UTF-8. It's not easy - differences in
locales can cause endless troubles, so you might not get a perfect
solution. But you'll find something that does a reasonable job and
*will* work perfectly for most programmers who stick to ASCII
identifiers.

A related problem is if you are making identifiers case-insensitive -
it's hard to figure out cases for non-ASCII characters. So stick to
case-sensitive identifiers.
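The "reasonable job" option can be as simple as byte-wise comparison,
since UTF-8 was designed so that byte order matches code-point order. A
sketch (my own, not code from the thread):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Byte-wise strcmp on UTF-8 sorts identifiers in Unicode
       code-point order: not locale-aware, but deterministic, and
       identical to plain ASCII ordering for ASCII-only names. */
    static int cmp_utf8(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    int main(void)
    {
        const char *ids[] = { "zeta", "alpha", "\u03C0_radius" };
        qsort(ids, 3, sizeof ids[0], cmp_utf8);
        for (int i = 0; i < 3; i++)
            puts(ids[i]);   /* alpha, zeta, then the Greek name */
        return 0;
    }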
> in his tool chain? That is not to mention that 21 bits (or 32 bits)
> are already used up in just the character's code.
I have no clue as to what you are talking about here.
> The new programming language supports fonts, color (foreground and
> background), attributes, size etc. Do you think it is a good idea to
> have to expand these basic character codes to 64/96/128 or even 256
> bits in width just to cram it all in? The web people would want to
> encode it all in ASCII HTML-style tags, which I think is a really bad
> idea.
Are you suggesting that you are including font, colour, etc., directly in the source code? And here was me thinking that a proprietary character encoding was an "amazingly stupid idea".
> The overwhelming consensus among responders to these threads is that
> they are not going to use anything beyond ASCII anyway. And with all
> of this text stuff, you haven't even begun to talk about how you are
> going to achieve all of the very advanced (and very difficult) stuff
> in the programming language (much of which hasn't ever been done
> before) while carrying this huge load of excess baggage
Who is "you" who are going to achieve all this? Do you mean the developers of the tools (i.e., you and your colleagues), or do you mean your users? And if it is us potential users, what is this "very advanced stuff" you are talking about? If we knew the specific aims of your language - what it is that makes it better than existing alternatives - it would be easier to advise you.
> on your back. I needed to define some additional characters that
> weren't in ASCII (and aren't in UNICODE) for the purposes of the
> programming language (which predates UNICODE and UTF-8, BTW). Citing
> the additional
First off, you do *not* need to define additional characters. It's conceivable that your tools might *benefit* from additional characters (although, as I said, we know nothing about your tools). But they don't *need* them. Secondly, Unicode has openings for additional domain-specific characters - you can add them without losing all the other benefits of Unicode (of course, you'll have to provide a suitable font).
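For reference, the openings mentioned above are Unicode's Private Use
Areas (U+E000..U+F8FF in the Basic Multilingual Plane, plus two
supplementary planes). A sketch of assigning a hypothetical custom
character there and encoding it in standard UTF-8 (the code point
assignment is invented):

    #include <stdio.h>
    #include <stdint.h>

    #define MY_OPERATOR 0xE000u   /* illustrative PUA assignment */

    /* Encode one Unicode scalar value as UTF-8; returns byte count. */
    static int encode_utf8(uint32_t cp, unsigned char out[4])
    {
        if (cp < 0x80)    { out[0] = (unsigned char)cp; return 1; }
        if (cp < 0x800)   { out[0] = 0xC0 | (cp >> 6);
                            out[1] = 0x80 | (cp & 0x3F); return 2; }
        if (cp < 0x10000) { out[0] = 0xE0 | (cp >> 12);
                            out[1] = 0x80 | ((cp >> 6) & 0x3F);
                            out[2] = 0x80 | (cp & 0x3F); return 3; }
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }

    int main(void)
    {
        unsigned char buf[4];
        int n = encode_utf8(MY_OPERATOR, buf);
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);   /* prints: EE 80 80 */
        putchar('\n');
        return 0;
    }

Any UTF-8-aware tool will pass such a character through untouched; only
the display needs a font that covers the private code point.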
> characters in APL as the downfall of that language is not well
> founded, in light of the fact that, when it came out, you had to put
> out a couple of thousand dollars for a hard-wired specialized terminal
> just to program in that language. That is besides the fact that it was
> not designed for the kinds of things that I want to do (such as
> writing operating systems and device drivers). Do you see my point(s)?
>
No, I don't see your point at all. It reads as though you are saying
APL's lack of popularity was due not to its extra characters, but to its
need for an expensive specialised terminal (which was solely because of
its special characters).

The main reason for APL's lack of popularity *is* the special
characters. Even though you no longer need special hardware (you use a
specialised keyboard map and extra fonts), the characters make the
language impossible to read and understand for the non-expert, and
extremely slow to enter expressions in. It is *vastly* easier to write,
for example, "range(R)" than "⍳R", because you don't have to find the
special character. It is also *vastly* easier to read, pronounce and
understand "range(R)" than "⍳R", even if you have never used the
language in question (Python).

To take an example from wikipedia's APL page, here is an expression to
give a list of prime numbers up to R:

    (~R∊R∘.×R)/R←1↓⍳R

The direct Python translation would be:

    [p for p in range(2, R+1)
       if not p in [x*y for x in range(2, R+1)
                        for y in range(2, R+1)]]

The APL version is certainly shorter - but it is nevertheless slower and
harder to write. APL's power and conciseness come from the power of its
built-in functions, not from the fact that most of them have a single
weird symbol instead of a multi-character name.
> Simple, lean and mean, but more powerful than anything we have now.
> That is what I am shooting for. When symbols need to be converted to
> whatever format when object files are produced, that's where the
> necessary conversions will be done. This will keep the core of the
> tools much simpler (and smaller, and faster) so that the whole project
> won't collapse when I try to do the really difficult things that were
> the primary goals I started out to accomplish in the first place.
>
>> So the extra memory consumption, e.g. in compiler symbol tables, is
>> negligible.
>>
>> Regarding linkers, UTF-8 global symbol names should not be a problem,
>> unless the object language uses the 8th bit for some kind of
>> signaling (such as end of string) or otherwise limits the valid bit
>> combinations.
>>
>> Of course the UTF-8 encoding may increase the identifier length, but
>> at least for a linker that usually examines only a specific number of
>> bytes, such as 32, the only risk is that two identifiers are not
>> unique within 32 bytes, i.e. 16 characters in Greek or Cyrillic or 10
>> graphs in some East-Asian script.
>>
>> Paul
>
> I do want you to know that I very much appreciate your input. This
> issue about object formats supporting UNICODE is going to be a real
> help when it comes time to generate machine code.
"Stephen Pelc" <stephenXXX@mpeforth.com> wrote in message 
news:498200b7.620856320@192.168.0.50...
> On a PC, consider what happens when a program written in English by
> South Africans (three languages in daily use in the office)
Oh really? Where in RSA, and what languages? (English, Afrikaans &
isiXhosa/isiZulu/Setswana...?) My wife and I were there in
September-October for almost 3 weeks. We did a loop from Cape Town to
Calvinia, Beaufort West, down to Oudtshoorn, Knysna and back.