Attention: European C/C++/C#/Java Programmers-Call for Input

Hi All,

My company is developing a new programming language
targeted at continuing with the original charter by the C
language for development of Operating Systems in a
HLL as well as applications, device drivers etc.  This
language has an extended character set and, although
all of the key words will (still) be in English, identifiers
(i.e. names of things) can use additional European
characters (such as those with accents, diaeresis, cedilla etc).
For efficiency, a 254-character subset of them is
going to be used in creating a character space
that encodes them into a single byte.  These will
not only be automatically byte-endian independent
but will also be in alphabetic order so that sorting
can take place directly on their numeric values.  What
I need from you is input so that I can select the most
appropriate set for the benefit of European programmers
who are obviously very talented at what they do.
My thought is that it would be great if European
programmers could give names to variables etc. in their
own native languages that have more meaning for
them than just plain old English words.  The character
subset includes full upper and lower case Greek as well
as Cyrillic.  I have seen Cyrillic (as well as Greek) with
various accent marks (presumably used by eastern
European countries) but there is not enough space in a
byte to add any of these.  However, I have added
quite a few to the basic Roman character set that is
used so much in English.  Since I am an American, I
don't have full appreciation for all of these special
marks and symbols and that is why I am asking for
your comments.  I apologize for the low resolution
of the glyphs (8 X 16).  I do have a TrueType version
in the works but it is incomplete.  In the table, columns
0-8 are Roman and its variants.  Greek is columns
9-B.  Cyrillic occupies columns C-F.  I was surprised
how neatly these fell into columns.  A reference on the
subject of European character sets would be
much appreciated.  For those of you who are happy
to give me feedback, I have attached a table that
I have been using that represents the current
subset used for identifiers.  You may respond
directly to me or to the news group for all to
see.  Much thanks to you.

Regards,

Paul King McKneely
technoventure, inc.

Reply by Boudewijn Dijkstra ● January 27, 2009

After reading your post, I must conclude that you are oblivious to key  
concepts and organizations surrounding internationalization and  
multilingual co-operations.  It is a good thing that you sought advice  
from an intelligible community before re-inventing the wheel (badly).

On Tue, 27 Jan 2009 15:09:33 +0100, Paul K. McKneely
<pkmckneely@sbcglobal.net> wrote:
> My company is developing a new programming language
> targeted at continuing with the original charter by the C
> language for development of Operating Systems in a
> HLL as well as applications, device drivers etc.  This
> language has an extended character set

Like Java does?
	http://java.sun.com/docs/books/jls/third_edition/html/lexical.html

> and, although
> all of the key words will (still) be in English, identifiers
> (i.e. names of things) can use additional European
> characters

Why just Europeans?  Lots of software is written by Israeli (Hebrew),  
North-African (Arabic), Chinese (thousands of ideographs in different  
families) and Japanese (Katakana) people.

> (such as those with accents, diaeresis, cedilla etc).
> For efficiency, a 254-character subset of them are
> going to be used in creating a character space
> that encodes them into a single byte.

Like ISO8859?
	http://www.unicode.org/Public/MAPPINGS/ISO8859/
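
A minimal Python sketch of what the ISO 8859 family implies in practice: each part is a separate single-byte encoding, so the same byte value decodes to a different letter in each, and text that mixes accented Latin, Greek and Cyrillic fits in none of the three parts tried here. The helper names are made up for illustration; the codec names are Python's standard aliases for ISO 8859-1, -5 and -7.

```python
# The same byte value, interpreted under three ISO 8859 parts.
RAW = bytes([0xE1])

# Python's standard codec aliases for ISO 8859-1 (Latin-1),
# ISO 8859-5 (Cyrillic) and ISO 8859-7 (Greek).
CODECS = ("iso8859_1", "iso8859_5", "iso8859_7")


def decode_everywhere(raw):
    """Decode the same bytes under each single-byte codec."""
    return {name: raw.decode(name) for name in CODECS}


def usable_codecs(text):
    """Return the codecs from CODECS that can encode every character of text."""
    usable = []
    for name in CODECS:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        usable.append(name)
    return usable


if __name__ == "__main__":
    print(decode_everywhere(RAW))        # three different letters
    print(usable_codecs("naïve"))        # accented Latin only
    print(usable_codecs("naïve πя"))     # Latin + Greek + Cyrillic mixed
```

A new 254-character set would join this list of incompatible encodings rather than replace it, which is the point of the question above.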

> These will
> not only be automatically byte-endian independent
> but will also be in alphabetic order so that sorting
> can take place directly on their numeric values.

Impossible.  Not every language sorts the same alphabet in the same way.   
E.g. sometimes accented characters are treated separately, sometimes they  
are 'equal' to the base character.  The process of comparing text for  
sorting purposes is called collation.
	http://unicode.org/reports/tr10/
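
To make the two treatments of accented characters concrete, here is a small Python sketch. It compares a plain sort on numeric character values with a sort whose key ignores accents. The word list and the `strip_accents` helper are illustrative assumptions; this is not the Unicode Collation Algorithm from TR10, only a demonstration that the numeric order and the "accent equals base letter" order disagree.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


WORDS = ["zebra", "école", "eclair", "étude", "apple"]

# Sorting directly on code point values: "é" (U+00E9) lands after "z".
by_code_point = sorted(WORDS)

# Treating accented letters as equal to their base letter; the original
# word is the tie-breaker so the result is deterministic.
accents_ignored = sorted(WORDS, key=lambda w: (strip_accents(w), w))

if __name__ == "__main__":
    print(by_code_point)
    print(accents_ignored)
```

No single assignment of byte values can produce both orders, so a fixed "alphabetic" encoding has to pick one convention and be wrong for the other.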

> What
> I need from you is input so that I can select the most
> appropriate set for the benefit of European programmers
> who are obviously very talented at what they do.
> My thought is that it would be great if European
> programmers could give names to variables etc. in their
> own native languages that have more meaning for
> them than just plain old English words.

As far as I'm concerned, English is the only language that should be seen  
in source code elements (except maybe string literals).  It is the  
language of choice for technical terms, the language from which  
programming languages derive their syntax, and overall the best known  
language amongst programmers worldwide.  English is one of the few  
languages without accents and with relatively short words, thus allowing  
relatively efficient typing.

There are two other arguments against your proposal:
- As companies grow, their code flows across language borders.  Should  
they hire translators to facilitate teams in different areas or hire  
teachers to teach everybody the language of choice?
- Multilingual countries like Belgium and Switzerland need to program in  
English in order to maintain the 'equality' of their individual languages.

> The character
> subset includes full upper and lower case Greek as well
> as Cyrillic.  I have seen Cyrillic (as well as Greek) with
> various accent marks (presumably used by eastern
> European countries) but there is not enough space in a
> byte to add any of these.  However, I have added
> quite a few to the basic Roman character set that is
> used so much in English.  Since I am an American, I
> don't have full appreciation for all of these special
> marks and symbols

Some of those are essential to be able to write common words in a given  
language.  (I hope that you will learn to appreciate the special marks and  
symbols used by your Spanish-speaking fellow-Americans (amongst others),  
before you inadvertently insult one.)

> and that is why I am asking for
> your comments.  I apologize for the low resolution
> of the glyphs (8 X 16).

I have received no glyphs.

> I do have a TrueType version
> in the works but it is incomplete.  In the table, columns
> 0-8 are Roman and its variants.  Greek is columns
> 9-B.  Cyrillic occupies columns C-F.  I was surprised
> how neatly these fell into columns.  A reference on the
> subject of European character sets would be
> much appreciated.

As stated, ISO8859 et al.  Note that Microsoft has sanctioned different  
character sets; Cp1252 is perhaps the most ubiquitous.
	http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/

> For those of you who are happy
> to give me feedback, I have attached

Many USENET servers don't accept attachments.  Please post a weblink.

> a table that
> I have been using that represents the current
> subset used for identifiers.  You may respond
> directly to me or to the news group for all to
> see.  Much thanks to you.




-- 
Made with Opera's revolutionary e-mail program:  
http://www.opera.com/mail/

Reply by Stefan Reuther ● January 27, 2009

Paul K. McKneely wrote:
> This language has an extended character set and, although
> all of the key words will (still) be in English, identifiers
> (i.e. names of things) can use additional European
> characters (such as those with accents, diaeresis, cedilla etc).

Like they already can in Java, C, and C++?

Support for Unicode characters is in the C and C++ standards, but many
compilers don't implement it. This may give you a hint how many people
want it. Being German, I am satisfied if I can use my fünny chäracters
in cömments and strings. But even there, any code that has a remote
chance of being shared with anyone else gets English comments. When a
Finn came along wanting to help with a program I wrote, I had quite some
work to explain my German comments (as an excuse, however, some of them
were at that time over 10 years old, written in a time where I was not
so fluent in English). But I would definitely not switch programming
languages just to use my funny characters.

> For efficiency, a 254-character subset of them are
> going to be used in creating a character space
> that encodes them into a single byte. These will
> not only be automatically byte-endian independent
> but will also be in alphabetic order so that sorting
> can take place directly on their numeric values.

Why ignore Unicode and invent yet another incompatible encoding? How
should people edit their source code? Remember, you'd have to build a
whole toolchain supporting your new character set. If my embedded
programs make serial outputs in German, they use the Latin
transcription, because terminal programs don't even agree upon whether
to use Latin-1 or Codepage-437/-850.

Automatic alphabetic sorting is not a useful goal one would want from a
character encoding, because it's not possible in general, and doesn't
save you any work if you want to do it right for your problem.

- In German telephone books, "ä" sorts as "ae" (the official Latin
  transcription). In German dictionaries, "ä" sorts as "a". In Finnish,
  it sorts after "z".

- Almost everywhere, "ß" sorts as "ss". It also doesn't have a
  widespread capital equivalent (although a Unicode code point has
  been allocated for it recently).

- In Turkish, the capital letter of "i" is "İ" (U+0130), and the
  lower-case letter of the thing you know as a capital "I" is "ı"
  (U+0131).
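
The three sorting conventions in the list above can be sketched with plain key functions in Python. Everything here is an illustrative assumption (the word list, the function names, the simplified Finnish alphabet); real collation libraries handle far more cases. The point is only that one word list yields three different "alphabetical" orders.

```python
def dictionary_key(word):
    """German dictionary style: an umlaut sorts like its base vowel."""
    table = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
    return word.lower().translate(table)


def phonebook_key(word):
    """German phone-book style: ä -> ae, ö -> oe, ü -> ue, ß -> ss."""
    table = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
    return word.lower().translate(table)


# Simplified Finnish alphabet: å, ä, ö are separate letters after z.
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
FINNISH_RANK = {ch: i for i, ch in enumerate(FINNISH_ALPHABET)}


def finnish_key(word):
    """Rank each letter by its position in the Finnish alphabet."""
    # Unknown characters are ranked after every known letter.
    return [FINNISH_RANK.get(ch, len(FINNISH_RANK)) for ch in word.lower()]


WORDS = ["Zebra", "Ärmel", "Afrika", "Adler"]

dictionary_order = sorted(WORDS, key=dictionary_key)
phonebook_order = sorted(WORDS, key=phonebook_key)
finnish_order = sorted(WORDS, key=finnish_key)

if __name__ == "__main__":
    print(dictionary_order)
    print(phonebook_order)
    print(finnish_order)
    # Python's default, locale-independent case mappings show the other
    # two bullets: ß has no single-letter capital, and the Turkish
    # dotted/dotless i pairs are not what these mappings produce.
    print("ß".upper(), "I".lower(), "ı".upper(), len("İ".lower()))
```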

Even though it might be possible to fit most Western and Central
European languages plus the standard ASCII repertoire into a common
8-bit character set, you'll probably have to ignore Cyrillic and Greek,
and still tweak a bit. Latin-1 and Latin-2 taken together have about 280
characters, not counting control characters.

One attempt at such a character set is the EBU character set used in
RDS/RDBS, e.g. ftp://ftp.rds.org.uk/pub/acrobat/rbds1998.pdf page 92; I
haven't checked how complete it is. However, it was probably designed
with the intent to implement it on 8-bit micros :-)

> A reference on the subject of European character sets would be
> much appreciated.

"The Unicode Standard, Version 5.0". Plus Wikipedia.

  Stefan

Reply by Boudewijn Dijkstra ● January 27, 2009

On Tue, 27 Jan 2009 19:54:46 +0100, Stefan Reuther
<stefan.news@arcor.de> wrote:
> Even though it might be possible to fit most Western and Central
> European languages plus the standard ASCII repertoire into a common
> 8-bit character set, you'll probably have to ignore Cyrillic and Greek,
> and still tweak a bit. Latin-1 and Latin-2 taken together have about 280
> characters, not counting control characters.
>
> One attempt at such a character set is the EBU character set used in
> RDS/RDBS, e.g. ftp://ftp.rds.org.uk/pub/acrobat/rbds1998.pdf page 92; I
> haven't checked how complete it is. However, it was probably designed
> with the intent to implement it on 8-bit micros :-)

And low-quality graphics, too.  E.g. Greek small and capital theta were
merged.  Also they have omitted Greek and Cyrillic A(lfa) and B(eta)
because the appearance is the same as Latin A and B.  So strictly speaking
it is not a character set but a glyph encoding.
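
The character-versus-glyph distinction can be shown in a few lines of Python. This is only a sketch using the standard `unicodedata` module; the `describe` helper is a made-up name. Latin A, Greek Alpha and Cyrillic A usually render identically, yet Unicode treats them as three separate characters.

```python
import unicodedata

# Three characters that usually share one glyph shape.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]


def describe(ch):
    """Return the code point label and the official Unicode name."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch))


if __name__ == "__main__":
    for ch in LOOKALIKES:
        print(ch, *describe(ch))
    # Same appearance, but no two of them compare equal.
    print(len(set(LOOKALIKES)))
```

An encoding that merges them saves table space but loses the ability to sort, case-map or search each script correctly.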

	"The three code-tables each contain almost all the characters in
	the international reference version of ISO Publication 646."

ISO 646 is the predecessor to Unicode; in that time they thought that 16
bits would be enough for all conceivable characters.




-- 
Made with Opera's revolutionary e-mail program:
http://www.opera.com/mail/

Reply by Frank Buss ● January 27, 2009

Boudewijn Dijkstra wrote:

> ISO 646 is the predecessor to Unicode; in that time they thought that 16
> bits would be enough for all conceivable characters.

Are there any important surrogate planes in unicode? I don't mean things
like this one :-)

http://www.unicode.org/charts/PDF/Unicode-5.1/U51-1F000.pdf
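
For reference, the chart linked above starts at U+1F000, which lies outside the 16-bit range, so UTF-16 has to represent it as a surrogate pair. A minimal Python sketch of the arithmetic follows; the function name is an assumption made for this example.

```python
def utf16_surrogates(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000            # 20 significant bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = utf16_surrogates(0x1F000)
    print(f"{high:04X} {low:04X}")
    # Cross-check against Python's own encoder.
    print(chr(0x1F000).encode("utf-16-be").hex())
```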

-- 
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

Reply by Falk Willberg ● January 27, 2009

Frank Buss wrote:
> Boudewijn Dijkstra wrote:
> 
>> ISO 646 is the predecessor to Unicode; in that time they thought that 16
>> bits would be enough for all conceivable characters.
> 
> Are there any important surrogate planes in unicode? I don't mean things
> like this one :-)
> 
> http://www.unicode.org/charts/PDF/Unicode-5.1/U51-1F000.pdf

I am looking forward to reading source code like this:

principal(de_tout arg_compteur, signe *arg_horaire)
{
البرنامج الرئيسي  //
  kokonaisluku	hakemisto;
  terwijl (de kleinere tellers zeven is) {
} "geschwofelte Klammer zu"
...

English or any other "lingua franca" is OK.

SCNR,
Falk

Reply by Paul K. McKneely ● January 27, 2009

Hi,

> I am looking forward to reading source code like this:
>
> principal(de_tout arg_compteur, signe *arg_horaire)
> {
> البرنامج الرئيسي  //
>  kokonaisluku hakemisto;
>  terwijl (de kleinere tellers zeven is) {
> } "geschwofelte Klammer zu"

Now that IS funny.  This is the very thing that the
programming community doesn't want.  Don't
forget, Arabic and Hebrew are read from right
to left.  Is the above code what an LR parser is for?
Or should it be called an LR/RL parser?
What I had in mind is more like π=3.1415926;
The English speaking world has used a lot of
Greek letters for variables during the past
few centuries.  It wouldn't be much of a
shock for programmers to suddenly be
able to use π instead of pi.
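
Some existing languages already accept the Greek letter pi as an identifier. The sketch below uses Python 3 purely as an illustration of the idea; it says nothing about the syntax of the language proposed in this thread, and `circle_area` is a made-up example function.

```python
# Python 3 source files are UTF-8 by default, and identifiers may
# contain non-ASCII letters, so the Greek letter works as a name.
π = 3.1415926


def circle_area(radius):
    """Area of a circle, written with the symbol used in the formula."""
    return π * radius ** 2


if __name__ == "__main__":
    print(circle_area(2.0))
    print("π".isidentifier())
```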

Paul

Reply by Paul K. McKneely ● January 27, 2009

"Boudewijn Dijkstra" <boudewijn@indes.com> wrote in message 
news:op.uoe8ixqyy6p7a2@azrael.lan...
> After reading your post, I must conclude that you are oblivious to key 
> concepts and organizations surrounding internationalization and 
> multilingual co-operations.  It is a good thing that you sought advice 
> from an intelligible community before re-inventing the wheel (badly).

Thank you for being so polite and humble.  Let me say that the
new language is not about internationalization.  It is about providing
a much more powerful programming environment than is available
with standard languages.  (I know I expect to get a lot of flames
from that last statement.  I understand that there are a lot of
insecure people in the world who will feel outrage with just about
anything I have to say.  Such is the price for a small amount of
useful feedback).

> Like Java does?
> http://java.sun.com/docs/books/jls/third_edition/html/lexical.html

Nobody in their right mind would try to write an operating system
(or a device driver!) in Java.  With no pointers and only signed
integers, it would be like programming with a straitjacket on.
And what would happen when an interrupt happened and the
Java engine decided it was time for garbage-collection in the
middle of an interrupt service routine?

> Why just Europeans?  Lots of software is written by Israeli (Hebrew), 
> North-African (Arabic), Chinese (thousands of ideographs in different 
> families) and Japanese (Katakana) people.

Let me answer your question with your own words:
> As far as I'm concerned, English is the only language that should be seen 
> in source code elements (except maybe string literals).  It is the 
> language of choice for technical terms, the language from which 
> programming languages derive their syntax, and overall the best known 
> language amongst programmers worldwide.  English is one of the few 
> languages without accents and with relatively short words, thus allowing 
> relatively efficient typing.

> Impossible.  Not every language sorts the same alphabet in the same way. 
> E.g. sometimes accented characters are treated separately, sometimes they 
> are 'equal' to the base character.  The process of comparing text for 
> sorting purposes is called collation.
> http://unicode.org/reports/tr10/
The output of the software development tool chain is
for programmers only.  I don't think everyone else will care if
the ordinal rules don't conform to every village on the planet.

> There are two other arguments against your proposal:
> - As companies grow, their code flows across language borders.  Should 
> they hire translators to facilitate teams in different areas or hire 
> teachers to teach everybody the language of choice?
> - Multilingual countries like Belgium and Switzerland need to program in 
> English in order to maintain the 'equality' of their individual languages.

I didn't propose anything.  I asked for input.  And your
comments are well taken but do not address my request.

> (I hope that you will learn to appreciate the special marks and  symbols 
> used by your Spanish-speaking fellow-Americans (amongst others),  before 
> you inadvertently insult one.)

Sort of like the way you started to insult me with your first remarks?
I can see that.  I'll try not to follow your lead and I might be okay.
I worked with a Mexican-American one time who did give me
useful feedback along with a funny story.  The ?/? are in the
character subset.

Paul

Reply by Paul Keinanen ● January 27, 2009

On Tue, 27 Jan 2009 17:31:01 -0600, "Paul K. McKneely"
<pkmckneely@sbcglobal.net> wrote:

>What I had in mind is more like π=3.1415926;
>The English speaking world has used a lot of
>Greek letters for variables during the past
>few centuries.  It wouldn't be much of a
>shock for programmers to suddenly be
>able to use π instead of pi.

That is one good usage for an extended character set that I would have
needed several times. 

However, I do not understand the need to invent yet another single
byte character encoding. Why not simply use Unicode with UTF-8
encoding and if necessary, restrict it with a suitable subset, such as
MES-2 or WGL-4
http://en.wikipedia.org/wiki/Unicode#Standardized_subsets
to simplify editing on various platforms  (availability of fonts
etc.).
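
A rough Python sketch of that approach: keep the source in UTF-8 and let the compiler front end reject identifiers outside an agreed subset. The subset test below (Latin, Greek and Cyrillic letters, chosen by Unicode character name) is a hypothetical stand-in, not the actual MES-2 or WGL-4 repertoire, and the function names are invented for this example.

```python
import unicodedata

# Hypothetical subset for illustration only; NOT the MES-2 or WGL-4 lists.
ALLOWED_SCRIPTS = ("LATIN", "GREEK", "CYRILLIC")


def is_allowed_char(ch):
    """Accept ASCII identifier characters and letters from the allowed scripts."""
    if ch.isascii():
        return ch.isalnum() or ch == "_"
    return unicodedata.name(ch, "").startswith(ALLOWED_SCRIPTS)


def is_allowed_identifier(name):
    """Well-formed identifier whose characters all fall inside the subset."""
    return name.isidentifier() and all(is_allowed_char(ch) for ch in name)


if __name__ == "__main__":
    for candidate in ["größe", "π", "длина", "שלום", "2π"]:
        print(candidate, is_allowed_identifier(candidate))
    # UTF-8 is a byte sequence, so there is no byte-order issue, and
    # plain ASCII source stays byte-for-byte unchanged.
    print("π".encode("utf-8").hex(), "pi".encode("utf-8"))
```

The restriction lives in the tool, not in the encoding, so editors, terminals and fonts that already handle UTF-8 keep working.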

Paul

Reply by Frank Buss ● January 28, 2009

Paul K. McKneely wrote:

> "Boudewijn Dijkstra" <boudewijn@indes.com> wrote in message 
> news:op.uoe8ixqyy6p7a2@azrael.lan...
> 
>> Like Java does?
>> http://java.sun.com/docs/books/jls/third_edition/html/lexical.html
> 
> Nobody in their right mind would try to write an operating system
> (or a device driver!) in Java. 

I think Boudewijn just wanted to show you a language that already has the
feature you want, so it might be helpful to take a look at it when
designing your new language.

> With no pointers and only signed
> integers, it would be like programming with a straitjacket on.
> And what would happen when an interrupt happened and the
> Java engine decided it was time for garbage-collection in the
> middle of an interrupt service routine?

Looks like there is already such a system:

http://www.jnode.org/node/132

I don't know it in detail, but it looks like they can use hardware resources
in an object-oriented and safe way:

http://www.jnode.org/node/40

And Microsoft has a research project, which uses a virtual machine for
implementing an OS:

http://research.microsoft.com/en-us/projects/singularity/

-- 
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
