EmbeddedRelated.com
Forums

Attention: European C/C++/C#/Java Programmers-Call for Input

Started by Paul K. McKneely January 27, 2009
Hi All,

My company is developing a new programming language
targeted at continuing with the original charter by the C
language for development of Operating Systems in a
HLL as well as applications, device drivers etc.  This
language has an extended character set and, although
all of the key words will (still) be in English, identifiers
(i.e. names of things) can use additional European
characters (such as those with accents, diaeresis, cedilla etc).
For efficiency, a 254-character subset of them is
going to be used in creating a character space
that encodes them into a single byte.  These will
not only be automatically byte-endian independent
but will also be in alphabetic order so that sorting
can take place directly on their numeric values.  What
I need from you is input so that I can select the most
appropriate set for the benefit of European programmers
who are obviously very talented at what they do.
My thought is that it would be great if European
programmers could give names to variables etc. in their
own native languages that have more meaning for
them than just plain old English words.  The character
subset includes full upper and lower case Greek as well
as Cyrillic.  I have seen Cyrillic (as well as Greek) with
various accent marks (presumably used by eastern
European countries) but there is not enough space in a
byte to add any of these.  However, I have added
quite a few to the basic Roman character set that is
used so much in English.  Since I am an American, I
don't have full appreciation for all of these special
marks and symbols and that is why I am asking for
your comments.  I apologize for the low resolution
of the glyphs (8 X 16).  I do have a TrueType version
in the works but it is incomplete.  In the table, columns
0-8 are Roman and its variants.  Greek is columns
9-B.  Cyrillic occupies columns C-F.  I was surprised
how neatly these fell into columns.  A reference on the
subject of European character sets would be
much appreciated.  For those of you who are happy
to give me feedback, I have attached a table that
I have been using that represents the current
subset used for identifiers.  You may respond
directly to me or to the news group for all to
see.  Much thanks to you.

Regards,

Paul King McKneely
technoventure, inc.


After reading your post, I must conclude that you are oblivious to key  
concepts and organizations surrounding internationalization and  
multilingual co-operations.  It is a good thing that you sought advice  
from an intelligible community before re-inventing the wheel (badly).

On Tue, 27 Jan 2009 15:09:33 +0100, Paul K. McKneely
<pkmckneely@sbcglobal.net> wrote:
> My company is developing a new programming language > targeted at continuing with the original charter by the C > language for development of Operating Systems in a > HLL as well as applications, device drivers etc. This > language has an extended character set
Like Java does? http://java.sun.com/docs/books/jls/third_edition/html/lexical.html
> and, although > all of the key words will (still) be in English, identifiers > (i.e. names of things) can use additional European > characters
Why just Europeans? Lots of software is written by Israeli (Hebrew), North-African (Arabic), Chinese (thousands of ideographs in different families) and Japanese (Katakana) people.
> (such as those with accents, diaeresis, cedilla etc). > For efficiency, a 254-character subset of them are > going to be used in creating a character space > that encodes them into a single byte.
Like ISO8859? http://www.unicode.org/Public/MAPPINGS/ISO8859/
> These will > not only be automatically byte-endian independent > but will also be in alphabetic order so that sorting > can take place directly on their numeric values.
Impossible. Not every language sorts the same alphabet in the same way. E.g. sometimes accented characters are treated separately, sometimes they are 'equal' to the base character. The process of comparing text for sorting purposes is called collation. http://unicode.org/reports/tr10/
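To make the collation point concrete, here is a minimal Python sketch. Both key functions and the extended alphabet string are hypothetical helpers invented for this example; real collation, as described in the Unicode report linked above, is considerably more involved.

```python
import unicodedata

# Hypothetical sample words; "\u00e4" is a-umlaut.
words = ["zeta", "alfa", "\u00e4lg"]

def key_accent_as_base(word):
    """Treat an accented letter as equal to its base letter."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Assumed alphabet for a language that places accented letters after "z".
ALPHABET_ACCENTS_LAST = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def key_accent_after_z(word):
    """Rank each letter by its position in the assumed alphabet."""
    composed = unicodedata.normalize("NFC", word.lower())
    return [ALPHABET_ACCENTS_LAST.index(c) for c in composed]

order_as_base = sorted(words, key=key_accent_as_base)
order_after_z = sorted(words, key=key_accent_after_z)

print(order_as_base)  # the a-umlaut word lands between "alfa" and "zeta"
print(order_after_z)  # the a-umlaut word lands after "zeta"
```

Both results are legitimate orderings for some language, so a single fixed numeric order of byte values can match at most one of them.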
> What > I need from you is input so that I can select the most > appropriate set for the benefit of European programmers > who are obviously very talented at what they do. > My thought is that it would be great if European > programmers could give names to variables etc. in their > own native languages that have more meaning for > them than just plain old English words.
As far as I'm concerned, English is the only language that should be seen in source code elements (except maybe string literals). It is the language of choice for technical terms, the language from which programming languages derive their syntax, and overall the best known language amongst programmers worldwide. English is one of the few languages without accents and with relatively short words, thus allowing relatively efficient typing.

There are two other arguments against your proposal:
- As companies grow, their code flows across language borders. Should they hire translators to facilitate teams in different areas or hire teachers to teach everybody the language of choice?
- Multilingual countries like Belgium and Switzerland need to program in English in order to maintain the 'equality' of their individual languages.
> The character > subset includes full upper and lower case Greek as well > as Cyrillic. I have seen Cyrillic (as well as Greek) with > various accent marks (presumably used by eastern > European countries)but there is not enough space in a > byte to add any of these. However, I have added > quite a few to the basic Roman character set that is > used so much in English. Since I am an American, I > don't have full appreciation for all of these special > marks and symbols
Some of those are essential to be able to write common words in a given language. (I hope that you will learn to appreciate the special marks and symbols used by your Spanish-speaking fellow-Americans (amongst others), before you inadvertently insult one.)
> and that is why I am asking for > your comments. I apologize for the low resolution > of the glyphs (8 X 16).
I have received no glyphs.
> I do have a TrueType version > in the works but it is incomplete. In the table, columns > 0-8 are Roman and its variants. Greek is columns > 9-B. Cyrillic occupies columns C-F. I was surprised > how neatly these fell into columns. A reference on the > subject of European character sets would be > much appreciated.
As stated, ISO8859 et al. Note that Microsoft has sanctioned different character sets; Cp1252 is perhaps the most ubiquitous. http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/
> For those of you who are happy > to give me feedback, I have attached
Many USENET servers don't accept attachments. Please post a weblink.
> a table that > I have been using that represents the current > subset used for identifiers. You may respond > directly to me or to the news group for all to > see. Much thanks to you.
-- Made with Opera's revolutionary e-mail program: http://www.opera.com/mail/
Paul K. McKneely wrote:
> This language has an extended character set and, although > all of the key words will (still) be in English, identifiers > (i.e. names of things) can use additional European > characters (such as those with accents, diaeresis, cedilla etc).
Like they already can in Java, C, and C++? Support for Unicode characters is in the C and C++ standards, but many compilers don't implement it. This may give you a hint how many people want it. Being German, I am satisfied if I can use my fünny chäracters in cömments and strings. But even there, any code that has a remote chance of being shared with anyone else gets English comments. When a Finn came along wanting to help with a program I wrote, I had quite some work to explain my German comments (as an excuse, however, some of them were at that time over 10 years old, written in a time where I was not so fluent in English). But I would definitely not switch programming languages just to use my funny characters.
> For efficiency, a 254-character subset of them are > going to be used in creating a character space > that encodes them into a single byte. These will > not only be automatically byte-endian independent > but will also be in alphabetic order so that sorting > can take place directly on their numeric values.
Why ignore Unicode and invent yet another incompatible encoding? How should people edit their source code? Remember, you'd have to build a whole toolchain supporting your new character set. If my embedded programs make serial outputs in German, they use the Latin transcription, because terminal programs don't even agree upon whether to use Latin-1 or Codepage-437/-850.

Automatic alphabetic sorting is not a useful goal for a character encoding, because it's not possible in general, and doesn't save you any work if you want to do it right for your problem.
- In German telephone books, "ä" sorts as "ae" (the official Latin transcription). In German dictionaries, "ä" sorts as "a". In Finnish, it sorts after "z".
- Almost everywhere, "ß" sorts as "ss". It also doesn't have a wide-spread capital equivalent (although a Unicode codepoint has been allocated for it recently).
- In Turkish, the capital letter of "i" is "İ" (U+0130), and the lower-case letter of the thing you know as a capital "I" is "ı" (U+0131).

Even though it might be possible to fit most Western and Central European languages plus the standard ASCII repertoire into a common 8-bit character set, you'll probably have to ignore Cyrillic and Greek, and still tweak a bit. Latin-1 and Latin-2 taken together have about 280 characters, not counting control characters.

One attempt at such a character set is the EBU character set used in RDS/RBDS, e.g. ftp://ftp.rds.org.uk/pub/acrobat/rbds1998.pdf page 92; I haven't checked how complete it is. However, it was probably designed with the intent to implement it on 8-bit micros :-)
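The sorting and case rules listed above can be tried out in a few lines of Python. This is only a sketch: the `PHONEBOOK` table and `phonebook_key` helper are hypothetical, and only the "ä" and "ß" rules are taken from the post (the "ö" and "ü" entries are assumptions added by analogy).

```python
# German phone-book ordering compares expanded spellings.
# Only the a-umlaut and sharp-s rules come from the post above;
# the o-umlaut and u-umlaut entries are assumptions for illustration.
PHONEBOOK = str.maketrans({
    "\u00e4": "ae",  # a-umlaut
    "\u00f6": "oe",  # o-umlaut (assumed)
    "\u00fc": "ue",  # u-umlaut (assumed)
    "\u00df": "ss",  # sharp s
})

def phonebook_key(word):
    """Lower-case the word, then expand umlauts and sharp s."""
    return word.lower().translate(PHONEBOOK)

names = ["Bank", "B\u00e4r"]
by_code_point = sorted(names)                     # raw numeric order
by_phonebook = sorted(names, key=phonebook_key)   # "baer" < "bank"

# Python's built-in case mapping does not depend on the locale.
upper_sharp_s = "stra\u00dfe".upper()   # one letter becomes two
lower_capital_i = "I".lower()           # "i", not the Turkish dotless i
upper_dotless_i = "\u0131".upper()      # plain "I"

print(by_code_point, by_phonebook)
print(upper_sharp_s, lower_capital_i, upper_dotless_i)
```

Note that the phone-book key maps one character to two, which a fixed one-byte-per-letter ordering cannot express, and that the default lower-casing of "I" gives the wrong letter for Turkish text.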
> A reference on the subject of European character sets would be > much appreciated.
"The Unicode Standard, Version 5.0". Plus Wikipedia. Stefan
On Tue, 27 Jan 2009 19:54:46 +0100, Stefan Reuther
<stefan.news@arcor.de> wrote:
> Even though it might be possible to fit most Western and Central > European languages plus the standard ASCII repertoire into a common > 8-bit character set, you'll probably have to ignore Cyrillic and Greek, > and still tweak a bit. Latin-1 and Latin-2 taken together have about 280 > characters, not counting control characters. > > One attempt at such a character set is the EBU character set used in > RDS/RBDS, e.g. ftp://ftp.rds.org.uk/pub/acrobat/rbds1998.pdf page 92; I > haven't checked how complete it is. However, it was probably designed > with the intent to implement it on 8-bit micros :-)
And low-quality graphics, too. E.g. Greek small and capital theta were merged. Also they have omitted Greek and Cyrillic A(lfa) and B(eta) because the appearance is the same as Latin A and B. So strictly speaking it is not a character set but a glyph encoding. "The three code-tables each contain almost all the characters in the international reference version of ISO Publication 646." ISO 646 is the predecessor to Unicode; in that time they thought that 16 bits would be enough for all conceivable characters.
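The difference between a glyph encoding and a character set can be seen by asking Unicode about three letters that look the same. A small Python sketch using the standard `unicodedata` module (the variable names are just for this example):

```python
import unicodedata

# Three capital letters that usually render with the same glyph.
look_alikes = ["A", "\u0391", "\u0410"]

for ch in look_alikes:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

names = [unicodedata.name(ch) for ch in look_alikes]
distinct = len(set(look_alikes))  # number of distinct characters
```

Unicode keeps these as three separate characters, whereas an encoding that merges them by appearance cannot tell which script a given letter belongs to.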
Boudewijn Dijkstra wrote:

> ISO 646 is the predecessor to Unicode; in that time they thought that 16 > bits would be enough for all conceivable characters.
Are there any important surrogate planes in Unicode? I don't mean things like this one :-)
http://www.unicode.org/charts/PDF/Unicode-5.1/U51-1F000.pdf
-- Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
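For readers unfamiliar with what lies beyond 16 bits: the first code point of the chart linked above, U+1F000, does not fit in a single 16-bit unit. A short Python sketch (variable names are made up for this example) shows how it is stored:

```python
# U+1F000 is the first code point of the chart linked above.
tile = "\U0001F000"

utf16 = tile.encode("utf-16-be")  # big-endian, no byte-order mark
utf8 = tile.encode("utf-8")

# Split the UTF-16 form into its two 16-bit code units.
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

print(hex(ord(tile)), utf16.hex(), utf8.hex())
print(hex(high), hex(low))
```

In UTF-16 the character takes a pair of surrogate code units, and in UTF-8 it takes four bytes, so neither form can assume one fixed-size unit per character.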
Frank Buss wrote:
> Boudewijn Dijkstra wrote: > >> ISO 646 is the predecessor to Unicode; in that time they thought that 16 >> bits would be enough for all conceivable characters. > > Are there any important surrogate planes in unicode? I don't mean things > like this one :-) > > http://www.unicode.org/charts/PDF/Unicode-5.1/U51-1F000.pdf
I am looking forward to reading source code like this:

principal(de_tout arg_compteur, signe *arg_horaire)
{
    البرنامج الرئيسي //
    kokonaisluku hakemisto;
    terwijl (de kleinere tellers zeven is) {
    } "geschwofelte Klammer zu"

...

English or any other "lingua franca" is OK.

SCNR,
Falk
Hi,

> I am looking forward to read source-code like this: > > principal(de_tout arg_compteur, signe *arg_horaire) > { > البرنامج الرئيسي // > kokonaisluku hakemisto; > terwijl (de kleinere tellers zeven is) { > } "geschwofelte Klammer zu"
Now that IS funny. This is the very thing that the programming community doesn't want. Don't forget, Arabic and Hebrew are read from right to left. Is the above code what an LR parser is for? Or should it be called an LR/RL parser?

What I had in mind is more like

π=3.1415926;

The English speaking world has used a lot of Greek letters for variables during the past few centuries. It wouldn't be much of a shock for programmers to suddenly be able to use π instead of pi.

Paul
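Some existing languages already accept a Greek letter as a variable name. As an illustration only (Python is not the language under discussion, and `circle_area` is a made-up function), the idea looks like this:

```python
# Python 3 accepts many non-ASCII letters in identifiers.
π = 3.1415926

def circle_area(radius):
    """Area of a circle, using the Greek-named constant."""
    return π * radius ** 2

area = circle_area(2.0)
greek_name_ok = "π".isidentifier()  # is the letter a valid identifier?

print(greek_name_ok, area)
```

The source file has to be stored in an encoding the compiler or interpreter agrees on (UTF-8 by default for Python 3), which is the toolchain cost raised earlier in the thread.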
"Boudewijn Dijkstra" <boudewijn@indes.com> wrote in message 
news:op.uoe8ixqyy6p7a2@azrael.lan...
> After reading your post, I must conclude that you are oblivious to key > concepts and organizations surrounding internationalization and > multilingual co-operations. It is a good thing that you sought advice > from an intelligible community before re-inventing the wheel (badly).
Thank you for being so polite and humble. Let me say that the new language is not about internationalization. It is about providing a much more powerful programming environment than is available with standard languages. (I know I expect to get a lot of flames from that last statement. I understand that there are a lot of insecure people in the world who will feel outrage with just about anything I have to say. Such is the price for a small amount of useful feedback).
> Like Java does? > http://java.sun.com/docs/books/jls/third_edition/html/lexical.html
Nobody in their right mind would try to write an operating system (or a device driver!) in Java. With no pointers and only signed integers, it would be like programming with a straitjacket on. And what would happen when an interrupt happened and the Java engine decided it was time for garbage-collection in the middle of an interrupt service routine?
> Why just Europeans? Lots of software is written by Israeli (Hebrew), > North-African (Arabic), Chinese (thousands of ideographs in different > families) and Japanese (Katakana) people.
Let me answer your question with your own words:
> As far as I'm concerned, English is the only language that should be seen > in source code elements (except maybe string literals). It is the > language of choice for technical terms, the language from which > programming languages derive their syntax, and overall the best known > language amongst programmers worldwide. English is one of the few > languages without accents and with relatively short words, thus allowing > relatively efficient typing.
> Impossible. Not every language sorts the same alphabet in the same way. > E.g. sometimes accented characters are treated separately, sometimes they > are 'equal' to the base character. The process of comparing text for > sorting purposes is called collation. > http://unicode.org/reports/tr10/
The output of the software development tool chain is for programmers only. I don't think everyone else will care if the ordinal rules don't conform to every village on the planet.
> There are two other arguments against your proposal: > - As companies grow, their code flows across language borders. Should > they hire translators to facilitate teams in different areas or hire > teachers to teach everybody the language of choice? > - Multilingual countries like Belgium and Switzerland need to program in > English in order to maintain the 'equality' of their individual languages.
I didn't propose anything. I asked for input. And your comments are well taken but do not address my request.
> (I hope that you will learn to appreciate the special marks and symbols > used by your Spanish-speaking fellow-Americans (amongst others), before > you inadvertently insult one.)
Sort of like the way you started to insult me with your first remarks? I can see that. I'll try not to follow your lead and I might be okay. I worked with a Mexican-American one time who did give me useful feedback along with a funny story. The ?/? are in the character subset. Paul
On Tue, 27 Jan 2009 17:31:01 -0600, "Paul K. McKneely"
<pkmckneely@sbcglobal.net> wrote:

>What I had in mind is more like π=3.1415926; >The English speaking world has used a lot of >Greek letters for variables during the past >few centuries. It wouldn't be much of a >shock for programmers to suddenly be >able to use π instead of pi.
That is one good usage for an extended character set that I would have needed several times.

However, I do not understand the need to invent yet another single byte character encoding. Why not simply use Unicode with UTF-8 encoding and, if necessary, restrict it to a suitable subset, such as MES-2 or WGL-4 (http://en.wikipedia.org/wiki/Unicode#Standardized_subsets), to simplify editing on various platforms (availability of fonts etc.)?

Paul
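A rough Python sketch of the "UTF-8 plus a restricted subset" idea follows. The ranges in `ALLOWED_RANGES` are a hypothetical subset chosen for this example; they are not the actual MES-2 or WGL-4 repertoires, and `in_subset` is a made-up helper.

```python
# Hypothetical identifier subset (NOT the real MES-2 or WGL-4 lists).
ALLOWED_RANGES = [
    (0x0020, 0x007E),  # Basic Latin (printable ASCII)
    (0x00A0, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0370, 0x03FF),  # Greek and Coptic
    (0x0400, 0x04FF),  # Cyrillic
]

def in_subset(text):
    """True if every character falls inside one of the allowed ranges."""
    return all(
        any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES)
        for ch in text
    )

accepted = in_subset("\u03c0_gr\u00f6\u00dfe")  # Greek pi plus German letters
rejected = in_subset("\u6570")                  # a CJK ideograph

# UTF-8 is a byte sequence, so there is no byte-order question,
# and plain ASCII text keeps its one-byte-per-character form.
pi_bytes = "\u03c0".encode("utf-8")
ascii_bytes = "count".encode("utf-8")

print(accepted, rejected, pi_bytes, len(ascii_bytes))
```

A tool chain could apply a check like this to identifiers while still reading and writing ordinary UTF-8 files that existing editors understand.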
Paul K. McKneely wrote:

> "Boudewijn Dijkstra" <boudewijn@indes.com> wrote in message > news:op.uoe8ixqyy6p7a2@azrael.lan... > >> Like Java does? >> http://java.sun.com/docs/books/jls/third_edition/html/lexical.html > > Nobody in their right mind would try to write an operating system > (or a device driver!) in Java.
I think Boudewijn just wanted to show you a language, which already has the feature you want, so maybe it would be helpful to take a look at it for designing your new language.
> With no pointers and only signed > integers, it would be like programming with a straitjacket on. > And what would happen when an interrupt happened and the > Java engine decided it was time for garbage-collection in the > middle of an interrupt service routine?
Looks like there is already such a system: http://www.jnode.org/node/132 I don't know it in detail, but looks like they can use hardware resources in an object oriented and safe way: http://www.jnode.org/node/40 And Microsoft has a research project, which uses a virtual machine for implementing an OS: http://research.microsoft.com/en-us/projects/singularity/ -- Frank Buss, fb@frank-buss.de http://www.frank-buss.de, http://www.it4-systems.de