EmbeddedRelated.com
Forums

Portable Assembly

Started by rickman May 27, 2017
On 5/31/2017 12:19 AM, Anssi Saari wrote:
> I've sometimes wondered what kind of development systems were used for
> those early 1980s home computers. Unreliable, slow and small storage
> media would've made it pretty awful to do development on target
> systems. I've read Commodore used a VAX for ROM development so they
> probably had a cross assembler there but other than that, not much idea.
You forget that those computers were typically small and ran small
applications. In the early 80's, we regularly developed products using
CP/M-hosted tools on generic Z80 machines, RIO-based tools on "Z-boxes",
motogorilla's tools on Exormacs, etc. None were much better than a 64K 8b
machine with one or two 1.4MB floppies. Even earlier, the MDS-800 systems
and ISIS-II, etc.

Your development *style* changes with the capabilities of the tools
available. E.g., in the 70's, I could turn the crank *twice* in an 8 hour
shift -- edit, assemble, link, burn ROMs, debug. So, each iteration *had*
to bring you closer to a finished product. You couldn't afford to just try
the "I wonder if THIS is the problem" game that seems so common today
("Heck, I can just try rebuilding everything and see if it NOW works...")

But, that doesn't necessarily limit you to the size of a final executable
(ever hear of overlays?) or the overall complexity of the product.
On 30/05/17 15:53, Dimiter_Popoff wrote:
> On 30.5.2017 &#1075;. 00:13, David Brown wrote: >> On 29/05/17 19:02, Dimiter_Popoff wrote: >>> On 29.5.2017 &#1075;. 16:06, David Brown wrote: >> >> <snipped some interesting stuff about VPA> >> >>> >>>> Some time it might be fun to look at some example functions, compiled >>>> for either the 68K or the PPC (or, better still, both) and compare both >>>> the source code and the generated object code to modern C and modern C >>>> compilers. (Noting that the state of C compilers has changed a great >>>> deal since you started making VLA.) >>> >>> Yes, I would also be curious to see that. Not just a function - as it >>> will likely have been written in assembly by the compiler author - but >>> some sort of standard thing, say a base64 encoder/decoder or some >>> vnc server thing etc. (the vnc server under dps is about 8 kilobytes, >>> just looked at it. Does one type of compression (RRE misused as RLE) and >>> raw). >>> >> >> To be practical, it /should/ be a function - or no more than a few >> functions. (I don't know why you think functions might be written in >> assembly by the compiler author - the compiler author is only going to >> provide compiler-assist functions such as division routines, floating >> point emulation, etc.) And it should be something that has a clear >> algorithm, so no one can "cheat" by using a better algorithm for the job. >> >> > > I am pretty sure I have seen - or read about - compiler generated > code where the compiler detects what you want to do and inserts > some assembly prewritten piece of code. Was something about CRC > or about tcp checksum, not sure - and it was someone who said that, > I don't know it from direct experience.
A compiler sees the source code you write, and generates object code that
does that job. It may be smart about it, but it will not insert
"pre-written assembly code". Code generation in compilers is usually
defined with some sort of templates (such as a pattern for reading data at
a register plus offset, or a pattern for doing a shift by a fixed size,
etc.). They are not "pre-written assembly", in that many of the details
are determined at generation time, such as registers, instruction
interleaving, etc.

The nearest you get to pre-written code from the compiler is in the
compiler support libraries. For example, if the target does not support
division instructions, or floating point, then the compiler will supply
routines as needed. These /might/ be written in assembly - but often they
are written in C.

A compiler /will/ detect patterns in your C code and use that to generate
object code rather than doing a "direct translation". The types of
patterns it can detect vary - it is one of the things that differentiates
between compilers. A classic example for the PPC would be:

#include <stdint.h>

uint32_t reverseLoad(uint32_t * p) {
    uint32_t x = *p;
    return ((x & 0xff000000) >> 24)
         | ((x & 0x00ff0000) >> 8)
         | ((x & 0x0000ff00) << 8)
         | ((x & 0x000000ff) << 24);
}

I am using gcc 4.8 here, since there is a convenient online version as
part of the <https://gcc.godbolt.org/> "compiler explorer". gcc is at 7.0
these days, and has advanced significantly since then - but that is the
version that is most convenient.

A direct translation (compiling with no optimisation) would be:

reverseLoad:
    stwu 1,-48(1)
    stw 31,44(1)
    mr 31,1
    stw 3,24(31)
    lwz 9,24(31)
    lwz 9,0(9)
    stw 9,8(31)
    lwz 9,8(31)
    srwi 10,9,24
    lwz 9,8(31)
    rlwinm 9,9,0,8,15
    srwi 9,9,8
    or 10,10,9
    lwz 9,8(31)
    rlwinm 9,9,0,16,23
    slwi 9,9,8
    or 10,10,9
    lwz 9,8(31)
    slwi 9,9,24
    or 9,10,9
    mr 3,9
    addi 11,31,48
    lwz 31,-4(11)
    mr 1,11
    blr

Gruesome, isn't it? Compiling with -O0 puts everything on the stack rather
than holding variables in registers. Code like that was used in the old
days - perhaps at the time when you decided you needed something better
than C. But even then, it was mainly only for debugging - since debugger
software was not good enough to handle variables in registers.

Next up, -O1 optimisation. This is a level where the code becomes
sensible, but not too smart - and it is not uncommon to use it in
debugging because you usually get a one-to-one correspondence between
lines in the source code and blocks of object code. It makes it easier to
do single stepping.

reverseLoad:
    lwz 9,0(3)
    slwi 3,9,24
    srwi 10,9,24
    or 3,3,10
    rlwinm 10,9,24,16,23
    or 3,3,10
    rlwinm 9,9,8,8,15
    or 3,3,9
    blr

Those who can understand the PPC's bit field instruction "rlwinm" will see
immediately that this is a straightforward translation of the source code,
but with all data held in registers.

But if we ask for smarter optimisation, with -O2, we get:

reverseLoad:
    lwbrx 3,0,3
    blr

This is, of course, optimal. (Even the function call overhead will be
eliminated if the compiler can do so when the function is used.)
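Incidentally, gcc also provides a builtin for byte swapping, so the long
shift-and-mask pattern does not even need to be written out by hand. A
shorter version (assuming a gcc-compatible compiler - __builtin_bswap32 is
an extension, not standard C) would be:

#include <stdint.h>

/* Same byte reverse, using the gcc builtin (gcc/clang extension). */
uint32_t reverseLoad_b(uint32_t * p) {
    return __builtin_bswap32(*p);
}

With -O2 this should likewise reduce to a single lwbrx on the PPC.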
> > But if the compiler does this it will be obvious enough.
If you had some examples or references, it would be easier to see what you mean.
> > Anyway, a function would do - if complex and long enough to > be close to real life, i.e. a few hundred lines.
A function that is a few hundred lines of source code is /not/ real life - it is broken code. Surely in VPA you divide your code into functions of manageable size, rather than single massive functions?
>
> But I don't see why not compare written stuff, I just checked
> again on that vnc server for dps - not 8k, closer to 11k (the 8k
> I saw was a half-baked version, no keyboard tables inside it etc.;
> the complete version also includes a screen mask to allow it
> to ignore mouse clicks at certain areas, that sort of thing).
> Add to it some menu (it is command line option driven only),
> a much more complex menu than windows and android RealVNC has
> I have and it adds up to 25k.
> Compare this to the 350k exe for windows or to the 4M for Android
> (and the android does only raw...) and the picture is clear enough
> I think.
>
A VNC server is completely useless for such a test. It is far too complex,
with far too much variation in implementation and features, too many
external dependencies on an OS or other software (such as for networking),
and far too big for anyone to bother with such a comparison.

You specifically need something /small/. The algorithm needs to be simple
and clearly expressible. Total source code lines in C should be no more
than about 100, with no more than perhaps 3 or 4 functions. Smaller than
that would be better, as it would make it easier for us to understand the
VPA and see its benefits. Here is a possible example:

// Type for the data - this can easily be changed
typedef float data_t;

static int max(int a, int b) {
    return (a > b) ? a : b;
}

static int min(int a, int b) {
    return (a < b) ? a : b;
}

// Calculate the convolution of two input arrays pointed to by
// pA and pB, placing the results in the output array pC.
void convolute(const data_t * pA, int lenA,
               const data_t * pB, int lenB,
               data_t * pC, int lenC)
{
    // i is the index of the output sample, running from 0 to lenC - 1
    // For each i, we calculate the sum as j goes from -inf to +inf
    //   of A(j) * B(i - j)
    // Clearly we can limit j to the range 0 to (lenA - 1)
    // We use k to hold i - j, which will run down as j runs up.
    // k will be limited to (lenB - 1) down to 0.
    // From (i - j) >= 0, we have j <= i
    // From (i - j) < lenB, we have j > (i - lenB)
    // These give us tighter bounds on the run of j

    for (int i = 0; i < lenC; i++) {
        int firstJ = max(0, 1 + i - lenB);
        int endJ = min(lenA, i + 1);
        data_t x = 0;
        for (int j = firstJ; j < endJ; j++) {
            int k = i - j;
            x += (pA[j] * pB[k]);
        }
        pC[i] = x;
    }
}

With gcc 4.8 for the PPC, that's about 55 lines of assembly. An
interesting point is that the size and instructions are very similar with
-O1 and -O2, but the ordering is significantly different - with -O2, the
pipeline scheduling is considered. (I don't know which particular cpu
model is used for scheduling by default in gcc.)

To be able to compare with VPA, you'd have to write this algorithm in VPA.
Then you could compare various points. It should be easy enough to look at
the size of the code. For speed comparison, we'd have to know your target
processor and compile specifically for that (to get the best scheduling,
and to handle small differences in the availability of particular
instructions). Then you would need to run the code - I don't have any PPC
boards conveniently on hand, and of course you are the only one with VPA
tools. Comparing code clarity and readability is, of course, difficult -
but you could publish your VPA version and we can maybe get an idea.
Productivity is also hard to measure. For a function like this, the time
is spent on the details of the algorithm and getting the right bounds on
the loops - the actual C code is easy.

You can get a gcc 5.2 cross-compiler for PPC for Windows from here
<http://tranaptic.ca/wordpress/downloads/>, or you can use the online
compiler at <https://gcc.godbolt.org/>. The PowerPC is not nearly as
popular an architecture as ARM, and it is harder to find free ready-built
tools (though there are plenty of guides to building them yourself, and
you can get supported commercial versions of modern gcc from
Mentor/CodeSourcery). You can also find tools directly from Freescale/NXP.
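To make a comparison concrete, a small test harness could look like this
(just a sketch of mine - the expected values are the hand-computed
convolution of the two short arrays):

#include <stdio.h>

/* Assumes the data_t typedef and convolute() definition above. */

int main(void) {
    const data_t a[] = {1, 2, 3};          /* lenA = 3 */
    const data_t b[] = {4, 5};             /* lenB = 2 */
    data_t c[4];                           /* lenC = lenA + lenB - 1 */

    convolute(a, 3, b, 2, c, 4);

    /* Expected output: 4, 13, 22, 15 */
    for (int i = 0; i < 4; i++) {
        printf("c[%d] = %g\n", i, c[i]);
    }
    return 0;
}

Something equally small written in VPA would then give a like-for-like
comparison of source size and object size.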
On 31.5.17 10:19, Anssi Saari wrote:
> David Brown <david.brown@hesbynett.no> writes:
>
>> Writing a game involves a great deal more than just the coding.
>> Usually, the coding is in fact just a small part of the whole effort -
>> all the design of the gameplay, the storyline, the graphics, the music,
>> the algorithms for interaction, etc., is inherently cross-platform. The
>> code structure and design is also mostly cross-platform. Some parts
>> (the graphics and the music) need adapted to suit the limitations of the
>> different target platforms. The final coding in assembly would be done
>> by hand for each target.
>
> I've sometimes wondered what kind of development systems were used for
> those early 1980s home computers. Unreliable, slow and small storage
> media would've made it pretty awful to do development on target
> systems. I've read Commodore used a VAX for ROM development so they
> probably had a cross assembler there but other than that, not much idea.
I used an Intel MDS and a Data General Eclipse to bootstrap a Z80-based
CP/M computer (self made). After that, the CP/M system could be used to
create the code, though the 8 inch floppies were quite small for the task.

--
-Tauno Voipio
On 31.5.2017 г. 12:36, David Brown wrote:
> On 30/05/17 15:53, Dimiter_Popoff wrote: >> On 30.5.2017 &#1075;. 00:13, David Brown wrote: >>> On 29/05/17 19:02, Dimiter_Popoff wrote: >>>> On 29.5.2017 &#1075;. 16:06, David Brown wrote: >>> >>> <snipped some interesting stuff about VPA> >>> >>>> >>>>> Some time it might be fun to look at some example functions, compiled >>>>> for either the 68K or the PPC (or, better still, both) and compare both >>>>> the source code and the generated object code to modern C and modern C >>>>> compilers. (Noting that the state of C compilers has changed a great >>>>> deal since you started making VLA.) >>>> >>>> Yes, I would also be curious to see that. Not just a function - as it >>>> will likely have been written in assembly by the compiler author - but >>>> some sort of standard thing, say a base64 encoder/decoder or some >>>> vnc server thing etc. (the vnc server under dps is about 8 kilobytes, >>>> just looked at it. Does one type of compression (RRE misused as RLE) and >>>> raw). >>>> >>> >>> To be practical, it /should/ be a function - or no more than a few >>> functions. (I don't know why you think functions might be written in >>> assembly by the compiler author - the compiler author is only going to >>> provide compiler-assist functions such as division routines, floating >>> point emulation, etc.) And it should be something that has a clear >>> algorithm, so no one can "cheat" by using a better algorithm for the job. >>> >>> >> >> I am pretty sure I have seen - or read about - compiler generated >> code where the compiler detects what you want to do and inserts >> some assembly prewritten piece of code. Was something about CRC >> or about tcp checksum, not sure - and it was someone who said that, >> I don't know it from direct experience. > > A compiler sees the source code you write, and generates object code > that does that job. It be smart about it, but it will not insert > "pre-written assembly code".
Code generation in compilers is usually
> defined with some sort of templates (such a pattern for reading data at > a register plus offset, or a pattern for doing a shift by a fixed size, > etc.). They are not "pre-written assembly", in that many of the details > are determined at generation time, such as registers, instruction > interleaving, etc. > > The nearest you get to pre-written code from the compiler is in the > compiler support libraries. For example, if the target does not support > division instructions, or floating point, then the compiler will supply > routines as needed. These /might/ be written in assembly - but often > they are written in C. > > > A compiler /will/ detect patterns in your C code and use that to > generate object code rather than doing a "direct translation". The
> types of patterns it can detect varies - it is one of the things that
> differentiates between compilers.

We are referring to the same thing under different names - again.

At the end of the day everything the compiler generates is written in
plain assembly, it must be executable by the CPU. By "prewritten" I mean
some sort of template which gets filled in with addresses etc. before
committing. Only the compiler writers themselves know to what lengths
they go to make common cases look good; my memory is vague but I do think
the guy who said that a few years ago knew what he was talking about.
>.. A classic example for the PPC would be: > > #include <stdint.h> > > uint32_t reverseLoad(uint32_t * p) { > uint32_t x = *p; > return ((x & 0xff000000) >> 24) > | ((x & 0x00ff0000) >> 8) > | ((x & 0x0000ff00) << 8) > | ((x & 0x000000ff) << 24); > } > > I am using gcc 4.8 here, since there is a convenient online version as > part of the <https://gcc.godbolt.org/> "compiler explorer". gcc is at > 7.0 these days, and has advanced significantly since then - but that is > the version that is most convenient. > > A direct translation (compiling with no optimisation) would be: > > reverseLoad: > stwu 1,-48(1) > stw 31,44(1) > mr 31,1 > stw 3,24(31) > lwz 9,24(31) > lwz 9,0(9) > stw 9,8(31) > lwz 9,8(31) > srwi 10,9,24 > lwz 9,8(31) > rlwinm 9,9,0,8,15 > srwi 9,9,8 > or 10,10,9 > lwz 9,8(31) > rlwinm 9,9,0,16,23 > slwi 9,9,8 > or 10,10,9 > lwz 9,8(31) > slwi 9,9,24 > or 9,10,9 > mr 3,9 > addi 11,31,48 > lwz 31,-4(11) > mr 1,11 > blr > > Gruesome, isn't it? Compiling with -O0 puts everything on the stack > rather than holding variables in registers. Code like that was used in > the old days - perhaps at the time when you decided you needed something > better than C. But even then, it was mainly only for debugging - since > debugger software was not good enough to handle variables in registers. > > Next up, -O1 optimisation. This is a level where the code becomes > sensible, but not too smart - and it is not uncommon to use it in > debugging because you usually get a one-to-one correspondence between > lines in the source code and blocks of object code. It makes it easier > to do single stepping. > > reverseLoad: > lwz 9,0(3) > slwi 3,9,24 > srwi 10,9,24 > or 3,3,10 > rlwinm 10,9,24,16,23 > or 3,3,10 > rlwinm 9,9,8,8,15 > or 3,3,9 > blr > > Those that can understand the PPC's bit field instruction "rlwinm" will > see immediately that this is a straightforward translation of the source > code, but with all data held in registers. > > But if we ask for smarter optimisation, with -O2, we get: > > reverseLoad: > lwbrx 3,0,3 > blr > > This is, of course, optimal. (Even the function call overhead will be > eliminated if the compiler can do so when the function is used.)
Above all this is a good example of how limiting the high level language
is. Just look at the source and then at the final result.

You will get *exactly* the same result (minus the return) with no
optimization in VPA from the line:

 mover.l (source),r3

Logic optimization is more or less a kindergarten exercise. If you need
logic optimization you don't know what you are doing anyway, so the
compiler won't be able to help much, no matter how good.

Of course if you stick by a phrase book at source level - as is the case
with *any* high level language - you will need plenty of optimization,
like your example demonstrates. I bet it will be good only in demo cases
like yours and much less useful in real life, so the only benefit of
writing this in C is the source length, 10+ times what is necessary (I
counted it, and I included a return line in the count: 238 vs. 23 bytes).

While 10 times more typing may seem no serious issue to many, a 10 times
higher chance to insert an error is no laughing matter, and 10 times more
obscurity just because of that is a productivity killer.
>> But if the compiler does this it will be obvious enough. > > If you had some examples or references, it would be easier to see what > you mean. > >> >> Anyway, a function would do - if complex and long enough to >> be close to real life, i.e. a few hundred lines. > > A function that is a few hundred lines of source code is /not/ real life > - it is broken code. Surely in VLA you divide your code into functions > of manageable size, rather than single massive functions?
I meant "function" not in the C subroutine kind of sense; I meant it more
as "functionality", i.e. some code doing some job. How it is split into
pieces etc. will depend on many factors - language, programmer style
etc. - not relevant to this discussion.
>> >> But I don't see why not compare written stuff, I just checked >> again on that vnc server for dps - not 8k, closer to 11k (the 8k >> I saw was a half-baked version, no keyboard tables inside it etc.; >> the complete version also includes a screen mask to allow it >> to ignore mouse clicks at certain areas, that sort of thing). >> Add to it some menu (it is command line option driven only), >> a much more complex menu than windows and android RealVNC has >> I have and it adds up to 25k. >> Compare this to the 350k exe for windows or to the 4M for Android >> (and the android does only raw...) and the picture is clear enough >> I think. >> > > A VNC server is completely useless for such a test. It is far too > complex, with far too much variation in implementation and features, too > many external dependencies on an OS or other software (such as for > networking), and far too big for anyone to bother with such a comparison.
Actually I think a comparison between two pieces of code doing the same
thing is quite telling when the difference is of orders of magnitude, as
in this case. Writing small benchmarking-toy sort of stuff is a waste of
time; I am interested in end results.
> > You specifically need something /small/.
No, something "small" is a kind of kindergarten exercise again, it can
only be good enough to fool someone into believing this or that.
It is end results which count.

Dimiter

======================================================
Dimiter Popoff, TGI             http://www.tgi-sci.com
======================================================
http://www.flickr.com/photos/didi_tgi/
On 01/06/17 21:43, Dimiter_Popoff wrote:
> On 31.5.2017 &#1075;. 12:36, David Brown wrote: >> On 30/05/17 15:53, Dimiter_Popoff wrote: >>> On 30.5.2017 &#1075;. 00:13, David Brown wrote: >>>> On 29/05/17 19:02, Dimiter_Popoff wrote: >>>>> On 29.5.2017 &#1075;. 16:06, David Brown wrote: >>>> >>>> <snipped some interesting stuff about VPA> >>>> >>>>> >>>>>> Some time it might be fun to look at some example functions, compiled >>>>>> for either the 68K or the PPC (or, better still, both) and compare >>>>>> both >>>>>> the source code and the generated object code to modern C and >>>>>> modern C >>>>>> compilers. (Noting that the state of C compilers has changed a great >>>>>> deal since you started making VLA.) >>>>> >>>>> Yes, I would also be curious to see that. Not just a function - as it >>>>> will likely have been written in assembly by the compiler author - but >>>>> some sort of standard thing, say a base64 encoder/decoder or some >>>>> vnc server thing etc. (the vnc server under dps is about 8 kilobytes, >>>>> just looked at it. Does one type of compression (RRE misused as >>>>> RLE) and >>>>> raw). >>>>> >>>> >>>> To be practical, it /should/ be a function - or no more than a few >>>> functions. (I don't know why you think functions might be written in >>>> assembly by the compiler author - the compiler author is only going to >>>> provide compiler-assist functions such as division routines, floating >>>> point emulation, etc.) And it should be something that has a clear >>>> algorithm, so no one can "cheat" by using a better algorithm for the >>>> job. >>>> >>>> >>> >>> I am pretty sure I have seen - or read about - compiler generated >>> code where the compiler detects what you want to do and inserts >>> some assembly prewritten piece of code. Was something about CRC >>> or about tcp checksum, not sure - and it was someone who said that, >>> I don't know it from direct experience. >> >> A compiler sees the source code you write, and generates object code >> that does that job. It be smart about it, but it will not insert >> "pre-written assembly code". > Code generation in compilers is usually >> defined with some sort of templates (such a pattern for reading data at >> a register plus offset, or a pattern for doing a shift by a fixed size, >> etc.). They are not "pre-written assembly", in that many of the details >> are determined at generation time, such as registers, instruction >> interleaving, etc. >> >> The nearest you get to pre-written code from the compiler is in the >> compiler support libraries. For example, if the target does not support >> division instructions, or floating point, then the compiler will supply >> routines as needed. These /might/ be written in assembly - but often >> they are written in C. >> >> >> A compiler /will/ detect patterns in your C code and use that to >> generate object code rather than doing a "direct translation". The >> types of patterns it can detect varies - it is one of the things that >> differentiates between compilers. > > We are referring to the same thing under different names - again.
OK. I think your naming and description is odd, but I am glad to see we are getting a better understanding of what the other is saying.
> At the end of the day everything the compiler generates is written > in plain assembly, it must be executable by the CPU. > Under "prewritten" I mean some sort of template which gets filled > with addresses etc. thing before committing.
I think of "prewritten" as referring to larger chunks of assembly code,
with much more concrete choices of values, registers, scheduling, etc. You
described the "prewritten" code as being easily recognisable - in reality,
the majority of the code from modern compilers is generated from very
small templates with great variability. And on a processor like the PPC,
these will be intertwined with each other according to the best scheduling
for the chip.

As an example, if we have the function:

int foo0(int * p) {
    int a = *p * *p;
    return a;
}

The template for reading "*p" generates

    lwz 3,0(3)

(Register r3 is used for the first parameter in the PPC eabi. It is also
used for the return value from a function, which is why it may seem "over
used" in the examples here. In bigger code, and when the compiler can
inline functions, it will be more flexible about register choices. I don't
know whether you follow the standard PPC eabi in your tools.)

Multiplication is another template:

    mullw 3,3,3

As is function exit, in this case just:

    blr

I find it very strange to consider these as "pre-written assembly". And if
the function is more complex, the intertwining causes more mixups, making
it less "pre-written":

int foo1(int * p, int * q) {
    int a = *p * *p;
    int b = *q * *q;
    return a + b;
}

foo1:
    lwz 9,0(3)
    lwz 10,0(4)
    mullw 9,9,9
    mullw 3,10,10
    add 3,9,3
    blr
> To what lengths the compiler writers go to make common cases look > good know only the writers themselves, my memory is vague but I > do think the guy who said that a few years ago knew what he was > talking about.
Well, it is known to the compiler writers and to users who look at the generated code! Certainly there is plenty of variation between tools, with more advanced compilers working harder at this sort of thing. Command line switches with choices of optimisation levels can also make a big difference. How much experience do you have of using C compilers, and studying their output?
>
>> .. A classic example for the PPC would be:
>>
>> #include <stdint.h>
>>
>> uint32_t reverseLoad(uint32_t * p) {
>>     uint32_t x = *p;
>>     return ((x & 0xff000000) >> 24)
>>          | ((x & 0x00ff0000) >> 8)
>>          | ((x & 0x0000ff00) << 8)
>>          | ((x & 0x000000ff) << 24);
>> }
>>
<skipping the details>
>> But if we ask for smarter optimisation, with -O2, we get:
>>
>> reverseLoad:
>>     lwbrx 3,0,3
>>     blr
>>
>> This is, of course, optimal. (Even the function call overhead will be
>> eliminated if the compiler can do so when the function is used.)
>
> Above all this is a good example how limiting the high level language
> is. Just look at the source and then at the final result.
No, that is a good example of how smart the compiler is (or can be) about
generating optimal code from the source.

You may in addition view this as a limitation of the C language, which has
no direct way to specify a "byte reversed pointer". That is fair enough.
However, it is not really any harder than defining a function like this,
and then using it. For situations where the compiler can't generate ideal
code, and it is particularly useful to get such optimal assembly, it is
also possible to write a simple little inline assembly function - it is
not really any harder than writing the same thing in "normal" assembly.

Another option (for newer gcc) is to define the endianness of a struct.
Then you can access the fields directly, and the loads and stores will be
reversed as needed.

typedef struct __attribute__((scalar_storage_order ("little-endian"))) {
    uint32_t x;
} le32_t;

uint32_t reverseLoad2(le32_t * p) {
    return p->x;
}

reverseLoad2:
    lwbrx 3,0,3
    blr

So the high level language gives you a number of options, with specific
tools giving more options, and the implementation gives you efficient
object code in the end. You might need to define a function or macro
yourself, but that is a one-time job.
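For reference, the inline assembly route mentioned above could look
roughly like this with gcc (my sketch - the exact constraints may need
adjusting for your gcc version and ABI):

#include <stdint.h>

/* Byte-reversing load via gcc inline assembly (sketch, assumes gcc on PPC).
   "b" asks for a base register other than r0; the "m"(*p) input tells the
   compiler that the pointed-to memory is read. */
static inline uint32_t reverseLoad3(const uint32_t * p) {
    uint32_t x;
    __asm__("lwbrx %0,0,%1" : "=r"(x) : "b"(p), "m"(*p));
    return x;
}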
> > You will get *exactly* the same result (- the return) with no > optimization in vpa from the line: > > mover.l (source),r3
When you say "no optimisation" here, does that mean that VPA supports some kinds of optimisations?
> > Logic optimization is more or less a kindergarten exercise. If you need > logic optimization you don't know what you are doing anyway so the > compiler won't be able to help much, no matter how good.
What do you mean by "logic optimisation"?

It is normal for a good compiler to do a variety of strength reduction and
other re-arrangements of code to give you something with the same result,
but more efficient execution. And it is a /good/ thing that the compiler
does that - it means you can write your source code in the clearest and
most maintainable fashion, and let the compiler generate better code.

For example, if you have a simple division by a constant:

uint32_t divX(uint32_t a) {
    return a / 5;
}

The direct translation of this would be:

divX:
    li 4,5
    divwu 3,3,4
    blr

But a compiler can do better:

divX:   // divide by 5
    lis 9,0xcccc
    ori 9,9,52429
    mulhwu 3,3,9
    srwi 3,3,2
    blr

Such optimisation is certainly not a "kindergarten exercise", and doing it
by hand is hardly a maintainable or flexible solution. Changing the
denominator to 7 means significant changes:

divX:   // divide by 7
    lis 9,0x2492
    ori 9,9,18725
    mulhwu 9,3,9
    subf 3,9,3
    srwi 3,3,1
    add 3,9,3
    srwi 3,3,2
    blr
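The arithmetic behind that divide-by-5 sequence can be written out and
checked in plain C - this is just an illustrative sketch of the trick, not
compiler output:

#include <stdint.h>
#include <assert.h>

/* Divide-by-5 via multiply-high: floor(a/5) == (a * 0xCCCCCCCD) >> 34
   for all 32-bit a.  0xCCCCCCCD is ceil(2^34 / 5); the compiler picks
   such constants automatically. */
static uint32_t div5(uint32_t a) {
    return (uint32_t)(((uint64_t)a * 0xCCCCCCCDu) >> 34);
}

int main(void) {
    /* Spot-check a few values (an exhaustive loop over all 2^32 inputs
       also passes, it just takes a while). */
    uint32_t tests[] = { 0, 1, 4, 5, 6, 12345u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        assert(div5(tests[i]) == tests[i] / 5);
    }
    return 0;
}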
> > Of course if you stick by a phrase book at source level - as is the case > with *any* high level language - you will need plenty of optimization, > like your example demonstrates.
I still don't know what you mean by "phrase book" here.
> I bet it will will be good only in demo > cases like yours and much less useful in real life,
Nonsense. The benefits of using a higher level language and a compiler get more noticeable with larger code, as the compiler has no problem tracking register usage, instruction scheduling, etc., across large pieces of code - unlike a human. And it has no problem re-creating code in different ways when small details change in the source (such as the divide by 5 and divide by 7 examples).
> so the only benefit > of writing this in C is the source length, 10+ times the necessary (I > counted it and I included a return line in the count, 238 vs. 23 bytes).
You have this completely backwards. If I write a simple example like this, in a manner that is compilable code, then it is going to take longer in high-level source code. But that is the effect of giving that function definition. In use, writing "reverseLoad" does not take significantly more characters than "mover" - and with everything else around, the C code will be much shorter. And this was a case picked specifically to show how some long patterns in C code can be handled by a compiler to generate optimal short assembly sequences. The division example shows the opposite - in C, I write "a / 7", while in assembly you have to write 7 lines (excluding labels and blr). And the C code there is nicer in every way.
> While 10 times more typing may seem no serious issue to many 10 times > higher chance to insert an error is no laughing matter, and 10 times > more obscurity just because of that is a productivity killer.
In real code, the C source will be 10 times shorter than the assembly. And if the assembly has enough comments to make it clear, there is another order of magnitude difference.
> >>> But if the compiler does this it will be obvious enough. >> >> If you had some examples or references, it would be easier to see what >> you mean. >> >>> >>> Anyway, a function would do - if complex and long enough to >>> be close to real life, i.e. a few hundred lines. >> >> A function that is a few hundred lines of source code is /not/ real life >> - it is broken code. Surely in VLA you divide your code into functions >> of manageable size, rather than single massive functions? > > I meant "function" not the in C subroutine kind of sense, I meant it > more as "functionality", i.e. some code doing some job. How it split > into pieces etc. will depend on many factors, language, programmer > style etc., not relevant to this discussion.
OK. But again, it has to be a specific clearly defined and limited functionality. "Write a VNC server" is not a specification - that would take at least many dozens of pages of specifications, not including the details of the interfacing to the network stack, the types of library functions available, the API available to client programs that will "draw" on the server, etc.
> >>> >>> But I don't see why not compare written stuff, I just checked >>> again on that vnc server for dps - not 8k, closer to 11k (the 8k >>> I saw was a half-baked version, no keyboard tables inside it etc.; >>> the complete version also includes a screen mask to allow it >>> to ignore mouse clicks at certain areas, that sort of thing). >>> Add to it some menu (it is command line option driven only), >>> a much more complex menu than windows and android RealVNC has >>> I have and it adds up to 25k. >>> Compare this to the 350k exe for windows or to the 4M for Android >>> (and the android does only raw...) and the picture is clear enough >>> I think. >>> >> >> A VNC server is completely useless for such a test. It is far too >> complex, with far too much variation in implementation and features, too >> many external dependencies on an OS or other software (such as for >> networking), and far too big for anyone to bother with such a comparison. > > Actually I think a comparison between two pieces of code doing the same > thing is quite telling when the difference is in the orders of > magnitude, as in this case.
No, it is not. The code is not comparable in any way, and does not do the
same thing except in a very superficial sense. It's like comparing a small
car with a train - both can transport you around, but they are very
different things, each with their advantages and disadvantages.

If you want to compare your VNC server for DPS written in VPA to a VNC
server written in C, then you would need to give /exact/ specifications of
all the features of your VNC server, and exact details of how it
interfaces with everything else in the DPS system, and have someone write
a VNC server in C for DPS that follows those same specifications. That
would be no small feat - indeed, it would be totally impossible unless you
wanted to do it yourself.

The nearest existing comparison I can think of would be the eCos VNC
server, written in C. I can't say how it compares in features with your
server, but it has approximately 2100 lines of code, written in a wide
style. Since I have no idea about how interfacing with DPS compares with
interfacing with eCos (I don't know either system), I have no idea if that
is a useful comparison or not.
> Writing small benchmarking toy sort of stuff is a waste of time, I am > interested in end results. > >> >> You specifically need something /small/. > > No, something "small" is kind of kindergarten exercise again, it can > only be good enough to fool someone into believing this or that. > It is end results which count. >
Then we will all remain in ignorance about whether VPA is useful or not, in comparison to developing in C.
On Sat, 27 May 2017 21:39:36 +0200, rickman <gnuarm@gmail.com> wrote:
> Someone in another group is thinking of using a portable assembler to
> write code for an app that would be ported to a number of different
> embedded processors including custom processors in FPGAs. I'm wondering
> how useful this will be in writing code that will require few changes
> across CPU ISAs and manufacturers.
>
> I am aware that there are many aspects of porting between CPUs that is
> assembly language independent, like writing to Flash memory. I'm more
> interested in the issues involved in trying to use a universal assembler
> to write portable code in general. I'm wondering if it restricts the
> instructions you can use or if it works more like a compiler where a
> single instruction translates to multiple target instructions when there
> is no one instruction suitable.
>
> Or do I misunderstand how a portable assembler works? Does it require a
> specific assembly language source format for each target just like using
> the standard assembler for the target?
LLVM has a pretty generic intermediate assembler language, though I'm not
sure if it's meant for actually writing code in.

http://llvm.org/docs/LangRef.html#instruction-reference

Another portable assembly language is Java Bytecode, though it assumes a
32-bit machine.

--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/
On 02/06/2017 16:03, Boudewijn Dijkstra wrote:
> Op Sat, 27 May 2017 21:39:36 +0200 schreef rickman <gnuarm@gmail.com>: >> Someone in another group is thinking of using a portable assembler to >> write code for an app that would be ported to a number of different >> embedded processors including custom processors in FPGAs. I'm >> wondering how useful this will be in writing code that will require >> few changes across CPU ISAs and manufacturers. >> >> I am aware that there are many aspects of porting between CPUs that is >> assembly language independent, like writing to Flash memory. I'm more >> interested in the issues involved in trying to use a universal >> assembler to write portable code in general. I'm wondering if it >> restricts the instructions you can use or if it works more like a >> compiler where a single instruction translates to multiple target >> instructions when there is no one instruction suitable. >> >> Or do I misunderstand how a portable assembler works? Does it require >> a specific assembly language source format for each target just like >> using the standard assembler for the target? > > LLVM has a pretty generic intermediate assembler language, though I'm > not sure if it's meant for actually writing code in. > > http://llvm.org/docs/LangRef.html#instruction-reference
Interesting, but it's not obvious who the audience is. Why would anyone want to learn another language that is not in common use or aligned to any specific CPU?
> Another portable assembly language is Java Bytecode, though it assumes a > 32-bit machine.
I've been watching this thread for some time. My first impression was: why
not just write in C? So far that impression hasn't changed. Despite the
odd line of CPU-specific assembler code for those occasions that require
it, C is still perhaps the most portable code you can write?

--
Mike Perkins
Video Solutions Ltd
www.videosolutions.ltd.uk
On 05/06/17 16:39, Mike Perkins wrote:
> On 02/06/2017 16:03, Boudewijn Dijkstra wrote: >> Op Sat, 27 May 2017 21:39:36 +0200 schreef rickman <gnuarm@gmail.com>: >>> Someone in another group is thinking of using a portable assembler to >>> write code for an app that would be ported to a number of different >>> embedded processors including custom processors in FPGAs. I'm >>> wondering how useful this will be in writing code that will require >>> few changes across CPU ISAs and manufacturers. >>> >>> I am aware that there are many aspects of porting between CPUs that is >>> assembly language independent, like writing to Flash memory. I'm more >>> interested in the issues involved in trying to use a universal >>> assembler to write portable code in general. I'm wondering if it >>> restricts the instructions you can use or if it works more like a >>> compiler where a single instruction translates to multiple target >>> instructions when there is no one instruction suitable. >>> >>> Or do I misunderstand how a portable assembler works? Does it require >>> a specific assembly language source format for each target just like >>> using the standard assembler for the target? >> >> LLVM has a pretty generic intermediate assembler language, though I'm >> not sure if it's meant for actually writing code in. >> >> http://llvm.org/docs/LangRef.html#instruction-reference > > Interesting, but its not obvious who the audience is. Why would anyone > want to learn another language that is not in common use or aligned to > any specific CPU?
The LLVM "assembly" is intended as an intermediary language. Front-end
tools like clang (a C, C++ and Objective-C compiler) generate LLVM
assembly. Middle-end tools like optimisers and linkers "play" with it, and
back-end tools translate it into target-specific assembly. Each level can
do a wide variety of optimisations.

The aim is that the whole LLVM system can be more modular and more easily
ported to new architectures and new languages than a traditional
multi-language multi-target compiler (such as gcc). So LLVM assembly is
not an assembly language you would learn or code in - it's the glue
holding the whole system together.
> >> Another portable assembly language is Java Bytecode, though it assumes a >> 32-bit machine. > > I've been watching this thread for some time. My first impression was > why not just write in C? So far that impression hasn't changed. Despite > the odd line of CPU specific assembler code for those occasions that > require it, C is still perhaps the most portable code you can write? >
Well, yes - of course C is the sensible option here. Depending on the
exact type of code and the targets, Ada, C++, and Forth might also be
viable options. But since there is no such thing as "portable assembly",
it's a poor choice :-)

However, the thread has led to some interesting discussions, IMHO.
On 6/5/2017 7:39 AM, Mike Perkins wrote:
> On 02/06/2017 16:03, Boudewijn Dijkstra wrote: >> LLVM has a pretty generic intermediate assembler language, though I'm >> not sure if it's meant for actually writing code in. >> >> http://llvm.org/docs/LangRef.html#instruction-reference > > Interesting, but its not obvious who the audience is. Why would anyone want to > learn another language that is not in common use or aligned to any specific CPU?
Esperanto? :>
>> Another portable assembly language is Java Bytecode, though it assumes a >> 32-bit machine. > > I've been watching this thread for some time. My first impression was why not > just write in C? So far that impression hasn't changed. Despite the odd line of > CPU specific assembler code for those occasions that require it, C is still > perhaps the most portable code you can write?
The greater the level of abstraction in a language choice, the less
control you have over expressing the minutiae of what you want done.

When I design a new processor (typ. application specific), I code up
sample algorithms using a very low level set of abstractions... virtual
registers, virtual operators, etc. Once I'm done with a number of these, I
"eyeball" the "code" and sort out what the instructions (opcodes) should
be for the processor. I.e., invent the "assembly language". If I'd coded
these algorithms in a HIGHER level language, I'd end up implementing a
much more "complex" processor (because it would have to implement much
more capable "primitives").

C's portability problem isn't with the language, per se, as much as it is
with the "practitioners". It could benefit from much stricter type
checking and a lot fewer "undefined/implementation-defined behaviors" (cuz
it seems folks just get the code working on THEIR target and never see how
it fails to execute properly on any OTHER target!)
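A couple of classic snippets illustrate the sort of implementation-defined
and undefined behavior meant here (just an illustrative sketch; the
results genuinely differ between targets and compilers):

#include <stdio.h>

int main(void) {
    /* Implementation-defined: the width of int.  On a 32-bit target this
       prints 4; on many 8/16-bit embedded compilers it prints 2, so plain
       "int" arithmetic overflows at 32767 instead of ~2.1 billion. */
    printf("sizeof(int) = %zu\n", sizeof(int));

    /* Implementation-defined: right-shifting a negative value may be
       arithmetic or logical, so the printed result can differ. */
    int x = -16;
    printf("-16 >> 2 = %d\n", x >> 2);

    /* Undefined: signed overflow.  Code like this often "works" on the
       developer's machine, but a compiler is free to assume it never
       happens and optimise accordingly.
       int big = 0x7FFFFFFF; big = big + 1;   -- don't do this */

    return 0;
}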
On Mon, 5 Jun 2017 09:10:10 -0700, Don Y <blockedofcourse@foo.invalid>
wrote:

>C's portability problem isn't with the language, per se, as much as it is
>with the "practitioners". It could benefit from much stricter type
>checking and a lot fewer "undefined/implementation-defined behaviors"
>(cuz it seems folks just get the code working on THEIR target and
>never see how it fails to execute properly on any OTHER target!)
The argument always has been that if implementation defined behaviors are
locked down, then C would be inefficient on CPUs that don't have good
support for <whatever>.

Look at the (historical) troubles resulting from Java (initially)
requiring IEEE-754 compliance and that FP results be exactly reproducible
*both* on the same platform *and* across platforms. No FP hardware fully
implements any version of IEEE-754: every chip requires software fixups to
achieve compliance, and most fixup suites are not even complete [e.g.,
ignoring unpopular rounding modes, etc.]. Java FP code ran slower on chips
that needed more fixups, and the requirements prevented even implementing
a compliant Java on some chips despite their having FP support.

Java ultimately had to entirely back away from its reproducibility
guarantees. It now requires only best consistency - not exact
reproducibility - on the same platform. If you want reproducible results,
you have to use software floating point (BigFloat), and accept much slower
code. And by requiring consistency, it can only approximate the
performance of C code which is likewise compiled. Most C compilers allow
you to eschew FP consistency for more speed ... Java does not.

Of course, FP in general is somewhat less important to this crowd than to
other groups, and C has a lot of implementation defined behavior unrelated
to FP. But the lesson of trying to lock down hardware (and/or OS)
dependent behavior still is important.

There is no question that C could do much better type/value and
pointer/index checking, but it likely would come at the cost of far more
explicit casting (more verbose code), and likely many more runtime checks.
A more expressive type system would help [e.g., range integers, etc.], but
that would constitute a significant change to the language.

Some people point to Ada as an example of a language that can be both
"fast" and "safe", but many people (maybe not in this group, but many
nonetheless) are unaware that quite a lot of Ada's type/value checks are
done at runtime and throw exceptions if they fail. Obviously, a compiler
could provide a way to disable the automated runtime checking, and even
when enabled checks can be elided if the compiler can statically prove
that a given operation will always be safe. But even in Ada with its far
more expressive types there are many situations in which the compiler
simply can't do that.

More stringent languages like ML won't even compile if they can't
statically type check the code. In such languages, quite a lot of
programmer effort goes toward clubbing the type checker into submission.

TANSTAAFL,
George