EmbeddedRelated.com
Forums

Portable Assembly

Started by rickman May 27, 2017
On 5/31/2017 12:19 AM, Anssi Saari wrote:
> I've sometimes wondered what kind of development systems were used for
> those early 1980s home computers. Unreliable, slow and small storage
> media would've made it pretty awful to do development on target
> systems. I've read Commodore used a VAX for ROM development so they
> probably had a cross assembler there but other than that, not much idea.
You forget that those computers were typically small and ran small
applications. In the early 80's, we regularly developed products using
CP/M-hosted tools on generic Z80 machines, RIO-based tools on "Z-boxes",
motogorilla's tools on Exormacs, etc. None were much better than a 64K 8b
machine with one or two 1.4MB floppies. Even earlier, the MDS-800 systems
and ISIS-II, etc.

Your development *style* changes with the capabilities of the tools
available. E.g., in the 70's, I could turn the crank *twice* in an 8 hour
shift -- edit, assemble, link, burn ROMs, debug. So, each iteration *had*
to bring you closer to a finished product. You couldn't afford to just try
the "I wonder if THIS is the problem" game that seems so common today
("Heck, I can just try rebuilding everything and see if it NOW works...")

But, that doesn't necessarily limit you to the size of a final executable
(ever hear of overlays?) or the overall complexity of the product.
On 30/05/17 15:53, Dimiter_Popoff wrote:
> On 30.5.2017 &#1075;. 00:13, David Brown wrote: >> On 29/05/17 19:02, Dimiter_Popoff wrote: >>> On 29.5.2017 &#1075;. 16:06, David Brown wrote: >> >> <snipped some interesting stuff about VPA> >> >>> >>>> Some time it might be fun to look at some example functions, compiled >>>> for either the 68K or the PPC (or, better still, both) and compare both >>>> the source code and the generated object code to modern C and modern C >>>> compilers. (Noting that the state of C compilers has changed a great >>>> deal since you started making VLA.) >>> >>> Yes, I would also be curious to see that. Not just a function - as it >>> will likely have been written in assembly by the compiler author - but >>> some sort of standard thing, say a base64 encoder/decoder or some >>> vnc server thing etc. (the vnc server under dps is about 8 kilobytes, >>> just looked at it. Does one type of compression (RRE misused as RLE) and >>> raw). >>> >> >> To be practical, it /should/ be a function - or no more than a few >> functions. (I don't know why you think functions might be written in >> assembly by the compiler author - the compiler author is only going to >> provide compiler-assist functions such as division routines, floating >> point emulation, etc.) And it should be something that has a clear >> algorithm, so no one can "cheat" by using a better algorithm for the job. >> >> > > I am pretty sure I have seen - or read about - compiler generated > code where the compiler detects what you want to do and inserts > some assembly prewritten piece of code. Was something about CRC > or about tcp checksum, not sure - and it was someone who said that, > I don't know it from direct experience.
A compiler sees the source code you write, and generates object code that
does that job. It may be smart about it, but it will not insert
"pre-written assembly code". Code generation in compilers is usually
defined with some sort of templates (such as a pattern for reading data at
a register plus offset, or a pattern for doing a shift by a fixed size,
etc.). They are not "pre-written assembly", in that many of the details
are determined at generation time, such as registers, instruction
interleaving, etc.

The nearest you get to pre-written code from the compiler is in the
compiler support libraries. For example, if the target does not support
division instructions, or floating point, then the compiler will supply
routines as needed. These /might/ be written in assembly - but often they
are written in C.

A compiler /will/ detect patterns in your C code and use that to generate
object code rather than doing a "direct translation". The types of
patterns it can detect vary - it is one of the things that differentiates
between compilers. A classic example for the PPC would be:

#include <stdint.h>

uint32_t reverseLoad(uint32_t * p) {
    uint32_t x = *p;
    return ((x & 0xff000000) >> 24)
         | ((x & 0x00ff0000) >> 8)
         | ((x & 0x0000ff00) << 8)
         | ((x & 0x000000ff) << 24);
}

I am using gcc 4.8 here, since there is a convenient online version as
part of the <https://gcc.godbolt.org/> "compiler explorer". gcc is at 7.0
these days, and has advanced significantly since then - but that is the
version that is most convenient.

A direct translation (compiling with no optimisation) would be:

reverseLoad:
    stwu 1,-48(1)
    stw 31,44(1)
    mr 31,1
    stw 3,24(31)
    lwz 9,24(31)
    lwz 9,0(9)
    stw 9,8(31)
    lwz 9,8(31)
    srwi 10,9,24
    lwz 9,8(31)
    rlwinm 9,9,0,8,15
    srwi 9,9,8
    or 10,10,9
    lwz 9,8(31)
    rlwinm 9,9,0,16,23
    slwi 9,9,8
    or 10,10,9
    lwz 9,8(31)
    slwi 9,9,24
    or 9,10,9
    mr 3,9
    addi 11,31,48
    lwz 31,-4(11)
    mr 1,11
    blr

Gruesome, isn't it? Compiling with -O0 puts everything on the stack rather
than holding variables in registers. Code like that was used in the old
days - perhaps at the time when you decided you needed something better
than C. But even then, it was mainly only for debugging - since debugger
software was not good enough to handle variables in registers.

Next up, -O1 optimisation. This is a level where the code becomes
sensible, but not too smart - and it is not uncommon to use it in
debugging because you usually get a one-to-one correspondence between
lines in the source code and blocks of object code. It makes it easier to
do single stepping.

reverseLoad:
    lwz 9,0(3)
    slwi 3,9,24
    srwi 10,9,24
    or 3,3,10
    rlwinm 10,9,24,16,23
    or 3,3,10
    rlwinm 9,9,8,8,15
    or 3,3,9
    blr

Those who can understand the PPC's bit field instruction "rlwinm" will see
immediately that this is a straightforward translation of the source code,
but with all data held in registers.

But if we ask for smarter optimisation, with -O2, we get:

reverseLoad:
    lwbrx 3,0,3
    blr

This is, of course, optimal. (Even the function call overhead will be
eliminated if the compiler can do so when the function is used.)
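Incidentally, gcc also provides a builtin for byte swapping, so the long
shift-and-mask pattern does not even need to be written out by hand. A
shorter version (assuming a gcc-compatible compiler - __builtin_bswap32 is
an extension, not standard C) would be:

#include <stdint.h>

/* Same byte reverse, using the gcc builtin (gcc/clang extension). */
uint32_t reverseLoad_b(uint32_t * p) {
    return __builtin_bswap32(*p);
}

With -O2 this should likewise reduce to a single lwbrx on the PPC.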
> > But if the compiler does this it will be obvious enough.
If you had some examples or references, it would be easier to see what you mean.
> > Anyway, a function would do - if complex and long enough to > be close to real life, i.e. a few hundred lines.
A function that is a few hundred lines of source code is /not/ real life - it is broken code. Surely in VPA you divide your code into functions of manageable size, rather than single massive functions?
>
> But I don't see why not compare written stuff, I just checked
> again on that vnc server for dps - not 8k, closer to 11k (the 8k
> I saw was a half-baked version, no keyboard tables inside it etc.;
> the complete version also includes a screen mask to allow it
> to ignore mouse clicks at certain areas, that sort of thing).
> Add to it some menu (it is command line option driven only),
> a much more complex menu than windows and android RealVNC has
> I have and it adds up to 25k.
> Compare this to the 350k exe for windows or to the 4M for Android
> (and the android does only raw...) and the picture is clear enough
> I think.
>
A VNC server is completely useless for such a test. It is far too complex,
with far too much variation in implementation and features, too many
external dependencies on an OS or other software (such as for networking),
and far too big for anyone to bother with such a comparison.

You specifically need something /small/. The algorithm needs to be simple
and clearly expressible. Total source code lines in C should be no more
than about 100, with no more than perhaps 3 or 4 functions. Smaller than
that would be better, as it would make it easier for us to understand the
VPA and see its benefits. Here is a possible example:

// Type for the data - this can easily be changed
typedef float data_t;

static int max(int a, int b) {
    return (a > b) ? a : b;
}

static int min(int a, int b) {
    return (a < b) ? a : b;
}

// Calculate the convolution of two input arrays pointed to by
// pA and pB, placing the results in the output array pC.
void convolute(const data_t * pA, int lenA,
               const data_t * pB, int lenB,
               data_t * pC, int lenC)
{
    // i is the index of the output sample, running from 0 to lenC - 1
    // For each i, we calculate the sum as j goes from -inf to +inf
    //   of A(j) * B(i - j)
    // Clearly we can limit j to the range 0 to (lenA - 1)
    // We use k to hold i - j, which will run down as j runs up.
    // k will be limited to (lenB - 1) down to 0.
    // From (i - j) >= 0, we have j <= i
    // From (i - j) < lenB, we have j > (i - lenB)
    // These give us tighter bounds on the run of j

    for (int i = 0; i < lenC; i++) {
        int firstJ = max(0, 1 + i - lenB);
        int endJ = min(lenA, i + 1);
        data_t x = 0;
        for (int j = firstJ; j < endJ; j++) {
            int k = i - j;
            x += (pA[j] * pB[k]);
        }
        pC[i] = x;
    }
}

With gcc 4.8 for the PPC, that's about 55 lines of assembly. An
interesting point is that the size and instructions are very similar with
-O1 and -O2, but the ordering is significantly different - with -O2, the
pipeline scheduling is considered. (I don't know which particular cpu
model is used for scheduling by default in gcc.)

To be able to compare with VPA, you'd have to write this algorithm in VPA.
Then you could compare various points. It should be easy enough to look at
the size of the code. For speed comparison, we'd have to know your target
processor and compile specifically for that (to get the best scheduling,
and to handle small differences in the availability of particular
instructions). Then you would need to run the code - I don't have any PPC
boards conveniently on hand, and of course you are the only one with VPA
tools. Comparing code clarity and readability is, of course, difficult -
but you could publish your VPA version and we can maybe get an idea.
Productivity is also hard to measure. For a function like this, the time
is spent on the details of the algorithm and getting the right bounds on
the loops - the actual C code is easy.

You can get a gcc 5.2 cross-compiler for PPC for Windows from here
<http://tranaptic.ca/wordpress/downloads/>, or you can use the online
compiler at <https://gcc.godbolt.org/>. The PowerPC is not nearly as
popular an architecture as ARM, and it is harder to find free ready-built
tools (though there are plenty of guides to building them yourself, and
you can get supported commercial versions of modern gcc from
Mentor/CodeSourcery). You can also find tools directly from Freescale/NXP.
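To make a comparison concrete, a small test harness could look like this
(just a sketch of mine - the expected values are the hand-computed
convolution of the two short arrays):

#include <stdio.h>

/* Assumes the data_t typedef and convolute() definition above. */

int main(void) {
    const data_t a[] = {1, 2, 3};          /* lenA = 3 */
    const data_t b[] = {4, 5};             /* lenB = 2 */
    data_t c[4];                           /* lenC = lenA + lenB - 1 */

    convolute(a, 3, b, 2, c, 4);

    /* Expected output: 4, 13, 22, 15 */
    for (int i = 0; i < 4; i++) {
        printf("c[%d] = %g\n", i, c[i]);
    }
    return 0;
}

Something equally small written in VPA would then give a like-for-like
comparison of source size and object size.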
On 31.5.17 10:19, Anssi Saari wrote:
> David Brown <david.brown@hesbynett.no> writes:
>
>> Writing a game involves a great deal more than just the coding.
>> Usually, the coding is in fact just a small part of the whole effort -
>> all the design of the gameplay, the storyline, the graphics, the music,
>> the algorithms for interaction, etc., is inherently cross-platform. The
>> code structure and design is also mostly cross-platform. Some parts
>> (the graphics and the music) need adapted to suit the limitations of the
>> different target platforms. The final coding in assembly would be done
>> by hand for each target.
>
> I've sometimes wondered what kind of development systems were used for
> those early 1980s home computers. Unreliable, slow and small storage
> media would've made it pretty awful to do development on target
> systems. I've read Commodore used a VAX for ROM development so they
> probably had a cross assembler there but other than that, not much idea.
I used an Intel MDS and a Data General Eclipse to bootstrap a Z80-based
CP/M computer (self made). After that, the CP/M system could be used to
create the code, though the 8 inch floppies were quite small for the task.

--
-Tauno Voipio
On 31.5.2017 г. 12:36, David Brown wrote:
> On 30/05/17 15:53, Dimiter_Popoff wrote: >> On 30.5.2017 &#1075;. 00:13, David Brown wrote: >>> On 29/05/17 19:02, Dimiter_Popoff wrote: >>>> On 29.5.2017 &#1075;. 16:06, David Brown wrote: >>> >>> <snipped some interesting stuff about VPA> >>> >>>> >>>>> Some time it might be fun to look at some example functions, compiled >>>>> for either the 68K or the PPC (or, better still, both) and compare both >>>>> the source code and the generated object code to modern C and modern C >>>>> compilers. (Noting that the state of C compilers has changed a great >>>>> deal since you started making VLA.) >>>> >>>> Yes, I would also be curious to see that. Not just a function - as it >>>> will likely have been written in assembly by the compiler author - but >>>> some sort of standard thing, say a base64 encoder/decoder or some >>>> vnc server thing etc. (the vnc server under dps is about 8 kilobytes, >>>> just looked at it. Does one type of compression (RRE misused as RLE) and >>>> raw). >>>> >>> >>> To be practical, it /should/ be a function - or no more than a few >>> functions. (I don't know why you think functions might be written in >>> assembly by the compiler author - the compiler author is only going to >>> provide compiler-assist functions such as division routines, floating >>> point emulation, etc.) And it should be something that has a clear >>> algorithm, so no one can "cheat" by using a better algorithm for the job. >>> >>> >> >> I am pretty sure I have seen - or read about - compiler generated >> code where the compiler detects what you want to do and inserts >> some assembly prewritten piece of code. Was something about CRC >> or about tcp checksum, not sure - and it was someone who said that, >> I don't know it from direct experience. > > A compiler sees the source code you write, and generates object code > that does that job. It be smart about it, but it will not insert > "pre-written assembly code".
Code generation in compilers is usually
> defined with some sort of templates (such a pattern for reading data at > a register plus offset, or a pattern for doing a shift by a fixed size, > etc.). They are not "pre-written assembly", in that many of the details > are determined at generation time, such as registers, instruction > interleaving, etc. > > The nearest you get to pre-written code from the compiler is in the > compiler support libraries. For example, if the target does not support > division instructions, or floating point, then the compiler will supply > routines as needed. These /might/ be written in assembly - but often > they are written in C. > > > A compiler /will/ detect patterns in your C code and use that to > generate object code rather than doing a "direct translation". The
> types of patterns it can detect varies - it is one of the things that
> differentiates between compilers.

We are referring to the same thing under different names - again.

At the end of the day everything the compiler generates is written in
plain assembly, it must be executable by the CPU. By "prewritten" I mean
some sort of template which gets filled in with addresses etc. before
committing. Only the compiler writers themselves know to what lengths
they go to make common cases look good; my memory is vague but I do think
the guy who said that a few years ago knew what he was talking about.
>.. A classic example for the PPC would be: > > #include <stdint.h> > > uint32_t reverseLoad(uint32_t * p) { > uint32_t x = *p; > return ((x & 0xff000000) >> 24) > | ((x & 0x00ff0000) >> 8) > | ((x & 0x0000ff00) << 8) > | ((x & 0x000000ff) << 24); > } > > I am using gcc 4.8 here, since there is a convenient online version as > part of the <https://gcc.godbolt.org/> "compiler explorer". gcc is at > 7.0 these days, and has advanced significantly since then - but that is > the version that is most convenient. > > A direct translation (compiling with no optimisation) would be: > > reverseLoad: > stwu 1,-48(1) > stw 31,44(1) > mr 31,1 > stw 3,24(31) > lwz 9,24(31) > lwz 9,0(9) > stw 9,8(31) > lwz 9,8(31) > srwi 10,9,24 > lwz 9,8(31) > rlwinm 9,9,0,8,15 > srwi 9,9,8 > or 10,10,9 > lwz 9,8(31) > rlwinm 9,9,0,16,23 > slwi 9,9,8 > or 10,10,9 > lwz 9,8(31) > slwi 9,9,24 > or 9,10,9 > mr 3,9 > addi 11,31,48 > lwz 31,-4(11) > mr 1,11 > blr > > Gruesome, isn't it? Compiling with -O0 puts everything on the stack > rather than holding variables in registers. Code like that was used in > the old days - perhaps at the time when you decided you needed something > better than C. But even then, it was mainly only for debugging - since > debugger software was not good enough to handle variables in registers. > > Next up, -O1 optimisation. This is a level where the code becomes > sensible, but not too smart - and it is not uncommon to use it in > debugging because you usually get a one-to-one correspondence between > lines in the source code and blocks of object code. It makes it easier > to do single stepping. > > reverseLoad: > lwz 9,0(3) > slwi 3,9,24 > srwi 10,9,24 > or 3,3,10 > rlwinm 10,9,24,16,23 > or 3,3,10 > rlwinm 9,9,8,8,15 > or 3,3,9 > blr > > Those that can understand the PPC's bit field instruction "rlwinm" will > see immediately that this is a straightforward translation of the source > code, but with all data held in registers. > > But if we ask for smarter optimisation, with -O2, we get: > > reverseLoad: > lwbrx 3,0,3 > blr > > This is, of course, optimal. (Even the function call overhead will be > eliminated if the compiler can do so when the function is used.)
Above all this is a good example of how limiting the high level language
is. Just look at the source and then at the final result.

You will get *exactly* the same result (minus the return) with no
optimization in VPA from the line:

 mover.l (source),r3

Logic optimization is more or less a kindergarten exercise. If you need
logic optimization you don't know what you are doing anyway, so the
compiler won't be able to help much, no matter how good.

Of course if you stick by a phrase book at source level - as is the case
with *any* high level language - you will need plenty of optimization,
like your example demonstrates. I bet it will be good only in demo cases
like yours and much less useful in real life, so the only benefit of
writing this in C is the source length, 10+ times what is necessary (I
counted it, and I included a return line in the count: 238 vs. 23 bytes).

While 10 times more typing may seem no serious issue to many, a 10 times
higher chance to insert an error is no laughing matter, and 10 times more
obscurity just because of that is a productivity killer.
>> But if the compiler does this it will be obvious enough. > > If you had some examples or references, it would be easier to see what > you mean. > >> >> Anyway, a function would do - if complex and long enough to >> be close to real life, i.e. a few hundred lines. > > A function that is a few hundred lines of source code is /not/ real life > - it is broken code. Surely in VLA you divide your code into functions > of manageable size, rather than single massive functions?
I meant "function" not in the C subroutine kind of sense; I meant it more
as "functionality", i.e. some code doing some job. How it is split into
pieces etc. will depend on many factors - language, programmer style
etc. - not relevant to this discussion.
>> >> But I don't see why not compare written stuff, I just checked >> again on that vnc server for dps - not 8k, closer to 11k (the 8k >> I saw was a half-baked version, no keyboard tables inside it etc.; >> the complete version also includes a screen mask to allow it >> to ignore mouse clicks at certain areas, that sort of thing). >> Add to it some menu (it is command line option driven only), >> a much more complex menu than windows and android RealVNC has >> I have and it adds up to 25k. >> Compare this to the 350k exe for windows or to the 4M for Android >> (and the android does only raw...) and the picture is clear enough >> I think. >> > > A VNC server is completely useless for such a test. It is far too > complex, with far too much variation in implementation and features, too > many external dependencies on an OS or other software (such as for > networking), and far too big for anyone to bother with such a comparison.
Actually I think a comparison between two pieces of code doing the same
thing is quite telling when the difference is of orders of magnitude, as
in this case. Writing small benchmarking-toy sort of stuff is a waste of
time; I am interested in end results.
> > You specifically need something /small/.
No, something "small" is a kind of kindergarten exercise again, it can
only be good enough to fool someone into believing this or that.
It is end results which count.

Dimiter

======================================================
Dimiter Popoff, TGI             http://www.tgi-sci.com
======================================================
http://www.flickr.com/photos/didi_tgi/
On 01/06/17 21:43, Dimiter_Popoff wrote:
> On 31.5.2017 &#1075;. 12:36, David Brown wrote: >> On 30/05/17 15:53, Dimiter_Popoff wrote: >>> On 30.5.2017 &#1075;. 00:13, David Brown wrote: >>>> On 29/05/17 19:02, Dimiter_Popoff wrote: >>>>> On 29.5.2017 &#1075;. 16:06, David Brown wrote: >>>> >>>> <snipped some interesting stuff about VPA> >>>> >>>>> >>>>>> Some time it might be fun to look at some example functions, compiled >>>>>> for either the 68K or the PPC (or, better still, both) and compare >>>>>> both >>>>>> the source code and the generated object code to modern C and >>>>>> modern C >>>>>> compilers. (Noting that the state of C compilers has changed a great >>>>>> deal since you started making VLA.) >>>>> >>>>> Yes, I would also be curious to see that. Not just a function - as it >>>>> will likely have been written in assembly by the compiler author - but >>>>> some sort of standard thing, say a base64 encoder/decoder or some >>>>> vnc server thing etc. (the vnc server under dps is about 8 kilobytes, >>>>> just looked at it. Does one type of compression (RRE misused as >>>>> RLE) and >>>>> raw). >>>>> >>>> >>>> To be practical, it /should/ be a function - or no more than a few >>>> functions. (I don't know why you think functions might be written in >>>> assembly by the compiler author - the compiler author is only going to >>>> provide compiler-assist functions such as division routines, floating >>>> point emulation, etc.) And it should be something that has a clear >>>> algorithm, so no one can "cheat" by using a better algorithm for the >>>> job. >>>> >>>> >>> >>> I am pretty sure I have seen - or read about - compiler generated >>> code where the compiler detects what you want to do and inserts >>> some assembly prewritten piece of code. Was something about CRC >>> or about tcp checksum, not sure - and it was someone who said that, >>> I don't know it from direct experience. >> >> A compiler sees the source code you write, and generates object code >> that does that job. It be smart about it, but it will not insert >> "pre-written assembly code". > Code generation in compilers is usually >> defined with some sort of templates (such a pattern for reading data at >> a register plus offset, or a pattern for doing a shift by a fixed size, >> etc.). They are not "pre-written assembly", in that many of the details >> are determined at generation time, such as registers, instruction >> interleaving, etc. >> >> The nearest you get to pre-written code from the compiler is in the >> compiler support libraries. For example, if the target does not support >> division instructions, or floating point, then the compiler will supply >> routines as needed. These /might/ be written in assembly - but often >> they are written in C. >> >> >> A compiler /will/ detect patterns in your C code and use that to >> generate object code rather than doing a "direct translation". The >> types of patterns it can detect varies - it is one of the things that >> differentiates between compilers. > > We are referring to the same thing under different names - again.
OK. I think your naming and description is odd, but I am glad to see we are getting a better understanding of what the other is saying.
> At the end of the day everything the compiler generates is written > in plain assembly, it must be executable by the CPU. > Under "prewritten" I mean some sort of template which gets filled > with addresses etc. thing before committing.
I think of "prewritten" as referring to larger chunks of assembly code,
with much more concrete choices of values, registers, scheduling, etc. You
described the "prewritten" code as being easily recognisable - in reality,
the majority of the code from modern compilers is generated from very
small templates with great variability. And on a processor like the PPC,
these will be intertwined with each other according to the best scheduling
for the chip.

As an example, if we have the function:

int foo0(int * p) {
    int a = *p * *p;
    return a;
}

The template for reading "*p" generates

    lwz 3,0(3)

(Register r3 is used for the first parameter in the PPC eabi. It is also
used for the return value from a function, which is why it may seem "over
used" in the examples here. In bigger code, and when the compiler can
inline functions, it will be more flexible about register choices. I don't
know whether you follow the standard PPC eabi in your tools.)

Multiplication is another template:

    mullw 3,3,3

As is function exit, in this case just:

    blr

I find it very strange to consider these as "pre-written assembly". And if
the function is more complex, the intertwining causes more mixups, making
it less "pre-written":

int foo1(int * p, int * q) {
    int a = *p * *p;
    int b = *q * *q;
    return a + b;
}

foo1:
    lwz 9,0(3)
    lwz 10,0(4)
    mullw 9,9,9
    mullw 3,10,10
    add 3,9,3
    blr
> To what lengths the compiler writers go to make common cases look > good know only the writers themselves, my memory is vague but I > do think the guy who said that a few years ago knew what he was > talking about.
Well, it is known to the compiler writers and to users who look at the generated code! Certainly there is plenty of variation between tools, with more advanced compilers working harder at this sort of thing. Command line switches with choices of optimisation levels can also make a big difference. How much experience do you have of using C compilers, and studying their output?
>
>> .. A classic example for the PPC would be:
>>
>> #include <stdint.h>
>>
>> uint32_t reverseLoad(uint32_t * p) {
>>     uint32_t x = *p;
>>     return ((x & 0xff000000) >> 24)
>>          | ((x & 0x00ff0000) >> 8)
>>          | ((x & 0x0000ff00) << 8)
>>          | ((x & 0x000000ff) << 24);
>> }
>>
<skipping the details>
>> But if we ask for smarter optimisation, with -O2, we get:
>>
>> reverseLoad:
>>     lwbrx 3,0,3
>>     blr
>>
>> This is, of course, optimal. (Even the function call overhead will be
>> eliminated if the compiler can do so when the function is used.)
>
> Above all this is a good example how limiting the high level language
> is. Just look at the source and then at the final result.
No, that is a good example of how smart the compiler is (or can be) about
generating optimal code from the source.

You may in addition view this as a limitation of the C language, which has
no direct way to specify a "byte reversed pointer". That is fair enough.
However, it is not really any harder than defining a function like this,
and then using it. For situations where the compiler can't generate ideal
code, and it is particularly useful to get such optimal assembly, it is
also possible to write a simple little inline assembly function - it is
not really any harder than writing the same thing in "normal" assembly.

Another option (for newer gcc) is to define the endianness of a struct.
Then you can access the fields directly, and the loads and stores will be
reversed as needed.

typedef struct __attribute__((scalar_storage_order ("little-endian"))) {
    uint32_t x;
} le32_t;

uint32_t reverseLoad2(le32_t * p) {
    return p->x;
}

reverseLoad2:
    lwbrx 3,0,3
    blr

So the high level language gives you a number of options, with specific
tools giving more options, and the implementation gives you efficient
object code in the end. You might need to define a function or macro
yourself, but that is a one-time job.
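For reference, the inline assembly route mentioned above could look
roughly like this with gcc (my sketch - the exact constraints may need
adjusting for your gcc version and ABI):

#include <stdint.h>

/* Byte-reversing load via gcc inline assembly (sketch, assumes gcc on PPC).
   "b" asks for a base register other than r0; the "m"(*p) input tells the
   compiler that the pointed-to memory is read. */
static inline uint32_t reverseLoad3(const uint32_t * p) {
    uint32_t x;
    __asm__("lwbrx %0,0,%1" : "=r"(x) : "b"(p), "m"(*p));
    return x;
}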
> > You will get *exactly* the same result (- the return) with no > optimization in vpa from the line: > > mover.l (source),r3
When you say "no optimisation" here, does that mean that VPA supports some kinds of optimisations?
> > Logic optimization is more or less a kindergarten exercise. If you need > logic optimization you don't know what you are doing anyway so the > compiler won't be able to help much, no matter how good.
What do you mean by "logic optimisation"?

It is normal for a good compiler to do a variety of strength reduction and
other re-arrangements of code to give you something with the same result,
but more efficient execution. And it is a /good/ thing that the compiler
does that - it means you can write your source code in the clearest and
most maintainable fashion, and let the compiler generate better code.

For example, if you have a simple division by a constant:

uint32_t divX(uint32_t a) {
    return a / 5;
}

The direct translation of this would be:

divX:
    li 4,5
    divwu 3,3,4
    blr

But a compiler can do better:

divX:   // divide by 5
    lis 9,0xcccc
    ori 9,9,52429
    mulhwu 3,3,9
    srwi 3,3,2
    blr

Such optimisation is certainly not a "kindergarten exercise", and doing it
by hand is hardly a maintainable or flexible solution. Changing the
denominator to 7 means significant changes:

divX:   // divide by 7
    lis 9,0x2492
    ori 9,9,18725
    mulhwu 9,3,9
    subf 3,9,3
    srwi 3,3,1
    add 3,9,3
    srwi 3,3,2
    blr
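The arithmetic behind that divide-by-5 sequence can be written out and
checked in plain C - this is just an illustrative sketch of the trick, not
compiler output:

#include <stdint.h>
#include <assert.h>

/* Divide-by-5 via multiply-high: floor(a/5) == (a * 0xCCCCCCCD) >> 34
   for all 32-bit a.  0xCCCCCCCD is ceil(2^34 / 5); the compiler picks
   such constants automatically. */
static uint32_t div5(uint32_t a) {
    return (uint32_t)(((uint64_t)a * 0xCCCCCCCDu) >> 34);
}

int main(void) {
    /* Spot-check a few values (an exhaustive loop over all 2^32 inputs
       also passes, it just takes a while). */
    uint32_t tests[] = { 0, 1, 4, 5, 6, 12345u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        assert(div5(tests[i]) == tests[i] / 5);
    }
    return 0;
}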
> > Of course if you stick by a phrase book at source level - as is the case > with *any* high level language - you will need plenty of optimization, > like your example demonstrates.
I still don't know what you mean by "phrase book" here.
> I bet it will will be good only in demo > cases like yours and much less useful in real life,
Nonsense. The benefits of using a higher level language and a compiler get more noticeable with larger code, as the compiler has no problem tracking register usage, instruction scheduling, etc., across large pieces of code - unlike a human. And it has no problem re-creating code in different ways when small details change in the source (such as the divide by 5 and divide by 7 examples).
> so the only benefit > of writing this in C is the source length, 10+ times the necessary (I > counted it and I included a return line in the count, 238 vs. 23 bytes).
You have this completely backwards. If I write a simple example like this, in a manner that is compilable code, then it is going to take longer in high-level source code. But that is the effect of giving that function definition. In use, writing "reverseLoad" does not take significantly more characters than "mover" - and with everything else around, the C code will be much shorter. And this was a case picked specifically to show how some long patterns in C code can be handled by a compiler to generate optimal short assembly sequences. The division example shows the opposite - in C, I write "a / 7", while in assembly you have to write 7 lines (excluding labels and blr). And the C code there is nicer in every way.
> While 10 times more typing may seem no serious issue to many 10 times > higher chance to insert an error is no laughing matter, and 10 times > more obscurity just because of that is a productivity killer.
In real code, the C source will be 10 times shorter than the assembly. And if the assembly has enough comments to make it clear, there is another order of magnitude difference.
> >>> But if the compiler does this it will be obvious enough. >> >> If you had some examples or references, it would be easier to see what >> you mean. >> >>> >>> Anyway, a function would do - if complex and long enough to >>> be close to real life, i.e. a few hundred lines. >> >> A function that is a few hundred lines of source code is /not/ real life >> - it is broken code. Surely in VLA you divide your code into functions >> of manageable size, rather than single massive functions? > > I meant "function" not the in C subroutine kind of sense, I meant it > more as "functionality", i.e. some code doing some job. How it split > into pieces etc. will depend on many factors, language, programmer > style etc., not relevant to this discussion.
OK. But again, it has to be a specific clearly defined and limited functionality. "Write a VNC server" is not a specification - that would take at least many dozens of pages of specifications, not including the details of the interfacing to the network stack, the types of library functions available, the API available to client programs that will "draw" on the server, etc.
> >>> >>> But I don't see why not compare written stuff, I just checked >>> again on that vnc server for dps - not 8k, closer to 11k (the 8k >>> I saw was a half-baked version, no keyboard tables inside it etc.; >>> the complete version also includes a screen mask to allow it >>> to ignore mouse clicks at certain areas, that sort of thing). >>> Add to it some menu (it is command line option driven only), >>> a much more complex menu than windows and android RealVNC has >>> I have and it adds up to 25k. >>> Compare this to the 350k exe for windows or to the 4M for Android >>> (and the android does only raw...) and the picture is clear enough >>> I think. >>> >> >> A VNC server is completely useless for such a test. It is far too >> complex, with far too much variation in implementation and features, too >> many external dependencies on an OS or other software (such as for >> networking), and far too big for anyone to bother with such a comparison. > > Actually I think a comparison between two pieces of code doing the same > thing is quite telling when the difference is in the orders of > magnitude, as in this case.
No, it is not. The code is not comparable in any way, and does not do the
same thing except in a very superficial sense. It's like comparing a small
car with a train - both can transport you around, but they are very
different things, each with their advantages and disadvantages.

If you want to compare your VNC server for DPS written in VPA to a VNC
server written in C, then you would need to give /exact/ specifications of
all the features of your VNC server, and exact details of how it
interfaces with everything else in the DPS system, and have someone write
a VNC server in C for DPS that follows those same specifications. That
would be no small feat - indeed, it would be totally impossible unless you
wanted to do it yourself.

The nearest existing comparison I can think of would be the eCos VNC
server, written in C. I can't say how it compares in features with your
server, but it has approximately 2100 lines of code, written in a wide
style. Since I have no idea about how interfacing with DPS compares with
interfacing with eCos (I don't know either system), I have no idea if that
is a useful comparison or not.
> Writing small benchmarking toy sort of stuff is a waste of time, I am > interested in end results. > >> >> You specifically need something /small/. > > No, something "small" is kind of kindergarten exercise again, it can > only be good enough to fool someone into believing this or that. > It is end results which count. >
Then we will all remain in ignorance about whether VPA is useful or not, in comparison to developing in C.
On Sat, 27 May 2017 21:39:36 +0200, rickman <gnuarm@gmail.com> wrote:
> Someone in another group is thinking of using a portable assembler to
> write code for an app that would be ported to a number of different
> embedded processors including custom processors in FPGAs. I'm wondering
> how useful this will be in writing code that will require few changes
> across CPU ISAs and manufacturers.
>
> I am aware that there are many aspects of porting between CPUs that is
> assembly language independent, like writing to Flash memory. I'm more
> interested in the issues involved in trying to use a universal assembler
> to write portable code in general. I'm wondering if it restricts the
> instructions you can use or if it works more like a compiler where a
> single instruction translates to multiple target instructions when there
> is no one instruction suitable.
>
> Or do I misunderstand how a portable assembler works? Does it require a
> specific assembly language source format for each target just like using
> the standard assembler for the target?
LLVM has a pretty generic intermediate assembler language, though I'm not
sure if it's meant for actually writing code in.

http://llvm.org/docs/LangRef.html#instruction-reference

Another portable assembly language is Java Bytecode, though it assumes a
32-bit machine.

--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/
On 02/06/2017 16:03, Boudewijn Dijkstra wrote:
> Op Sat, 27 May 2017 21:39:36 +0200 schreef rickman <gnuarm@gmail.com>: >> Someone in another group is thinking of using a portable assembler to >> write code for an app that would be ported to a number of different >> embedded processors including custom processors in FPGAs. I'm >> wondering how useful this will be in writing code that will require >> few changes across CPU ISAs and manufacturers. >> >> I am aware that there are many aspects of porting between CPUs that is >> assembly language independent, like writing to Flash memory. I'm more >> interested in the issues involved in trying to use a universal >> assembler to write portable code in general. I'm wondering if it >> restricts the instructions you can use or if it works more like a >> compiler where a single instruction translates to multiple target >> instructions when there is no one instruction suitable. >> >> Or do I misunderstand how a portable assembler works? Does it require >> a specific assembly language source format for each target just like >> using the standard assembler for the target? > > LLVM has a pretty generic intermediate assembler language, though I'm > not sure if it's meant for actually writing code in. > > http://llvm.org/docs/LangRef.html#instruction-reference
Interesting, but it's not obvious who the audience is. Why would anyone want to learn another language that is not in common use or aligned to any specific CPU?
> Another portable assembly language is Java Bytecode, though it assumes a > 32-bit machine.
I've been watching this thread for some time. My first impression was: why
not just write in C? So far that impression hasn't changed. Despite the
odd line of CPU-specific assembler code for those occasions that require
it, C is still perhaps the most portable code you can write?

--
Mike Perkins
Video Solutions Ltd
www.videosolutions.ltd.uk
On 05/06/17 16:39, Mike Perkins wrote:
> On 02/06/2017 16:03, Boudewijn Dijkstra wrote: >> Op Sat, 27 May 2017 21:39:36 +0200 schreef rickman <gnuarm@gmail.com>: >>> Someone in another group is thinking of using a portable assembler to >>> write code for an app that would be ported to a number of different >>> embedded processors including custom processors in FPGAs. I'm >>> wondering how useful this will be in writing code that will require >>> few changes across CPU ISAs and manufacturers. >>> >>> I am aware that there are many aspects of porting between CPUs that is >>> assembly language independent, like writing to Flash memory. I'm more >>> interested in the issues involved in trying to use a universal >>> assembler to write portable code in general. I'm wondering if it >>> restricts the instructions you can use or if it works more like a >>> compiler where a single instruction translates to multiple target >>> instructions when there is no one instruction suitable. >>> >>> Or do I misunderstand how a portable assembler works? Does it require >>> a specific assembly language source format for each target just like >>> using the standard assembler for the target? >> >> LLVM has a pretty generic intermediate assembler language, though I'm >> not sure if it's meant for actually writing code in. >> >> http://llvm.org/docs/LangRef.html#instruction-reference > > Interesting, but its not obvious who the audience is. Why would anyone > want to learn another language that is not in common use or aligned to > any specific CPU?
The LLVM "assembly" is intended as an intermediary language. Front-end
tools like clang (a C, C++ and Objective-C compiler) generate LLVM
assembly. Middle-end tools like optimisers and linkers "play" with it, and
back-end tools translate it into target-specific assembly. Each level can
do a wide variety of optimisations.

The aim is that the whole LLVM system can be more modular and more easily
ported to new architectures and new languages than a traditional
multi-language multi-target compiler (such as gcc). So LLVM assembly is
not an assembly language you would learn or code in - it's the glue
holding the whole system together.
> >> Another portable assembly language is Java Bytecode, though it assumes a >> 32-bit machine. > > I've been watching this thread for some time. My first impression was > why not just write in C? So far that impression hasn't changed. Despite > the odd line of CPU specific assembler code for those occasions that > require it, C is still perhaps the most portable code you can write? >
Well, yes - of course C is the sensible option here. Depending on the
exact type of code and the targets, Ada, C++, and Forth might also be
viable options. But since there is no such thing as "portable assembly",
it's a poor choice :-)

However, the thread has led to some interesting discussions, IMHO.
On 6/5/2017 7:39 AM, Mike Perkins wrote:
> On 02/06/2017 16:03, Boudewijn Dijkstra wrote: >> LLVM has a pretty generic intermediate assembler language, though I'm >> not sure if it's meant for actually writing code in. >> >> http://llvm.org/docs/LangRef.html#instruction-reference > > Interesting, but its not obvious who the audience is. Why would anyone want to > learn another language that is not in common use or aligned to any specific CPU?
Esperanto? :>
>> Another portable assembly language is Java Bytecode, though it assumes a >> 32-bit machine. > > I've been watching this thread for some time. My first impression was why not > just write in C? So far that impression hasn't changed. Despite the odd line of > CPU specific assembler code for those occasions that require it, C is still > perhaps the most portable code you can write?
The greater the level of abstraction in a language choice, the less
control you have over expressing the minutiae of what you want done.

When I design a new processor (typ. application specific), I code up
sample algorithms using a very low level set of abstractions... virtual
registers, virtual operators, etc. Once I'm done with a number of these, I
"eyeball" the "code" and sort out what the instructions (opcodes) should
be for the processor. I.e., invent the "assembly language". If I'd coded
these algorithms in a HIGHER level language, I'd end up implementing a
much more "complex" processor (because it would have to implement much
more capable "primitives").

C's portability problem isn't with the language, per se, as much as it is
with the "practitioners". It could benefit from much stricter type
checking and a lot fewer "undefined/implementation-defined behaviors" (cuz
it seems folks just get the code working on THEIR target and never see how
it fails to execute properly on any OTHER target!)
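A couple of classic snippets illustrate the sort of implementation-defined
and undefined behavior meant here (just an illustrative sketch; the
results genuinely differ between targets and compilers):

#include <stdio.h>

int main(void) {
    /* Implementation-defined: the width of int.  On a 32-bit target this
       prints 4; on many 8/16-bit embedded compilers it prints 2, so plain
       "int" arithmetic overflows at 32767 instead of ~2.1 billion. */
    printf("sizeof(int) = %zu\n", sizeof(int));

    /* Implementation-defined: right-shifting a negative value may be
       arithmetic or logical, so the printed result can differ. */
    int x = -16;
    printf("-16 >> 2 = %d\n", x >> 2);

    /* Undefined: signed overflow.  Code like this often "works" on the
       developer's machine, but a compiler is free to assume it never
       happens and optimise accordingly.
       int big = 0x7FFFFFFF; big = big + 1;   -- don't do this */

    return 0;
}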
On Mon, 5 Jun 2017 09:10:10 -0700, Don Y <blockedofcourse@foo.invalid>
wrote:

>C's portability problem isn't with the language, per se, as much as it is
>with the "practitioners". It could benefit from much stricter type
>checking and a lot fewer "undefined/implementation-defined behaviors"
>(cuz it seems folks just get the code working on THEIR target and
>never see how it fails to execute properly on any OTHER target!)
The argument always has been that if implementation defined behaviors are
locked down, then C would be inefficient on CPUs that don't have good
support for <whatever>.

Look at the (historical) troubles resulting from Java (initially)
requiring IEEE-754 compliance and that FP results be exactly reproducible
*both* on the same platform *and* across platforms. No FP hardware fully
implements any version of IEEE-754: every chip requires software fixups to
achieve compliance, and most fixup suites are not even complete [e.g.,
ignoring unpopular rounding modes, etc.]. Java FP code ran slower on chips
that needed more fixups, and the requirements prevented even implementing
a compliant Java on some chips despite their having FP support.

Java ultimately had to entirely back away from its reproducibility
guarantees. It now requires only best consistency - not exact
reproducibility - on the same platform. If you want reproducible results,
you have to use software floating point (BigFloat), and accept much slower
code. And by requiring consistency, it can only approximate the
performance of C code which is likewise compiled. Most C compilers allow
you to eschew FP consistency for more speed ... Java does not.

Of course, FP in general is somewhat less important to this crowd than to
other groups, and C has a lot of implementation defined behavior unrelated
to FP. But the lesson of trying to lock down hardware (and/or OS)
dependent behavior still is important.

There is no question that C could do much better type/value and
pointer/index checking, but it likely would come at the cost of far more
explicit casting (more verbose code), and likely many more runtime checks.
A more expressive type system would help [e.g., range integers, etc.], but
that would constitute a significant change to the language.

Some people point to Ada as an example of a language that can be both
"fast" and "safe", but many people (maybe not in this group, but many
nonetheless) are unaware that quite a lot of Ada's type/value checks are
done at runtime and throw exceptions if they fail. Obviously, a compiler
could provide a way to disable the automated runtime checking, and even
when enabled checks can be elided if the compiler can statically prove
that a given operation will always be safe. But even in Ada with its far
more expressive types there are many situations in which the compiler
simply can't do that.

More stringent languages like ML won't even compile if they can't
statically type check the code. In such languages, quite a lot of
programmer effort goes toward clubbing the type checker into submission.

TANSTAAFL,
George