On Tue, 04 Feb 2014 10:37:26 +0100, David Brown wrote:

> On 04/02/14 10:15, Tauno Voipio wrote:
>> On 3.2.14 23:37, David Brown wrote:
>>>
>>> I have only 30 years of assembly experience, but I have the same
>>> attitude.  The gcc inline assembly is so well integrated with the
>>> compiler that if you need to use it for a particular odd instruction,
>>> it can happily optimise the rest of the code around it.
>>>
>>> I still think it is very important to be able to /understand/ the
>>> assembly generated by the compiler, although it can be hard with
>>> complicated RISC cpus with lots of registers.  But sometimes for
>>> critical code it is good to look closely at the assembly to see what
>>> is happening, and it can affect the way you write the C code
>>> (especially for less powerful processors).
>> 
>> 
>> That is why I always make the compiler (or actually the toolkit)
>> generate an assembly listing of a compilation. Just add
>> 
>>   -Wa,-ahlms=my_filename.lst
>> 
>> to the GCC command line.
>> 
>> 
> I have the same thing in every Makefile (except that I put the lst files
> in a different directory).  It is highly recommended.

I do that too.  It regularly saves my ass, sometimes by forcing me to 
realize that yes, the compiler just did exactly what I told it to.

-- 

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

On 04/02/14 10:15, Tauno Voipio wrote:
> On 3.2.14 23:37, David Brown wrote:
>>
>> I have only 30 years of assembly experience, but I have the same
>> attitude.  The gcc inline assembly is so well integrated with the
>> compiler that if you need to use it for a particular odd instruction, it
>> can happily optimise the rest of the code around it.
>>
>> I still think it is very important to be able to /understand/ the
>> assembly generated by the compiler, although it can be hard with
>> complicated RISC cpus with lots of registers.  But sometimes for
>> critical code it is good to look closely at the assembly to see what is
>> happening, and it can affect the way you write the C code (especially
>> for less powerful processors).
> 
> 
> That is why I always make the compiler (or actually the toolkit)
> generate an assembly listing of a compilation. Just add
> 
>   -Wa,-ahlms=my_filename.lst
> 
> to the GCC command line.
> 

I have the same thing in every Makefile (except that I put the lst files
in a different directory).  It is highly recommended.

On 3.2.14 23:37, David Brown wrote:
>
> I have only 30 years of assembly experience, but I have the same
> attitude.  The gcc inline assembly is so well integrated with the
> compiler that if you need to use it for a particular odd instruction, it
> can happily optimise the rest of the code around it.
>
> I still think it is very important to be able to /understand/ the
> assembly generated by the compiler, although it can be hard with
> complicated RISC cpus with lots of registers.  But sometimes for
> critical code it is good to look closely at the assembly to see what is
> happening, and it can affect the way you write the C code (especially
> for less powerful processors).


That is why I always make the compiler (or actually the toolkit)
generate an assembly listing of a compilation. Just add

   -Wa,-ahlms=my_filename.lst

to the GCC command line.
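
For reference, a minimal sketch of what that looks like in practice. The file name mac.c, the function mac() and the output name mac.lst are made up for illustration, and the arm-none-eabi- prefix is only an assumption about the toolchain. The listing flags are the ones given above; -g is added because the high-level source part of the listing (the "h" in -ahlms) needs debug info to interleave the C lines.

```c
/* mac.c: hypothetical example file, used only to show the listing workflow.
 *
 * Compile with something like (toolchain prefix is an assumption):
 *
 *   arm-none-eabi-gcc -c -g -Os -Wa,-ahlms=mac.lst mac.c
 *
 * mac.lst then holds the generated assembly interleaved with this source.
 */
#include <stdint.h>

/* Multiply-accumulate: a handy thing to look for in the listing. */
int32_t mac(int32_t acc, int32_t a, int32_t b)
{
    return acc + a * b;
}
```

Whether the listing shows a single multiply-accumulate instruction or a separate multiply and add depends on the target and the optimization level, which is exactly the kind of thing the listing is there to tell you.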

-- 

-TV

On Monday, February 3, 2014 9:58:43 PM UTC+2, upsid...@downunder.com wrote:
> ...
> I once encountered a web page about implementing the memcpy() with
> Pentium processors (apparently assuming virtual memory page and/or
> cache line alignment). Apparently quite high speeds could be achieved
> by first loading as much as possible into the floating point/MMS
> registers available, before storing the data to the destination. 

Don't know about MMS (is that x86?) but on power (603e based flavour
at least) this is definitely the case. Some years (10?) ago when I was
optimizing the window scroll code for DPS this was the fastest of all
(and I did try them all I think). Read 32 64-bit FP registers, then write them.

> One other trick was "touching" every 32 byte cache line and hence
> loading the Src data into cache and then perform an actual fast copy.

That did help, too; I am not sure whether I left it in because the help
was not that huge and the "touch" buffer is core specific so I may
have opted out of using it but it did help all right. I have used that
touch elsewhere on that core since though, was pretty useful for
DSP-ing - and it mattered there as the code could end up using 75+%
of the cpu resources so every relief was welcome.
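
A rough sketch of the "touch" idea in C, for GCC-compatible compilers. It uses the __builtin_prefetch builtin rather than any core-specific touch instruction, so it is not the code described above; the function name sum_with_touch and the 32-byte line size (taken from the quoted post) are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u   /* assumed line size, as in the quoted post */

/* Sum a buffer while touching the next cache line ahead of its use. */
uint32_t sum_with_touch(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* At the start of each line, hint that the following line will be
         * needed soon. The hint is skipped near the end of the buffer so
         * the address stays inside it. */
        if (i % CACHE_LINE == 0 && i + CACHE_LINE < n) {
            __builtin_prefetch(p + i + CACHE_LINE);
        }
#endif
        sum += p[i];
    }
    return sum;
}
```

The prefetch is only a hint: on targets without a prefetch instruction the compiler generates nothing for it, and whether it helps at all has to be measured on the actual core.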

> Anyway, for fast data transfers you really have to consider data
> alignment, dynamic memory, cache lines, processor pipelines,
> instruction reordering etc.

Alignment matters a lot indeed, one has to do the bulk of the transfer
as aligned as it can get and start/finish it up by handling the
few misaligned bytes.
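
The head / aligned bulk / tail split can be sketched in C as follows. This is only an illustration of the structure, not a replacement for the library memcpy(); the name copy_aligned_bulk and the choice of 32-bit words are assumptions. The word moves go through memcpy() of a fixed size, which compilers normally turn into a single load and store, and which keeps the code valid when the source is not aligned the same way as the destination.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes: bytes until dst is word aligned, then whole words,
 * then the remaining bytes. Regions must not overlap. */
void *copy_aligned_bulk(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: byte copies until the destination is word aligned. */
    while (n > 0 && ((uintptr_t)d % sizeof(uint32_t)) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Bulk: one word at a time, destination aligned. */
    while (n >= sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, s, sizeof w);   /* source may still be misaligned */
        memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }

    /* Tail: the few bytes left over. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```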

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/

On 03/02/14 20:33, Tauno Voipio wrote:
> On 3.2.14 20:35, Tim Wescott wrote:
>> On Mon, 03 Feb 2014 16:14:08 +0100, David Brown wrote:
>>
>>> On 27/01/14 19:26, rxjwg98@gmail.com wrote:
>>>> Hi,
>>>>
>>>> I am learning ARM Cortex-A8 CPU. In order to write optimized assembly
>>>> code, I want to know the instruction scheduling. From A8 TRM, it gives
>>>> the following Table 16-4. I don't know how to use the cycle count, and
>>>> its relationship with source and destination register.
>>>>
>>>> If cycle count is independent, the question is how to use it in
>>>> scheduling.
>>>>
>>>> If cycle count is relevant to the source and destination register, I
>>>> cannot get the cycle number from the pipeline stages from below source
>>>> and destination registers.
>>>>
>>>> Could you explain it to me, experts?
>>>
>>>
>>> My first question here is why are you doing this?  There is a great deal
>>> more involved in performance than cycle counts for instructions -
>>> pipelines, instruction scheduling, caches, prefetches, write buffers,
>>> etc.  Your most important tools here are not the manual, but your system
>>> itself - measure the real-world speed for the particular algorithm you
>>> want to use.
>>>
>>> And why are you writing assembly here?  Have you tried using a
>>> reasonable compiler, with different flag settings and different details
>>> in the source code, and found that the code is too slow for your needs?
>>>
>>> Of course, if you are just doing this for learning or for fun, it's a
>>> different matter - but if you are working on a real application then you
>>> are starting from the wrong end.
>>
>> Just to add to what David is saying: hand-writing assembly code for high
>> speed used to make a lot of sense, in the days when compilers did not
>> optimize very well.
>>
>> That's not the case any more.  Unless you have some oddball corner-case,
>> such as a compiler that does not know how to efficiently use some
>> instruction or set of instructions on the processor, there's no point in
>> doing things in assembly.
>>
>> The last time I wrote assembly and got significant code speed improvement
>> was for a TMS320F2812 DSP processor, because Code Composter couldn't
>> seem to cough up a one-cycle-per-multiply loop using hardware looping and
>> the MAC instruction.  That was well over 10 years ago, and the only
>> reason I did the code writing was because I wanted to do something that
>> didn't quite fit with the library code that TI provided.
>>
>> I may do it again, if I find that the gnu compiler can't figure out how
>> to efficiently use the MAC instructions in the Cortex M4 core -- time
>> will tell, but I'm figuring I have even odds of giving up on the compiler
>> and doing things in assembly (and you can bet that I'll ask here to see
>> if there's some magic, if it doesn't work for me right off).
>
>
> Rest assured, at least when optimizing for size (-Os), the GNU
> compiler (v 4.7.x) does use the MAC instructions. There is little
> to be gained (and plenty to lose) with hand coding for an ARM.
>
> The current RISC-based cores are a PITA to program in assembly
> language, and it is not intended, either. It is up to the compiler
> writers to handle the intricacies of the instruction set.
>
> Despite nearly 50 years of assembler programming, I have left
> the assembly code to GCC, with few exceptions which can be handled
> with the embedded assembly code handling of GCC.
>

I have only 30 years of assembly experience, but I have the same 
attitude.  The gcc inline assembly is so well integrated with the 
compiler that if you need to use it for a particular odd instruction, it 
can happily optimise the rest of the code around it.

I still think it is very important to be able to /understand/ the 
assembly generated by the compiler, although it can be hard with 
complicated RISC cpus with lots of registers.  But sometimes for 
critical code it is good to look closely at the assembly to see what is 
happening, and it can affect the way you write the C code (especially 
for less powerful processors).
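
As a small sketch of that kind of use: one odd instruction wrapped in gcc inline assembly, with everything around it left to the compiler. The choice of RBIT (bit reverse) and the function name reverse_bits are just assumptions for illustration. The preprocessor guard and the plain C fallback are there so the same file still builds on other targets.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word.
 * On 32-bit ARM, architecture v7 or later, this uses the single RBIT
 * instruction through inline assembly; elsewhere it falls back to C. */
static inline uint32_t reverse_bits(uint32_t x)
{
#if defined(__GNUC__) && defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    uint32_t r;
    /* No "volatile" and no memory clobber: the asm is a pure function of
     * its input, so the compiler may schedule, merge or drop it freely. */
    __asm__ ("rbit %0, %1" : "=r" (r) : "r" (x));
    return r;
#else
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);   /* move the lowest bit of x into r */
        x >>= 1;
    }
    return r;
#endif
}
```

Because the operands are described with constraints rather than fixed registers, the compiler picks the registers itself and can inline the function into its callers like any other code.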

On Mon, 03 Feb 2014 12:35:59 -0600, Tim Wescott
<tim@seemywebsite.really> wrote:

>Just to add to what David is saying: hand-writing assembly code for high 
>speed used to make a lot of sense, in the days when compilers did not 
>optimize very well.
>
>That's not the case any more.  Unless you have some oddball corner-case, 
>such as a compiler that does not know how to efficiently use some 
>instruction or set of instructions on the processor, there's no point in 
>doing things in assembly.

I once encountered a web page about implementing the memcpy() with
Pentium processors (apparently assuming virtual memory page and/or
cache line alignment). Apparently quite high speeds could be achieved
by first loading as much as possible into the floating point/MMS
registers available, before storing the data to the destination. 

One other trick was "touching" every 32 byte cache line and hence
loading the Src data into cache and then perform an actual fast copy.

Unfortunately, I do not remember the link to that page.

Anyway, for fast data transfers you really have to consider data
alignment, dynamic memory, cache lines, processor pipelines,
instruction reordering etc.

This is far more demanding than trying to optimize how many PDP-11
integer instructions you could squeeze between PDP-11 floating point
instructions :-)

On 3.2.14 20:35, Tim Wescott wrote:
> On Mon, 03 Feb 2014 16:14:08 +0100, David Brown wrote:
>
>> On 27/01/14 19:26, rxjwg98@gmail.com wrote:
>>> Hi,
>>>
>>> I am learning ARM Cortex-A8 CPU. In order to write optimized assembly
>>> code, I want to know the instruction scheduling. From A8 TRM, it gives
>>> the following Table 16-4. I don't know how to use the cycle count, and
>>> its relationship with source and destination register.
>>>
>>> If cycle count is independent, the question is how to use it in
>>> scheduling.
>>>
>>> If cycle count is relevant to the source and destination register, I
>>> cannot get the cycle number from the pipeline stages from below source
>>> and destination registers.
>>>
>>> Could you explain it to me, experts?
>>
>>
>> My first question here is why are you doing this?  There is a great deal
>> more involved in performance than cycle counts for instructions -
>> pipelines, instruction scheduling, caches, prefetches, write buffers,
>> etc.  Your most important tools here are not the manual, but your system
>> itself - measure the real-world speed for the particular algorithm you
>> want to use.
>>
>> And why are you writing assembly here?  Have you tried using a
>> reasonable compiler, with different flag settings and different details
>> in the source code, and found that the code is too slow for your needs?
>>
>> Of course, if you are just doing this for learning or for fun, it's a
>> different matter - but if you are working on a real application then you
>> are starting from the wrong end.
>
> Just to add to what David is saying: hand-writing assembly code for high
> speed used to make a lot of sense, in the days when compilers did not
> optimize very well.
>
> That's not the case any more.  Unless you have some oddball corner-case,
> such as a compiler that does not know how to efficiently use some
> instruction or set of instructions on the processor, there's no point in
> doing things in assembly.
>
> The last time I wrote assembly and got significant code speed improvement
> was for a TMS320F2812 DSP processor, because Code Composter couldn't
> seem to cough up a one-cycle-per-multiply loop using hardware looping and
> the MAC instruction.  That was well over 10 years ago, and the only
> reason I did the code writing was because I wanted to do something that
> didn't quite fit with the library code that TI provided.
>
> I may do it again, if I find that the gnu compiler can't figure out how
> to efficiently use the MAC instructions in the Cortex M4 core -- time
> will tell, but I'm figuring I have even odds of giving up on the compiler
> and doing things in assembly (and you can bet that I'll ask here to see
> if there's some magic, if it doesn't work for me right off).


Rest assured, at least when optimizing for size (-Os), the GNU
compiler (v 4.7.x) does use the MAC instructions. There is little
to be gained (and plenty to lose) with hand coding for an ARM.
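
For anyone who wants to check this on their own toolchain, a plain C multiply-accumulate loop of the following kind is a reasonable test case. The function name dot_q15 and the 16-bit data with a 64-bit accumulator are assumptions picked for illustration. Which MAC instruction (if any) comes out depends on the compiler version and flags, so look at the assembly listing rather than trusting this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two 16-bit sample buffers with a 64-bit accumulator.
 * Written as plain C so the compiler is free to pick the
 * multiply-accumulate instructions of the target. */
int64_t dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;

    for (size_t i = 0; i < n; i++) {
        /* 16 x 16 -> 32 bit product, accumulated in 64 bits */
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}
```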

The current RISC-based cores are a PITA to program in assembly
language, and they are not intended for it, either. It is up to the compiler
writers to handle the intricacies of the instruction set.

Despite nearly 50 years of assembler programming, I have left
the assembly code to GCC, with few exceptions which can be handled
with the embedded assembly code handling of GCC.

-- 

-Tauno Voipio

On Mon, 03 Feb 2014 16:14:08 +0100, David Brown wrote:

> On 27/01/14 19:26, rxjwg98@gmail.com wrote:
>> Hi,
>> 
>> I am learning ARM Cortex-A8 CPU. In order to write optimized assembly
>> code, I want to know the instruction scheduling. From A8 TRM, it gives
>> the following Table 16-4. I don't know how to use the cycle count, and
>> its relationship with source and destination register.
>> 
>> If cycle count is independent, the question is how to use it in
>> scheduling.
>> 
>> If cycle count is relevant to the source and destination register, I
>> cannot get the cycle number from the pipeline stages from below source
>> and destination registers.
>> 
>> Could you explain it to me, experts?
> 
> 
> My first question here is why are you doing this?  There is a great deal
> more involved in performance than cycle counts for instructions -
> pipelines, instruction scheduling, caches, prefetches, write buffers,
> etc.  Your most important tools here are not the manual, but your system
> itself - measure the real-world speed for the particular algorithm you
> want to use.
> 
> And why are you writing assembly here?  Have you tried using a
> reasonable compiler, with different flag settings and different details
> in the source code, and found that the code is too slow for your needs?
> 
> Of course, if you are just doing this for learning or for fun, it's a
> different matter - but if you are working on a real application then you
> are starting from the wrong end.

Just to add to what David is saying: hand-writing assembly code for high 
speed used to make a lot of sense, in the days when compilers did not 
optimize very well.

That's not the case any more.  Unless you have some oddball corner-case, 
such as a compiler that does not know how to efficiently use some 
instruction or set of instructions on the processor, there's no point in 
doing things in assembly.

The last time I wrote assembly and got significant code speed improvement 
was for a TMS320F2812 DSP processor, because Code Composter couldn't 
seem to cough up a one-cycle-per-multiply loop using hardware looping and 
the MAC instruction.  That was well over 10 years ago, and the only 
reason I did the code writing was because I wanted to do something that 
didn't quite fit with the library code that TI provided.

I may do it again, if I find that the gnu compiler can't figure out how 
to efficiently use the MAC instructions in the Cortex M4 core -- time 
will tell, but I'm figuring I have even odds of giving up on the compiler 
and doing things in assembly (and you can bet that I'll ask here to see 
if there's some magic, if it doesn't work for me right off).

-- 

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

On Monday, January 27, 2014 8:37:06 PM UTC+2, Robert Willy wrote:
> ...
> If cycle count is relevant to the source and destination register,
> I cannot get the cycle number from the pipeline stages from below
> source and destination registers.

Cycle counting on pipelined processors is not very practical.
I don't know ARM, I use power processors - they specify "latencies".
But you will find that things depend on more than such latencies,
e.g. data dependencies: you need the result of one operation to start
the next one, as in multiply-add. So even if the achievable throughput
is 1 instruction/cycle, accumulating into the same register on a
6-stage pipeline means 6 cycles per multiply-add. You will have to
figure out how to order the code to avoid that.

Basically if you write using assembler (with the crippled RISC mnemonics
this is not such a good idea but you don't have many choices, not for
ARM at least) you will have to live with the cycle count as it is,
you can't influence that; what you can influence is the order of the
opcodes, like spreading opcodes such that needed results of previous
operations are used as late as practical etc.
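
The same reordering idea can be shown in C, as a sketch only: split the sum over several independent accumulators so that consecutive multiply-adds do not wait on each other. The function name dot_4acc and the choice of four accumulators are assumptions; the right number depends on the pipeline depth of the core in question.

```c
#include <stddef.h>

/* Dot product using four independent accumulators.
 * Each multiply-add depends only on its own accumulator, so up to four
 * of them can be in flight in the pipeline at once. */
float dot_4acc(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) {           /* leftover elements */
        s0 += a[i] * b[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Note that with floating point this changes the order of the additions, so the result can differ in the last bits from a single-accumulator loop. That is also why a compiler will not normally do this transformation for you unless told it may relax the floating point rules.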

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI             http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/

On 27/01/14 19:26, rxjwg98@gmail.com wrote:
> Hi,
> 
> I am learning ARM Cortex-A8 CPU. In order to write optimized assembly
> code, I want to know the instruction scheduling. From A8 TRM, it
> gives the following Table 16-4. I don't know how to use the cycle
> count, and its relationship with source and destination register.
> 
> If cycle count is independent, the question is how to use it in
> scheduling.
> 
> If cycle count is relevant to the source and destination register, I
> cannot get the cycle number from the pipeline stages from below
> source and destination registers.
> 
> Could you explain it to me, experts?

My first question here is why are you doing this?  There is a great deal
more involved in performance than cycle counts for instructions -
pipelines, instruction scheduling, caches, prefetches, write buffers,
etc.  Your most important tools here are not the manual, but your system
itself - measure the real-world speed for the particular algorithm you
want to use.
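
A minimal sketch of such a measurement, using only the standard C clock() call. The names seconds_per_call and workload are made up for illustration, and workload() is just a stand-in for the algorithm under test. On a bare-metal target without clock() you would read a hardware timer or cycle counter instead, which is not shown here.

```c
#include <time.h>

static volatile unsigned sink;   /* keeps the workload from being optimised away */

/* Stand-in for the code being measured. */
static void workload(void)
{
    unsigned acc = 0;
    for (unsigned i = 0; i < 1000u; i++) {
        acc += i * i;
    }
    sink = acc;
}

/* Average CPU time in seconds for one call of fn, over reps calls.
 * reps must be greater than zero. */
double seconds_per_call(void (*fn)(void), long reps)
{
    clock_t start = clock();
    for (long i = 0; i < reps; i++) {
        fn();
    }
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / (double)reps;
}
```

Typical use would be something like seconds_per_call(workload, 100000), repeated for each compiler flag setting or source variant being compared.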

And why are you writing assembly here?  Have you tried using a
reasonable compiler, with different flag settings and different details
in the source code, and found that the code is too slow for your needs?

Of course, if you are just doing this for learning or for fun, it's a
different matter - but if you are working on a real application then you
are starting from the wrong end.

> 
> 
> Thanks,
> 
> DDI0344I_cortex-a8_r3p1_trm.pdf: Table 16-4 Multiply instructions
>
> Multiply type       Cycles  Source1  Source2  Source3      Source4      Result1  Result2
> Normal: MUL         2       Rm:E1    Rs:E1    [Rd:E3]      {Rn:E4}a     Rd:E5    -
> Long: SMULL, UMULL  3       Rm:E1    Rs:E1    {[RdLo:E3]}  {[RdHi:E3]}  RdLo:E5  RdHi:E5
> Long: SMLAL, UMLAL  3       Rm:E1    Rs:E1    {[RdLo:E2]}  {[RdHi:E1]}  RdLo:E5  RdHi:E5
> Halfword: SMLAxy,   2       Rm:E1    Rs:E1    [Rd:E2]      {Rn:E4}a     Rd:E5    -
>