Tim Frink wrote:
>Niklas Holsti wrote:
>>...  if your processor
>>configuration includes scratchpad memory in the Program Memory 
>>block, in addition to I-cache ... perhaps part of your program 
>>code is in the scratchpad; fetching such code should not count as 
>>an I-cache access. ...
> 
> Good hint, that could be the reason. But I've already checked the
> disassembled code, and all of the code is mapped to cachable
> memory, with no scratchpad accesses.

Ah, so it's not that then. Unless the boot-start routine copies a 
part of the code from the cachable memory to the scratchpad to make 
it run faster... perhaps unlikely.

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .

> Thanks. I'm not intimately familiar with the TriCore or its gcc 
> suite, but one question that comes to mind is if your processor 
> configuration includes scratchpad memory in the Program Memory 
> block, in addition to I-cache. If so, perhaps part of your program 
> code is in the scratchpad; fetching such code should not count as 
> an I-cache access. In your two linking orders (memory layouts) 
> different parts of the code might be placed in the scratchpad area, 
> so the number of instruction cache accesses would also be different 
> for the two layouts.

Good hint, that could be the reason. But I've already checked the
disassembled code, and all of the code is mapped to cachable
memory, with no scratchpad accesses.

Tim

Tim Frink wrote:
>>>I'm experimenting with an instruction set simulator for an 
>>>Infineon DSP.
>>
>>Specifically which processor/chip? Hard to answer without this...
> 
> 
> Thank you for your answer.
> 
> The simulator is from the tricore-gcc suite and is implemented
> for the Infineon TriCore processors.

Thanks. I'm not intimately familiar with the TriCore or its gcc 
suite, but one question that comes to mind is if your processor 
configuration includes scratchpad memory in the Program Memory 
block, in addition to I-cache. If so, perhaps part of your program 
code is in the scratchpad; fetching such code should not count as 
an I-cache access. In your two linking orders (memory layouts) 
different parts of the code might be placed in the scratchpad area, 
so the number of instruction cache accesses would also be different 
for the two layouts.
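
A minimal sketch of that scratchpad hypothesis, in Python. The memory
map below (the CACHED_CODE and SCRATCHPAD ranges) and both fetch
traces are invented for illustration; they are not taken from any
TriCore manual or from the simulator. The sketch only shows how the
same number of fetches could yield different I-cache access counts if
a layout placed hot code in a scratchpad.

```python
# Hypothetical memory map: both ranges are assumptions for illustration.
CACHED_CODE = range(0x8000_0000, 0x8000_0000 + 2 * 1024 * 1024)
SCRATCHPAD = range(0xD400_0000, 0xD400_0000 + 48 * 1024)


def icache_accesses(fetch_addresses):
    """Count fetches that go through the I-cache (cached region only)."""
    return sum(1 for addr in fetch_addresses if addr in CACHED_CODE)


# Layout A: all 1000 fetches come from cached memory.
layout_a = [0x8000_0000 + 4 * (i % 50) for i in range(1000)]

# Layout B: same number of fetches, but a hot loop sits in the scratchpad.
layout_b = ([0x8000_0000 + 4 * i for i in range(100)]
            + [0xD400_0000 + 4 * (i % 50) for i in range(900)])

print(len(layout_a), icache_accesses(layout_a))
print(len(layout_b), icache_accesses(layout_b))
```

Both traces contain the same number of fetches; only the share that
falls in the cached region differs.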

HTH

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .

>> I'm experimenting with an instruction set simulator for an 
>> Infineon DSP.
> 
> Specifically which processor/chip? Hard to answer without this...

Thank you for your answer.

The simulator is from the tricore-gcc suite and is implemented
for the Infineon TriCore processors.

Tim Frink wrote:
> Hi,
> 
> I'm experimenting with an instruction set simulator for an 
> Infineon DSP.

Specifically which processor/chip? Hard to answer without this...

> Each simulation generates statistics. For the original code I get:
> Total number of executed instructions = 615545
> ...
> Total instruction cache accesses: 615501
> ...
> After arbitrary reordering of the routines the statistics look as follows:
> Total number of executed instructions = 615545
> ...
> Total instruction cache accesses: 39072 
> ...
> Why did the number of cache accesses decrease so drastically for
> the same number of executed instructions?

In some DSPs (e.g. some Analog Devices chips) the I-cache is not a 
general I-cache, accessed for all instruction fetches, but is 
specialized to be used only when an instruction fetch would cause a 
delay, perhaps because the I-memory bus is used by a concurrent 
operand access. For such processors the number of I-cache accesses 
may be much less than the number of executed instructions. But I 
don't understand how a code-layout change could change the number 
of cache accesses for such caches. So, tell us which chip you are 
simulating, please.

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .

Hi,

I'm experimenting with an instruction set simulator for an 
Infineon DSP.

The simulator models a DSP with an instruction cache (16 KByte,
2-way set associative, 256 bits per line, LRU replacement).
Instructions may be 16 or 32 bits wide.
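
For reference, the geometry implied by those parameters can be worked
out as below. This is a sketch that assumes conventional modulo
indexing on the fetch address; how the simulator actually indexes the
cache is not stated here, and set_index and tag are names made up for
this example.

```python
CACHE_BYTES = 16 * 1024        # 16 KByte
WAYS = 2                       # 2-way set associative
LINE_BYTES = 256 // 8          # 256 bits per line = 32 bytes

NUM_LINES = CACHE_BYTES // LINE_BYTES   # 512 lines in total
NUM_SETS = NUM_LINES // WAYS            # 256 sets


def set_index(addr):
    """Set that an address maps to (assumed modulo indexing)."""
    return (addr // LINE_BYTES) % NUM_SETS


def tag(addr):
    """Tag distinguishing lines that share a set (same assumption)."""
    return addr // (LINE_BYTES * NUM_SETS)


# Addresses LINE_BYTES * NUM_SETS = 8192 bytes apart share a set.
print(NUM_LINES, NUM_SETS, set_index(0x1000), set_index(0x1000 + 8192))
```

Under this assumption a line holds between 8 and 16 instructions,
depending on the mix of 32-bit and 16-bit encodings.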

I compile a program (given in assembler) in its original version and
run it through the simulator. Next, I arbitrarily reorder the
routines in the assembly code, compile the code again, and run it in
the simulator again.

My goal is to see how the cache performance changes after the routines
have been reordered. It is well known that functions that call each
other frequently should be mapped close to each other in memory to
(possibly) improve cache behavior.
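
The placement effect can be illustrated with a toy 2-way LRU model
using the cache parameters given earlier. This is a simplified sketch,
not the simulator's algorithm: it assumes modulo set indexing, and
each "routine" is reduced to a single fetch address chosen by hand.

```python
from collections import OrderedDict

LINE_BYTES = 32    # 256 bits per line
NUM_SETS = 256     # 16 KByte / 32 bytes / 2 ways
WAYS = 2


def count_misses(fetch_addresses):
    """Count misses in a set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    misses = 0
    for addr in fetch_addresses:
        line = addr // LINE_BYTES
        ways = sets[line % NUM_SETS]
        tag = line // NUM_SETS
        if tag in ways:
            ways.move_to_end(tag)          # mark as most recently used
        else:
            misses += 1
            if len(ways) >= WAYS:
                ways.popitem(last=False)   # evict least recently used
            ways[tag] = True
    return misses


def trace(entry_points, iterations):
    """Visit each routine once per iteration, in order."""
    return [addr for _ in range(iterations) for addr in entry_points]


# Three routines 8 KByte apart all map to set 0 and thrash two ways.
conflicting = trace([0x0000, 0x2000, 0x4000], 100)
# The same routines on adjacent lines use three different sets.
adjacent = trace([0x0000, 0x0020, 0x0040], 100)

print(count_misses(conflicting), count_misses(adjacent))
```

In this model a layout change alters the number of misses, while the
total number of accesses is the same 300 for both traces.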

Each simulation generates statistics. For the original code I get:
Total number of executed instructions = 615545
Total number of cycles = 1304469
Instruction cache Hit Rate: 99.98%
Total instruction cache accesses: 615501
Total instruction cache hits: 615364

After arbitrary reordering of the routines the statistics look as follows:
Total number of executed instructions = 615545
Total number of cycles = 727801
Instruction cache Hit Rate: 99.89%
Total instruction cache accesses: 39072 
Total instruction cache hits: 39028

So the execution time of the reordered code dropped from 1304469
cycles to 727801 cycles. The reason, as can be seen above, is the
total number of instruction cache accesses, which fell from 615501 to
39072 (here, each cache access has a latency of 1 cycle). How is this
possible? Why did the number of cache accesses decrease so drastically
for the same number of executed instructions?
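
The arithmetic behind that observation, as a small Python check. The
figures are copied from the two statistics listings above; the
dictionary keys and the hit_rate helper are names made up for this
example.

```python
original = {"instructions": 615545, "cycles": 1304469,
            "accesses": 615501, "hits": 615364}
reordered = {"instructions": 615545, "cycles": 727801,
             "accesses": 39072, "hits": 39028}


def hit_rate(stats):
    """Hit rate in percent, computed from the hit and access counts."""
    return 100.0 * stats["hits"] / stats["accesses"]


cycle_drop = original["cycles"] - reordered["cycles"]
access_drop = original["accesses"] - reordered["accesses"]
miss_drop = ((original["accesses"] - original["hits"])
             - (reordered["accesses"] - reordered["hits"]))

print("cycle drop: ", cycle_drop)      # 576668
print("access drop:", access_drop)     # 576429
print("miss drop:  ", miss_drop)       # 93
print("hit rates:   %.2f%% %.2f%%" % (hit_rate(original),
                                      hit_rate(reordered)))
```

The drop in accesses (576429) accounts for all but 239 of the 576668
cycles saved, which fits a cost of 1 cycle per access. The miss count
only falls from 137 to 44, so misses are not where the time goes.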

Best regards,
Tim