Hi,

I'm experimenting with an instruction set simulator for an Infineon DSP. The simulator models code execution on a DSP that has an instruction cache (16 KByte, 2-way set associative, 256 bits/line, LRU replacement); instructions may be 16 or 32 bits wide.

I compile a program (given in assembler) in its original version and run it through the simulator. Next, I arbitrarily reorder the routines in the assembly code, compile the code again, and run it in the simulator again. My goal is to see how cache performance changes after the routines have been reordered. It is well known that functions which call each other frequently should be mapped close to each other in memory to (possibly) improve cache behavior.

Each simulation generates statistics. For the original code I get:

Total number of executed instructions = 615545
Total number of cycles = 1304469
Instruction cache Hit Rate: 99.98%
Total instruction cache accesses: 615501
Total instruction cache hits: 615364

After arbitrary reordering of the routines, the statistics look as follows:

Total number of executed instructions = 615545
Total number of cycles = 727801
Instruction cache Hit Rate: 99.89%
Total instruction cache accesses: 39072
Total instruction cache hits: 39028

So the execution time of the reordered code dropped from 1304469 cycles to 727801 cycles. The reason, as can be seen above, is the total number of instruction cache accesses, which dropped from 615501 to 39072 (here, each cache access has a latency of 1 cycle).

How is this possible? Why, for the same number of executed instructions, did the number of cache accesses decrease so drastically?

Best regards,
Tim
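For readers who want to check the figures in the post, here is a minimal Python sketch. It derives the cache geometry from the stated parameters and recomputes the miss counts, hit rates, and instructions per cache access from the two reported runs. The names (`set_index`, `summarize`, `RUNS`) are illustrative only and not part of the simulator; the set-index formula assumes a conventional modulo mapping, which the post does not confirm.

```python
# Cache parameters as stated in the post.
CACHE_BYTES = 16 * 1024      # 16 KByte
WAYS = 2                     # 2-way set associative
LINE_BYTES = 256 // 8        # 256 bits per line = 32 bytes

NUM_LINES = CACHE_BYTES // LINE_BYTES   # total cache lines
NUM_SETS = NUM_LINES // WAYS            # sets, each holding WAYS lines


def set_index(addr):
    """Set selected by a byte address, assuming plain modulo indexing."""
    return (addr // LINE_BYTES) % NUM_SETS


# Statistics copied from the two simulator runs in the post.
RUNS = {
    "original": dict(instructions=615545, cycles=1304469,
                     accesses=615501, hits=615364),
    "reordered": dict(instructions=615545, cycles=727801,
                      accesses=39072, hits=39028),
}


def summarize(run):
    """Derive miss count, hit rate (%) and instructions per cache access."""
    return {
        "misses": run["accesses"] - run["hits"],
        "hit_rate_pct": round(100.0 * run["hits"] / run["accesses"], 2),
        "instr_per_access": round(run["instructions"] / run["accesses"], 2),
    }


if __name__ == "__main__":
    print("lines:", NUM_LINES, "sets:", NUM_SETS)
    for name, run in RUNS.items():
        print(name, summarize(run))
    saved_cycles = RUNS["original"]["cycles"] - RUNS["reordered"]["cycles"]
    saved_accesses = RUNS["original"]["accesses"] - RUNS["reordered"]["accesses"]
    print("cycles saved:", saved_cycles, "accesses saved:", saved_accesses)
```

The cycle difference (576668) and the access difference (576429) are nearly equal, which is consistent with the 1-cycle access latency mentioned in the post. The original run shows about one cache access per executed instruction, the reordered run about one access per 15.75 instructions. The numbers alone do not say why.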
Cache access
Started by ●January 4, 2008
Reply by ●January 6, 2008

Tim Frink wrote:
> Hi,
>
> I'm experimenting with an instruction set simulator for an
> Infineon DSP.

Specifically which processor/chip? Hard to answer without this...

> Each simulation generates a statistics. For the original code I get:
> Total number of executed instructions = 615545
> ...
> Total instruction cache accesses: 615501
> ...
> After arbitrary reordering of the routines the statistics look as follows:
> Total number of executed instructions = 615545
> ...
> Total instruction cache accesses: 39072
> ...
> Why for the same number of executed instructions the number of
> cache accesses decreased so drastically?

In some DSPs (e.g. some Analog Devices chips) the I-cache is not a general I-cache, accessed for all instruction fetches, but is specialized to be used only when an instruction fetch would cause a delay, perhaps because the I-memory bus is used by a concurrent operand access. For such processors the number of I-cache accesses may be much less than the number of executed instructions. But I don't understand how a code-layout change could change the number of cache accesses for such caches.

So, tell us which chip you are simulating, please.

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .
Reply by ●January 7, 2008

>> I'm experimenting with an instruction set simulator for an
>> Infineon DSP.
>
> Specifically which processor/chip? Hard to answer without this...

Thank you for your answer.

The simulator is from the tricore-gcc suite and is implemented for the Infineon TriCore processors.
Reply by ●January 7, 2008

Tim Frink wrote:
>>>I'm experimenting with an instruction set simulator for an
>>>Infineon DSP.
>>
>>Specifically which processor/chip? Hard to answer without this...
>
> Thank you for your answer.
>
> The simulator is from the tricore-gcc suite and is implemented
> for the Infineon TriCore processors.

Thanks. I'm not intimately familiar with the TriCore or its gcc suite, but one question that comes to mind is whether your processor configuration includes scratchpad memory in the Program Memory block, in addition to the I-cache. If so, perhaps part of your program code is in the scratchpad; fetching such code should not count as an I-cache access. In your two linking orders (memory layouts), different parts of the code might be placed in the scratchpad area, so the number of instruction cache accesses would also differ between the two layouts.

HTH

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .
Reply by ●January 7, 2008

> Thanks. I'm not intimately familiar with the TriCore or its gcc
> suite, but one question that comes to mind is if your processor
> configuration includes scratchpad memory in the Program Memory
> block, in addition to I-cache. If so, perhaps part of your program
> code is in the scratchpad; fetching such code should not count as
> an I-cache access. In your two linking orders (memory layouts)
> different parts of the code might be placed in the scratchpad area,
> so the number of instruction cache accesses would also be different
> for the two layouts.

Good hint, that could be the reason. But I've already checked the disassembled code, and all the code is mapped to cacheable memory with no scratchpad accesses.

Tim
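The check described here (confirming from the disassembly that every routine lies in cacheable memory) can be sketched in a few lines of Python. Everything in this sketch is hypothetical: the region boundaries in `REGIONS` and the symbol addresses in `SYMBOLS` are placeholders and are not taken from the TriCore memory map or from the program in this thread. Real values would come from the chip documentation and from the linker map or disassembly.

```python
def classify(symbols, regions):
    """Map each symbol name to the region whose [start, end) range holds its address."""
    placement = {}
    for name, addr in symbols.items():
        placement[name] = next(
            (region for region, (start, end) in regions.items()
             if start <= addr < end),
            "unmapped",  # address falls outside every listed region
        )
    return placement


# Placeholder address ranges (assumption, replace with the real memory map).
REGIONS = {
    "cacheable": (0x80000000, 0x80200000),
    "scratchpad": (0xD4000000, 0xD4004000),
}

# Placeholder routine start addresses (assumption, e.g. read from a map file).
SYMBOLS = {
    "main": 0x80000100,
    "filter": 0x80001F40,
    "isr": 0xD4000020,
}

if __name__ == "__main__":
    for name, region in classify(SYMBOLS, REGIONS).items():
        print(f"{name:8s} -> {region}")
```

Running the same classification on both link orders and comparing the results would show whether any routine moved between regions.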
Reply by ●January 7, 2008

Tim Frink wrote:
> Niklas Holsti wrote:
>> ... if your processor
>> configuration includes scratchpad memory in the Program Memory
>> block, in addition to I-cache ... perhaps part of your program
>> code is in the scratchpad; fetching such code should not count as
>> an I-cache access. ...
>
> Good hint, this could be a possible reason for that. But, I've already
> checked the disassembled code and all the code is mapped to a cachable
> memory with no scratchpad accesses.

Ah, so it's not that, then. Unless the boot-start routine copies part of the code from the cacheable memory to the scratchpad to make it run faster... perhaps unlikely.

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .