I want to understand the impact of buses and wait states on Cortex-M performance. Documentation about this seems to be scattered or unavailable (I would appreciate pointers to appropriate documentation). I was unable to find answers in the documentation, so I did some testing. Below is part of my results. I ran tests on STM32F030, STM32F103, STM32F407, STM32F411 and two Chinese clones, namely CKS32F103 and Air32F103. Let me mention from the start that the Chinese clones show quite different results from STM32F103.

My first test was a delay loop (I needed it for some other tests, but even alone it gives some info). In assembler (GNU as) it is:

    counted_delay:
            sub     r0, #1          @ decrement count; bgt relies on the flags set here
            bgt     counted_delay
            bx      lr

I also did four tests that flip bits in GPIO ports. One is:

    pin_test1:
            ldr     r1, [pc, #16]   @ r1 = bit-band alias address (literal below)
            movs    r2, #0x1
            str     r2, [r1]        @ set the bit
            ldr     r1, [pc, #12]   @ same literal, reloaded
            movs    r2, #0x0
            str     r2, [r1]        @ clear the bit
            sub     r0, #1
            bgt     pin_test1
            bx      lr
            .balign 4
            .long   0x42210198

This is straightforward code; I expect gcc to generate code like this. This one uses access via the bit-band region to set a single bit in the output register. The second test uses the same code, but writes directly to the output register (so it sets all bits configured as output). The third test uses an improved loop, to avoid repeatedly loading constants into registers:

    pin_test3:
            ldr     r1, [pc, #0x00C] @ r1 = bit-band alias address (literal below)
            movs    r2, #0x1
            movs    r3, #0x0
    pin_testl3:
            str     r2, [r1]        @ set the bit
            sub     r0, #1
            str     r3, [r1]        @ clear the bit
            bgt     pin_testl3
            bx      lr
            .long   0x42210198

Again, this one uses the bit-band region. The fourth test is like the third, but does a full write to the output register. I ran the output tests only on F103-compatible processors. For convenience I ran most tests in RAM.

The results are below, all times in clocks. Note: I measured time by reading the SysTick counter. There is some constant overhead/inaccuracy, but it looks like for a given count the time is small_constant + count*coeff, where coeff is the value in the table and count is the repetition count of the loop.
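To make the measurement model concrete, here is a minimal host-side sketch in C of how coeff can be extracted from two runs with different counts, so that the constant part cancels. The function names are mine, not from any library, and I assume SysTick is clocked from the core clock so that ticks equal CPU cycles. SysTick is a 24-bit down-counter, hence the masked subtraction.

```c
#include <assert.h>
#include <stdint.h>

/* SysTick counts DOWN and is 24 bits wide, so elapsed ticks are
   start minus end, masked to 24 bits. This survives at most one
   wrap through zero between the two readings. */
static uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & 0x00FFFFFFu;
}

/* Model: ticks = small_constant + count * coeff.
   Two runs with different counts cancel the constant term. */
static double cycles_per_iteration(uint32_t count1, uint32_t ticks1,
                                   uint32_t count2, uint32_t ticks2)
{
    assert(count1 != count2);   /* otherwise the slope is undefined */
    return ((double)ticks2 - (double)ticks1) /
           ((double)count2 - (double)count1);
}

/* Once coeff is known, the constant overhead follows from one run. */
static double constant_overhead(uint32_t count, uint32_t ticks, double coeff)
{
    return (double)ticks - coeff * (double)count;
}
```

With made-up numbers (not measurements): runs of 1000 and 2000 iterations taking 4012 and 8012 ticks would give coeff = 4 and a constant of 12 ticks.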
                     delay  pin1  pin2  pin3  pin4
    STM32F103 ram      4     28    18    23    12
    CKS32F103 ram      6     29    22    24    14
    Air32F103 ram      4     22    14    19     8
    STM32F103 flash    6                             (2 wait states)
    STM32F103 flash    3                             (0 wait states)
    STM32F030 ram      4
    STM32F407 ram1     6
    STM32F407 ram0     3

For STM32F407, ram1 means the first RAM bank at its default location, and ram0 means the first RAM bank remapped to address 0. On STM32F401 I got the same results as on STM32F407.

Now, already the delay loop raises some questions. ST claims that RAM is zero wait states, but from the timings we see that on STM32F103 we effectively get 1 wait state, compared to 0 wait state flash. OTOH 2 wait state flash actually causes a loss of 3 cycles. One guess was that with 2 wait states the delay loop may be bandwidth limited: each jump seems to cause two accesses to flash due to prefetch, and they need 6 clocks. But disabling flash prefetch still gives 6 clocks (it changed other timings). Also, for CKS32F103 and STM32F407 the penalty compared to the optimal case is 3 clocks. STM32F030 is unremarkable here; the time is exactly as the ARM docs say.

Now the buses. The ARM docs say that Cortex-M3/M4 has three buses. In the address range of the ST RAM the core uses the "system bus", which has some buffering. When executing from lower addresses (flash or remapped RAM) the core uses the "ICode bus" for instruction fetches and the "DCode bus" for data accesses. Clearly forcing all accesses onto a single bus is suboptimal, but for the delay loop alone it should not matter: the delay loop only fetches instructions, all work is done in registers. ARM says that the system bus is "buffered" and the others unbuffered, but it is rather unclear why/if this should impact timings.

IIUC bit-band access uses a read-modify-write sequence, and probably the whole sequence keeps exclusive use of the "system bus" during execution. Since the core is fetching instructions on the "system bus", this must slow down execution. Compared to simple writes (pin1 vs pin2 and pin3 vs pin4 in the table), bit-band access seems to cause an overhead of the order of 7-11 clocks. There are two accesses per iteration, so the overhead for a single access seems to be about 3.5-5.5 cycles.
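For reference, the literal 0x42210198 used in the bit-band tests can be decoded with the standard Cortex-M3 alias formula: alias = alias_base + byte_offset*32 + bit*4. Below is a small host-side sketch in C; the macro and function names are my own. If I read the STM32F103 memory map correctly, the result 0x4001080C is the GPIOA output data register, so the tests toggle bit 6 of that port.

```c
#include <assert.h>
#include <stdint.h>

#define PERIPH_BASE     0x40000000u  /* start of peripheral bit-band region */
#define PERIPH_BB_BASE  0x42000000u  /* start of its bit-band alias region  */

/* Alias word address for one bit of a peripheral register.
   Only the first 1 MB of the peripheral region is bit-banded. */
static uint32_t bitband_alias(uint32_t periph_addr, unsigned bit)
{
    assert(periph_addr >= PERIPH_BASE &&
           periph_addr < PERIPH_BASE + 0x100000u);
    assert(bit < 32u);
    return PERIPH_BB_BASE + (periph_addr - PERIPH_BASE) * 32u + bit * 4u;
}

/* Inverse mapping: returns the word-aligned register address and
   stores the bit number within that 32-bit word in *bit. */
static uint32_t bitband_target(uint32_t alias, unsigned *bit)
{
    uint32_t offset      = alias - PERIPH_BB_BASE;
    uint32_t byte_offset = offset / 32u;         /* 32 alias bytes per target byte */
    unsigned bit_in_byte = (offset % 32u) / 4u;  /* one alias word per bit */

    *bit = (unsigned)(byte_offset & 3u) * 8u + bit_in_byte;
    return PERIPH_BASE + (byte_offset & ~3u);
}
```

Calling bitband_target(0x42210198, &bit) yields 0x4001080C with bit set to 6, and bitband_alias(0x4001080C, 6) maps back to the original literal.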
I must admit that this per-access overhead looks surprisingly high.

The gain from moving the loading of constants outside the loop is almost as expected: two memory fetches each needing 2 clocks and two single-clock instructions together give 6 clocks, which in several cases agrees with the measured results. But there are a few discrepancies.

It may be of some interest that on this very artificial test the three F103-alikes show quite different performance, with CKS32F103 the slowest, Air32F103 the fastest and STM32F103 in the middle.

-- Waldek Hebisch
Cortex-M buses
Started by ●December 29, 2022