JetBrain Embedded Development Trends

Cortex-M buses

Started by Unknown December 29, 2022
I want to understand the impact of buses and wait states on
Cortex-M performance.  Documentation about this seems to be
scattered or unavailable (I would appreciate pointers to
appropriate documentation).  I was unable to find answers in
the documentation, so I did some testing.  Below is part of
my results.  I ran tests on STM32F030, STM32F103, STM32F407,
STM32F411 and two Chinese clones, namely CKS32F103 and
Air32F103.  Let me mention from the start that the Chinese
clones show quite different results from STM32F103.

My first test was a delay loop (I needed it for some other
tests, but even alone it gives some info).  In assembler
(GNU as) it is:

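counted_delay:
    sub     r0, #1          @ decrement count in r0; must set flags for bgt
    bgt     counted_delay   @ repeat while the count is still positive
    bx      lr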

I also did four tests that flip bits in GPIO ports.  The first is:

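pin_test1:
    ldr     r1, [pc, #16]   @ r1 = address from the literal below
    movs    r2, #0x1
    str     r2, [r1]        @ set the bit
    ldr     r1, [pc, #12]   @ load the address again
    movs    r2, #0x0
    str     r2, [r1]        @ clear the bit
    sub     r0, #1
    bgt     pin_test1
    bx      lr
.balign 4
.long 0x42210198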

This is straightforward code; I expect gcc to generate code like
this.  This one uses access via the bit-band region to set a single
bit in the output register.  The second test uses the same code, but
just writes to the output register (so it sets all bits
configured as output).
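
As a cross-check of the constant in the listing, here is a small
C sketch of the peripheral bit-band mapping (alias = alias base +
byte offset * 32 + bit * 4).  The helper and macro names are made
up for illustration.  Assuming the usual Cortex-M3 mapping,
0x42210198 decodes to bit 6 of the word at 0x4001080C, which in
the STM32F103 memory map should be the GPIOA output data register.

```c
#include <stdint.h>

/* Cortex-M3/M4 peripheral bit-band region and its alias region. */
#define BB_PERIPH_BASE   0x40000000u
#define BB_PERIPH_ALIAS  0x42000000u

/* Alias word address for bit 'bit' (0..31) of the word register 'reg'. */
static uint32_t bitband_alias(uint32_t reg, unsigned bit)
{
    return BB_PERIPH_ALIAS + (reg - BB_PERIPH_BASE) * 32u + bit * 4u;
}

/* Inverse mapping: recover the word register address and bit number. */
static void bitband_decode(uint32_t alias, uint32_t *reg, unsigned *bit)
{
    uint32_t offset = alias - BB_PERIPH_ALIAS;          /* offset in alias region   */
    uint32_t byte_addr = BB_PERIPH_BASE + offset / 32u; /* 32 alias bytes per byte  */
    unsigned bit_in_byte = (offset % 32u) / 4u;         /* one alias word per bit   */

    *reg = byte_addr & ~3u;                     /* containing 32-bit register */
    *bit = bit_in_byte + 8u * (byte_addr & 3u); /* little-endian bit number   */
}
```

For the second and fourth tests the literal would instead be the
plain register address, with no alias arithmetic involved.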

The third test uses an improved loop, to avoid repeatedly loading
constants into registers:

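    ldr     r1, [pc, #0x00C]    @ constants are loaded once, before the loop
    movs    r2, #0x1
    movs    r3, #0x0
pin_testl3:                     @ loop entry (position inferred from the text)
    str     r2, [r1]            @ set the bit
    sub     r0, #1
    str     r3, [r1]            @ clear the bit; str leaves the flags intact
    bgt     pin_testl3
    bx      lr
.long 0x42210198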

Again, this one uses the bit-band region.  The fourth test is like
the third, but does a full write to the output register.

I ran the output tests only on F103-compatible processors.  For
convenience I ran most tests in RAM.  The results are below,
all times in clocks.  Note: I measured time by reading the SysTick
counter.  There is some constant overhead/inaccuracy, but
it looks like for a given count the time is
small_constant + count*coeff, where coeff is the value in the table
and count is the repetition count of the loop.
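
A minimal sketch of how coeff can be separated from the constant
overhead: time the loop with two different counts and divide the
difference.  The function names are hypothetical, and the sketch
assumes SysTick runs from the core clock with the reload value
set to 0xFFFFFF, so that one wrap-around between the two reads
is handled by the 24-bit mask.

```c
#include <stdint.h>

/* SysTick is a 24-bit down-counter, so elapsed ticks are start - end,
   reduced modulo 2^24 (assumes at most one wrap, reload = 0xFFFFFF). */
static uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & 0x00FFFFFFu;
}

/* Slope of time = small_constant + count * coeff from two runs
   (count_b > count_a); the constant cancels in the difference. */
static double cycles_per_iteration(uint32_t time_a, uint32_t count_a,
                                   uint32_t time_b, uint32_t count_b)
{
    return (double)(time_b - time_a) / (double)(count_b - count_a);
}
```

With made-up readings of 52 clocks for count 10 and 412 clocks
for count 100, the slope comes out as 4 clocks per iteration.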

                                delay pin1 pin2 pin3 pin4
STM32F103 ram                     4    28   18   23   12
CKS32F103 ram                     6    29   22   24   14
Air32F103 ram                     4    22   14   19    8
STM32F103 flash, 2 wait states    6
STM32F103 flash, 0 wait states    3
STM32F030 ram                     4
STM32F407 ram1                    6
STM32F407 ram0                    3
For STM32F407, ram1 means the first RAM bank at its default location
and ram0 means the first RAM bank remapped to address 0.  On STM32F401
I got the same results as on STM32F407.

Now, already the delay loop raises some questions: STM claims that
RAM is zero wait states, but from the timings we see that on
STM32F103 we effectively get 1 wait state compared to 0 wait
state flash (4 clocks versus 3).  OTOH 2 wait state flash actually
causes a loss of 3 cycles (6 clocks versus 3).  One guess was that
with 2 wait states the delay loop may be bandwidth limited: each
jump seems to cause two accesses to flash due to prefetch, and they
need 6 clocks.  But disabling flash prefetch still gives 6 clocks
(it changed other timings).  Also, for CKS32F103 and STM32F407 the
penalty compared to the optimal case is 3 clocks.  STM32F030 is
unremarkable here: the time is exactly what the ARM docs say.

Now the buses: ARM docs say that Cortex-M3/M4 has three buses.
In the area of STM RAM the core uses the "system bus", which has
some buffering.  When executing from lower addresses (flash or
remapped RAM) the core uses the "ICode bus" for instruction fetches
and the "DCode bus" for data accesses.  Clearly, forcing all accesses
onto a single bus is suboptimal, but for the delay loop alone it
should not matter: the delay loop only fetches instructions, all
work is done in registers.  ARM says that the system bus is
"buffered" and the others unbuffered, but it is rather unclear
why/if this should impact timings.
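
To make the address split concrete, here is a rough C sketch of
which core bus an access should go to, going by my reading of the
Cortex-M3 memory map (code region below 0x20000000, private
peripheral bus at 0xE0000000-0xE00FFFFF, system bus for the rest).
The enum and function names are invented for illustration, and
vendor-specific details are ignored.

```c
#include <stdint.h>
#include <stdbool.h>

enum core_bus { BUS_ICODE, BUS_DCODE, BUS_SYSTEM, BUS_PPB };

/* Classify an access by address and by whether it is an instruction fetch. */
static enum core_bus bus_for_access(uint32_t addr, bool is_instruction_fetch)
{
    if (addr < 0x20000000u)                 /* code region: flash, remapped RAM */
        return is_instruction_fetch ? BUS_ICODE : BUS_DCODE;
    if (addr >= 0xE0000000u && addr <= 0xE00FFFFFu)
        return BUS_PPB;                     /* core peripherals such as SysTick */
    return BUS_SYSTEM;                      /* SRAM, peripherals, bit-band alias */
}
```

Under this model, code running from RAM at 0x20000000 fetches
instructions and writes the GPIO register over the same system
bus, while code running from flash or from RAM remapped to
address 0 fetches over ICode and only the GPIO write uses the
system bus.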

IIUC bit-band access uses a read-modify-write sequence, and probably
the whole sequence keeps exclusive use of the "system bus" during
execution.  Since the core is fetching instructions on the "system
bus", this must slow down execution.  Compared to simple writes,
bit-band access seems to cause an overhead of order 7-11 clocks per
iteration (pin1 versus pin2, pin3 versus pin4).  There are two
accesses per iteration, so the overhead for a single access seems to
be about 4-6 cycles.  I must admit that this looks surprisingly high.

The gain from moving the loading of constants outside the loop is
almost as expected: two memory fetches each needing 2 clocks and
two single-clock instructions together give 6 clocks, which
in several cases agrees with the measured results.  But there
are a few discrepancies.
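
The two comparisons (bit-band versus plain write, constants
reloaded versus hoisted) can be read off the RAM rows of the
table with a few subtractions.  The struct and function names
below are only for illustration; the numbers are the measured
ones from the table.

```c
#include <stdio.h>

struct pin_row { const char *chip; int pin1, pin2, pin3, pin4; };

/* RAM results from the table, in clocks per loop iteration. */
static const struct pin_row ram_results[] = {
    { "STM32F103", 28, 18, 23, 12 },
    { "CKS32F103", 29, 22, 24, 14 },
    { "Air32F103", 22, 14, 19,  8 },
};

/* Extra clocks per iteration caused by the two bit-band writes. */
static int bitband_cost_reload(const struct pin_row *r)  { return r->pin1 - r->pin2; }
static int bitband_cost_hoisted(const struct pin_row *r) { return r->pin3 - r->pin4; }

/* Clocks saved per iteration by loading the constants outside the loop. */
static int hoist_gain_bitband(const struct pin_row *r)   { return r->pin1 - r->pin3; }
static int hoist_gain_plain(const struct pin_row *r)     { return r->pin2 - r->pin4; }

/* Print one line of differences per chip. */
static void print_differences(void)
{
    for (size_t i = 0; i < sizeof ram_results / sizeof ram_results[0]; i++) {
        const struct pin_row *r = &ram_results[i];
        printf("%s: bit-band cost %d/%d, hoisting gain %d/%d\n", r->chip,
               bitband_cost_reload(r), bitband_cost_hoisted(r),
               hoist_gain_bitband(r), hoist_gain_plain(r));
    }
}
```

For these rows the bit-band cost comes out between 7 and 11
clocks per iteration, and the hoisting gain between 3 and 8,
with exactly 6 only for the plain writes on STM32F103 and
Air32F103.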

It may be of some interest that on this very artificial test
the three F103-alikes show quite different performance, with
CKS32F103 the slowest, Air32F103 the fastest and STM32F103 in
the middle.

                              Waldek Hebisch
