Re: [lpc2100] Simple test program - is now instruction pipeline/VPB question

Started by microbit November 22, 2003
Hi Leon,
 
I'm "on the air" now too with LPC2106 !
 
Here is a prelim. current consumption figure I took :
 
I measure 6.5 mA @ 10 MHz setting and clearing P0.0 in a loop
(PLL default bypassed and 1 cclk / fetch) executing out of Flash.
Pretty impressive !
 
Trying to execute this as fast as possible has lifted the veil a bit on some of the instructions,
but there is still a mystery, so here is a question for people who are much more intimate with ARM7/LPC
(I don't feel like asking an FAE who won't know the answer anyway, and I can't find it in the
ARM7TDMI ref manual):
 
It seems that either the VPB bus causes NOPs to be inserted in the pipeline, or wait states are
automatically generated when I "write too fast" to the VPB bus.
Furthermore, I'm not even generating Read/Modify/Write instructions with my test C code !!!! ????
 
The 2nd question is, what if I write at a slower rate to VPB ?
Do I still need the fastest pclk for my I/O pins to update as fast as possible ?????
(I can't trace :-)
 
As an example, C code generating this sequence :
.......
STR     R4,[R7,#0]            P0.0 set to "1"
MOV    R4,#1
STR     R4,[R0,#0]            P0.0 set to "0"
......
 
Takes :
1.6 uS (16 cclks) with pclk = 4 cclks
1.0 uS (10 cclks) with pclk = 2 cclks
0.8 uS (8 cclks)  with pclk = 1 cclks
 
(that's what I measure here on HW)
 
Is there anyone that can shed some light on this ?
 
[Scope capture: toggling P0.0 with pclk = cclk/4]
 
-- Kris
www.microbit.com.au
 
 
----- Original Message -----
From: "Leon Heller" <l...@hotmail.com>
To: <l...@yahoogroups.com>
Sent: Saturday, November 22, 2003 9:49 PM
Subject: [lpc2100] Simple test program




At 05:07 AM 11/23/03 +1100, you wrote:
>The 2nd question is, what if I write at a slower rate to VPB ?
>Do I still need the fastest pclk for my I/O pins to update as fast as
>possible ?????
>(I can't trace :-)
>
>As an example, C code generating this sequence :
>.......
>STR R4,[R7,#0] P0.0 set to "1"
>MOV R4,#1
>STR R4,[R0,#0] P0.0 set to "0"
>......
>
>Takes :
>1.6 uS (16 cclks) with pclk = 4 cclks
>1.0 uS (10 cclks) with pclk = 2 cclks
>0.8 uS (8 cclks) with pclk = 1 cclks
>
>(that's what I measure here on HW)
>
>Is there anyone that can shed some light on this ?
>
>toggling P0.0 with pclk = cclk/4
>
>-- Kris
><http://www.microbit.com.au>www.microbit.com.au
>

I was going to play with bus optimization next anyway so I thought I'd
measure the results and pass them along.

All of these with a 10MHz clock PLL'd to 60MHz.

MAM off:
  ASM optimized: 1.06us period, ~740ns on, ~330ns off
  C:             1.8us period, ~800ns on, ~1us off

MAM on, access to flash at recommended 3 cycles, VPB divider at default:
  ASM optimized: near square wave with 600ns period
  C:             near square wave with 736ns period

MAM on, access to flash at recommended 3 cycles, VPB divider set to 1:
  ASM optimized: 264ns period, ~168ns off, ~118ns on
  C:             near square wave with 416ns period

The (hand) optimized assembly loop used is

mov r3, #256
ldr r2, .L67+32
ldr r4, .L67+36
.L64:
str r3, [r2, #0]
str r3, [r4, #0]
b .L64

If the output is instruction rate limited then I would expect an output
with an approx 2/3 duty cycle. That is only approached for the first
case. For all other cases there is clearly some time taken up with I/O.

Also clearly getting maximum throughput will depend on setting up the bus
'correctly'.

Setting the VPB divider to 1 in this configuration also seems to have an
effect on the UART. I haven't figured that out yet but what should be
9600 baud drops to about 9000 baud.

Robert Adsett



Hi Robert,

Thanks for that, it's a strange one.
Do you know if maybe NOPs are inserted in the pipe ?
I can't figure it out.

Also, do you know if we have faster access executing from RAM ?
I assume the bus is the bottleneck anyway.

The VPB bottleneck has me completely stumped.
I can't figure why it affects IO for starters anyway...... (or UART for that
matter)
Some more detective work might be needed :-)

All the best,
Kris
www.microbit.com.au

----- Original Message -----
From: "Robert Adsett" <>
To: <>
Sent: Sunday, November 23, 2003 9:47 AM
Subject: Re: [lpc2100] Simple test program - is now instruction pipeline/VPB question


Also,

> MAM Off
> ASM optimized 1.06uS period ~740nS on ~330nS off
> C 1.8uS period ~800ns on ~1uS off

Use a long* to IOCLR and IOSET in C; it saves 4 cclks in my case in the
for(;;) loop.
Instead of the LDR R4,[PC,#<LITERAL_OFFSET>] literal load,
the STR is then indeed done straight from a register.
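
A minimal sketch of the pointer approach in C. The register addresses are the LPC2106 GPIO addresses as recalled from the user manual (IOSET at 0xE0028004, IOCLR at 0xE002800C); treat them, the function name `toggle_pin`, and the use of `uint32_t` rather than `long` as assumptions, not the code actually measured here. The registers are passed in as pointers so the compiler can keep both addresses in registers inside the loop.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed LPC2106 GPIO register addresses (check against the user manual). */
#define IOSET_ADDR 0xE0028004u
#define IOCLR_ADDR 0xE002800Cu

/*
 * Pulse the pins in `mask` high then low `count` times.
 * `set` and `clr` are passed in as pointers, so inside the loop each
 * write can be a single STR through a register-held address.
 * Returns the number of pulses generated.
 */
unsigned toggle_pin(volatile uint32_t *set, volatile uint32_t *clr,
                    uint32_t mask, unsigned count)
{
    unsigned i;

    assert(set != 0 && clr != 0);
    for (i = 0; i < count; i++) {
        *set = mask;   /* pin high */
        *clr = mask;   /* pin low  */
    }
    return i;
}
```

On the target this would be called as `toggle_pin((volatile uint32_t *)IOSET_ADDR, (volatile uint32_t *)IOCLR_ADDR, 1u << 0, n)`. Whether the literal-pool load really disappears depends on the compiler and optimization level, so it is worth checking the generated listing.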

-- Kris


At 09:57 AM 11/23/03 +1100, you wrote:
>Hi Robert,
>
>Thanks for that, it's a strange one.
>Do you know if maybe NOPs are inserted in the pipe ?
>I can't figure it out.

I would expect that the VPB peripheral responsible is just asserting wait
states. Modifying the core to insert NOPs into the pipe would be a larger
task and I don't see very many benefits.

>Also, do you know if we have faster access executing from RAM ?
>I assume the bus is the bottleneck anyway.

If it was limited by the speed of fetching instructions then I would expect
a 1/3 - 2/3 duty cycle on the output. I only get that for the default case
with the MAM turned off and flash access set to the default 7 cycles. Once
the MAM is enabled it seems likely that the three instructions in the loop
are maintained in the MAM's cache and so have no access delay. There is
asymmetry but I think the shape is dominated by the output peripheral.

>The VPB bottleneck has me completely stumped.

I suppose it might be the VPB but I suspect it's the actual I/O peripheral
that's speed limited.

>I can't figure why it affects IO for starters anyway...... (or UART for that
>matter)
>Some more detective work might be needed :-)

The UART has me puzzled. I'm going to do some more investigation and I've
posted a question to Philips forum to see if there is some frequency
limitation I've overlooked or is undocumented.

Robert Adsett



At 05:47 PM 11/22/03 -0500, you wrote:

>Setting the VPB divider to 1 in this configuration also seems to have an
>effect on the UART. I haven't figured that out yet but what should be
>9600 baud drops to about 9000 baud.
>
>Robert Adsett

Got it. While cleaning up my support code and generalizing it so I could
place it with some newlib support, I realized that I had misplaced the PLL
divider field by 1 bit, resulting in a value of 1/2 what I expected. That
means that the internal PLL was running at ~120MHz, which is below the
156MHz specified minimum. Apparently when that happens some peripherals
notice the effect and others don't.
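
The arithmetic behind this can be sketched in C. This follows my reading of the LPC2106 PLL description and is not Robert's code: cclk = M x Fosc, the internal oscillator runs at Fcco = cclk x 2 x P, and PLLCFG holds MSEL (M - 1) in bits 4:0 and PSEL in bits 6:5 with P = 1, 2, 4 or 8. The 320MHz upper limit is from memory, and all function and macro names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FCCO_MIN_HZ 156000000u  /* minimum quoted in the thread */
#define FCCO_MAX_HZ 320000000u  /* assumed upper limit */

/* Build a PLLCFG value from multiplier M (1..32) and PSEL field (0..3). */
uint32_t pllcfg_value(unsigned m, unsigned psel)
{
    assert(m >= 1 && m <= 32 && psel <= 3);
    return (uint32_t)((psel << 5) | (m - 1));
}

/*
 * Internal oscillator frequency implied by a PLLCFG value:
 * Fosc * M * 2 * P, with P = 2^PSEL.
 * Fits in 32 bits for the 10MHz / M = 6 case discussed here.
 */
uint32_t fcco_hz(uint32_t fosc_hz, uint32_t pllcfg)
{
    uint32_t m = (pllcfg & 0x1Fu) + 1u;
    uint32_t p = 1u << ((pllcfg >> 5) & 0x3u);
    return fosc_hz * m * 2u * p;
}

/* Non-zero when the frequency is inside the assumed legal range. */
int fcco_in_range(uint32_t fcco)
{
    return fcco >= FCCO_MIN_HZ && fcco <= FCCO_MAX_HZ;
}
```

With Fosc = 10MHz and M = 6 (cclk = 60MHz), P = 2 gives Fcco = 240MHz, while P = 1 gives 120MHz, which is consistent with the ~120MHz figure above and fails the range check.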

Robert

" 'Freedom' has no meaning of itself. There are always restrictions,
be they legal, genetic, or physical. If you don't believe me, try to
chew a radio signal. "

Kelvin Throop, III



Here is an explanation of the I/O toggle speed that is observed in
these devices.

Richard

The I/O speed has a maximum at ~3.7 MHz for several reasons,
none specific to our parts. It is caused by interactions between the
ARM pipeline, the VPB bus, the ARM AHB wrapper (interface between the
ARM7TDMI-S core and the AHB bus), and the instruction timing itself.
For the minimum 3-instruction loop below, a Store (Write to I/O pin)
followed by another Store (toggle the I/O pin) and a Branch back to
the first Store, the timing is as follows (Fe for Fetch, De for
Decode, En for execution clock n):

Pass1:

STR: Fe-De-E1-E1-E2-E2-E2-E2-E2
STR: Fe-De----------------------------E1-E1-E2-E2-E2-E2
B: Fe-----------------------------De----------------------E1-E1-E2-E3

Pass2:

STR: Fe-De

And so on...

An STR to VPB space takes 8 clocks because the last 2 phases (STR is
a 4 phase instruction) are Non-Sequential (NS) accesses and the AHB
wrapper adds one wait state for every NS access. This means the 3rd
phase of the instruction takes 2 clocks, and the fourth phase takes 4
because of the wait state and the VPB operations being 3 clocks.

The second STR can be fetched and Decoded in the pipeline but will
then stall because the execution pipeline stage is busy (the first
Store has not completed yet). The Branch instruction can also be
fetched in the Decode slot of the second STR but it will then stall
because the Decode stage is occupied by the second STR.

After the first STR completes, the second STR will start its
execution phase and finally will allow the Branch instruction (which
also has one NS phase) to proceed.

End result: This takes 16 clocks (266.7 ns at 60 MHz with VPB clock
set to 1) with a duty cycle of 6:10.

Code:

.loop:
str r2, [r7, #0]
str r2, [r6, #0]
b .loop
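
The figures above can be cross-checked with a few lines of C. The execute-stage counts are taken from the text (2 + 4 clocks for each STR, and the four E slots shown for the branch); the function names are only for illustration.

```c
#include <assert.h>

/* Execute-stage clocks per instruction, as described in the text. */
enum { STR_EXEC_CLKS = 2 + 4, B_EXEC_CLKS = 4 };

/* Clocks for one pass of the STR / STR / B loop. */
unsigned loop_clocks(void)
{
    return STR_EXEC_CLKS + STR_EXEC_CLKS + B_EXEC_CLKS;
}

/* Toggle frequency in Hz for a given core clock. */
unsigned long toggle_hz(unsigned long cclk_hz)
{
    assert(cclk_hz > 0);
    return cclk_hz / loop_clocks();
}

/* Loop period in picoseconds (integer arithmetic, rounded down). */
unsigned long period_ps(unsigned long cclk_hz)
{
    assert(cclk_hz > 0);
    return (unsigned long)(1000000000000ULL * loop_clocks() / cclk_hz);
}
```

At 60 MHz this gives 16 clocks, 3.75 MHz and about 266.7 ns. The 6:10 split follows from the same numbers: 6 clocks between the two stores, then 6 + 4 = 10 clocks until the first store executes again.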




Hello,

first, many thanks for your explanation! We had a similar thread on this
mailing list on 1 Feb 2004 (name: Optimization of capture routine...).
There a trick was suggested: don't do the access just once followed by a jump,
but do it multiple times and then check whether the loop is finished (depends
on what you want to do). Then you can go much higher:

> I am now at around 5,8956 MBytes / second, which is close to 5,898 MBytes /
> sec. ( = Fosc * 4 / 10).
> So the two operations ( ldr ip, [r0, #0] and strb ip, [r2], #1) seems to
> take in sum 10 cycles.
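
The unrolling idea can be sketched in C as follows. This is not the code from the earlier thread; `capture_unrolled`, the 4x unroll factor and the requirement that the count be a multiple of 4 are all choices made for illustration. The point is only that the loop test and branch are paid once per four transfers instead of once per transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy `count` samples from an I/O register into `dst`, keeping the low
 * byte of each read. The body is unrolled four times so the loop test
 * and branch are executed once per four samples.
 * `count` must be a multiple of 4 in this simplified sketch.
 */
void capture_unrolled(volatile const uint32_t *port, uint8_t *dst, size_t count)
{
    assert(port != 0 && dst != 0 && count % 4 == 0);
    while (count != 0) {
        *dst++ = (uint8_t)*port;
        *dst++ = (uint8_t)*port;
        *dst++ = (uint8_t)*port;
        *dst++ = (uint8_t)*port;
        count -= 4;
    }
}
```

On the target, `port` would point at the pin register, and the achievable rate still has to be measured; the 10 cycles per ldr/strb pair quoted above is the figure to compare against.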

Regards,

Martin

----- Original Message -----
From: "philips_apps" <>
To: <>
Sent: Wednesday, November 10, 2004 10:27 PM
Subject: [lpc2000] I/O Speed - An Explanation


Is there any published documentation which contains this information
(e.g.: that VPB operations are 3 clocks)?