LPC17xx GPIO bit-banding slower?

Started by talikan January 11, 2010
Hello,

I'm quite new to ARM programming, so I'm starting with something simple: experimenting with GPIO on my LPC1756.

BTW, I'm using CodeSourcery G++ Lite + Eclipse + J-Link, and my LPC1756 is running at 4MHz on the internal oscillator.

My aim was to see how fast the GPIO could go...

1) Using CMSIS' GPIO_SetValue and GPIO_ClearValue, I could reach 125kHz as output frequency. As I don't know how to measure the number of cycles for a given instruction (how can we do that?), I guessed that about 4MHz/125kHz = 32 cycles were needed for both commands, so about 16 cycles for each set/clear.

2) By directly writing to GPIO2->FIOPIN, I could reach a 1MHz output, which is great and gives an average of 2 cycles for each set/clear. The duty cycle was closer to 25% than 50% though.

3) I then read the bit-banding chapter in the user manual, and prepared for an even better result since I thought that bit-banding could bring 1-cycle set/clear. Instead, the output frequency dropped to 660kHz, which works out to about 3 cycles per set/clear command...
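
The three estimates above all come from the same arithmetic: divide the core clock by the measured output frequency to get cycles per output period, then halve it, since one period holds one set and one clear. A minimal sketch of that calculation follows; the function name is made up for illustration, and the result presumably also includes whatever loop overhead (such as the branch back) the test program had.

```c
/* Average core clock cycles per single set or clear, derived from the
 * measured square-wave frequency. One output period = one set + one clear.
 * Hypothetical helper, not part of CMSIS or any NXP library. */
static double cycles_per_edge(double core_hz, double output_hz)
{
    double cycles_per_period = core_hz / output_hz;
    return cycles_per_period / 2.0;
}
```

With a 4 MHz core clock, `cycles_per_edge(4e6, 125e3)` gives 16, `cycles_per_edge(4e6, 1e6)` gives 2, and `cycles_per_edge(4e6, 660e3)` gives roughly 3.03.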

How come that this bit-banding instruction
*((volatile unsigned int *)0x23380a9c) = 1;
used to set the P2.7 pin is slower (about 3 cycles) than this one
GPIO2->FIOPIN=0x80
(about 2 cycles) ?

I've used different optimization flags for gcc, but the result is the same...

Thanks for your help.

Nicolas.
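
For reference, the alias address used in the question is consistent with the standard Cortex-M3 bit-band mapping. The sketch below assumes that the GPIO2 FIOPIN register sits at 0x2009C054 (to be checked against the LPC17xx user manual) and that the bit-band region starts at 0x20000000 with its alias region at 0x22000000. The function and macro names are illustrative only.

```c
#include <stdint.h>

#define BITBAND_REGION_BASE 0x20000000u  /* start of the bit-band region */
#define BITBAND_ALIAS_BASE  0x22000000u  /* start of its alias region    */

/* Each bit in the bit-band region is mapped to its own 32-bit word in the
 * alias region: alias = alias_base + byte_offset * 32 + bit_number * 4. */
static uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    uint32_t byte_offset = byte_addr - BITBAND_REGION_BASE;
    return BITBAND_ALIAS_BASE + byte_offset * 32u + bit * 4u;
}
```

Under those assumptions, `bitband_alias(0x2009C054u, 7)` yields 0x23380A9C, the address written in the post for P2.7.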

Hi,

Bit-banding does not accelerate GPIO if all you do is write to the pins.
Bit-banding accelerates writing to *bits* in a port, i.e. it does a
read-modify-write. Your non-bit-band code does not do a RMW, it does a
straight assign--hence it is *not* a like-for-like comparison. The same can
be achieved using, for instance, the LPC2k FIO0SET and FIO0CLR registers,
the mask registers, or the byte/half-word access registers.

...and you're using a compiler, which will do what it likes. So I suggest
you try coding it in assembly for a comparison.
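
The difference between a straight assign and a read-modify-write can be shown with a host-side model. This is only a sketch: the struct below is an ordinary variable standing in for a port register, not real hardware, and all names are hypothetical.

```c
#include <stdint.h>

/* Plain variable standing in for a port pin register (NOT real hardware). */
typedef struct {
    uint32_t pin;
} fake_port_t;

/* Whole-word assignment, as in GPIO2->FIOPIN = 0x80: a single store,
 * but every other pin on the port is forced low as a side effect. */
static void pin7_assign(fake_port_t *port)
{
    port->pin = 0x80u;
}

/* Read-modify-write: only bit 7 changes and the other pins keep their
 * state. A bit-band write has this effect on the target word, and a
 * dedicated set register gives the same result without the read. */
static void pin7_rmw(fake_port_t *port)
{
    port->pin |= 0x80u;
}
```

Starting from a port value of 0x01, `pin7_assign` leaves 0x80 while `pin7_rmw` leaves 0x81, which is why the two timings in the question do not measure the same operation.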

--
Paul Curtis, Rowley Associates Ltd http://www.rowley.co.uk

Ok, understood.

Thanks for your help!

Nicolas.

I'd be interested if someone knew cycle counts for the lpc17xx

read ram,
read i/o,
write ram,
write i/o,
irq in,
irq out,
push/pop set of registers,
branch

I guess I would expect to find that stuff in the "Cortex-M3 Technical Reference Manual" over at the ARM site.

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337g/DDI0337G_cortex_m3_r2p0_trm.pdf

You may find these timings to be non-trivial as, for example, the pipeline must reload following a branch. Many instructions are rated in terms of "normally" or "usually".

You may also get some information from the NXP Datasheet and the NXP User Manual.

But unless you want to write assembly code, there isn't a lot you can do about timing. The compiler does whatever it wants as a function of the optimization settings.

Richard
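
On the earlier question of how to measure cycles: one common approach on Cortex-M3 parts is the DWT cycle counter, assuming the DWT unit is implemented on the device (it is an optional component of the core). The sketch below takes the register pointers as parameters so that it can be exercised off-target; the function names are hypothetical, while the register addresses and bit positions are those defined by the ARMv7-M architecture.

```c
#include <stdint.h>

/* Architecturally defined Cortex-M3 debug register addresses. */
#define DEMCR_ADDR      0xE000EDFCu  /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR   0xE0001000u  /* DWT control register                */
#define DWT_CYCCNT_ADDR 0xE0001004u  /* DWT cycle count register            */

#define DEMCR_TRCENA        (1u << 24)  /* enables the DWT unit        */
#define DWT_CTRL_CYCCNTENA  (1u << 0)   /* starts the cycle counter    */

/* Enable the trace block, zero the counter, then start counting. */
static void cyccnt_enable(volatile uint32_t *demcr,
                          volatile uint32_t *dwt_ctrl,
                          volatile uint32_t *dwt_cyccnt)
{
    *demcr |= DEMCR_TRCENA;
    *dwt_cyccnt = 0u;
    *dwt_ctrl |= DWT_CTRL_CYCCNTENA;
}

/* Unsigned subtraction stays correct across one 32-bit counter wrap. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target, the pointers would be `(volatile uint32_t *)DEMCR_ADDR` and so on; reading the counter before and after the code under test and passing both values to `cycles_elapsed` gives the cycle count, including the overhead of the two reads themselves.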

Don't forget to include any wait states imposed on the core by the chip design.

Richard

I am currently using the LPC23xx in applications that use the peripherals a lot, and have found that loads/stores to I/O can take a while (depending on the MAM and peripheral clock settings).

So I'm considering whether migration to the LPC17xx is worth it.
From what I see so far, the NVIC is a huge improvement over the VIC. LDR pipelining might also help some, and the clock rate helps a little. And there are about two handfuls of other niceties.

As you say, the wait states are important, and I am not sure where to look for them.

If your app is so timing sensitive that you are looking at minor improvements (and a much faster clock), try a different architecture. The Blackfin runs at 500 MHz and will blow the socks off of these low end ARMs. The Intel PXAs are another choice.

If you want to know how the chip works, you're going to have to spend a lot of time with the User Manual and the Tech Ref Manual. Specifically, flash wait states are covered in Chapter 5, Section 4 of the NXP User Manual. Table 49 sums it up. After you read the table, you'll probably wonder why you want to run so fast if you spend all your time waiting on flash. Beats me!

The thing about pipelined processors is that you can't be certain how long any given instruction will take. Sure, the instruction is executed in one clock, for example, but only after it has been fetched, and that might take 4 clocks. Or not... The LPC2148 grabs 128 bits of flash at a time. And I believe it has 3 prefetch registers so it can look down both paths from a branch (I think...). I don't know how the LPC17xx works. The point is, these aren't 8085s or 8051s where it is known exactly how long each instruction takes to execute.

Richard

There isn't much of a problem with sequential code, as the flash accelerator fetches 128 bits at a time (8 Thumb instructions) in 5 cycles (at 100 MHz). Branches could hurt, but some code could be executed out of RAM if needed.

My app currently works on the 23xx, but interrupt latency is a performance bottleneck. The 17xx could help quite a bit.
