EmbeddedRelated.com

Optimising GPIO on LPC1343 (Cortex M3)

Started by kevin_townsend2 December 10, 2010
I've been working on a TFT LCD driver (ILI9325) using an 8-bit interface, and have been able to get it up to about 15fps @ 72MHz by paying attention to how I set GPIO (single-cycle clear+set as described in section 8.5.1 of the 1343 UM, etc.). Compiled with no optimisation (GCC 4.4), it still feels kind of slow, with the cmd method taking 40 cycles. Since this function is very heavily used, I was wondering if anyone could suggest how I could optimise it further? I'd like to stay in C for portability, but I'm open to anything to shave a few cycles off. 40 feels like a lot!

Sorry for the longish code chunk, but for completeness' sake here's the C and generated code:

---------- START CODE ----------

----- HEADER FILE (bit and register definitions) -----

// These registers allow fast single operation clear+set of bits (see section 8.5.1 of LPC1343 UM)
#define ILI9325_GPIO2DATA_DATA (*(pREG32 (GPIO_GPIO2_BASE + (ILI9325_DATA_MASK << 2))))
#define ILI9325_GPIO1DATA_WR (*(pREG32 (GPIO_GPIO1_BASE + ((1 << ILI9325_WR_PIN) << 2))))
#define ILI9325_GPIO1DATA_CD (*(pREG32 (GPIO_GPIO1_BASE + ((1 << ILI9325_CD_PIN) << 2))))
#define ILI9325_GPIO1DATA_CS (*(pREG32 (GPIO_GPIO1_BASE + ((1 << ILI9325_CS_PIN) << 2))))
#define ILI9325_GPIO1DATA_RD (*(pREG32 (GPIO_GPIO1_BASE + ((1 << ILI9325_RD_PIN) << 2))))
#define ILI9325_GPIO3DATA_RES (*(pREG32 (GPIO_GPIO3_BASE + ((1 << ILI9325_RES_PIN) << 2))))
#define ILI9325_GPIO1DATA_CS_CD (*(pREG32 (GPIO_GPIO1_BASE + ((ILI9325_CS_CD_PINS) << 2))))
#define ILI9325_GPIO1DATA_RD_WR (*(pREG32 (GPIO_GPIO1_BASE + ((ILI9325_RD_WR_PINS) << 2))))
#define ILI9325_GPIO1DATA_WR_CS (*(pREG32 (GPIO_GPIO1_BASE + ((ILI9325_WR_CS_PINS) << 2))))
#define ILI9325_GPIO1DATA_CD_RD_WR (*(pREG32 (GPIO_GPIO1_BASE + ((ILI9325_CD_RD_WR_PINS) << 2))))

// No trailing semicolons here: each use site supplies its own
#define CLR_CS_CD ILI9325_GPIO1DATA_CS_CD = (0)
#define SET_RD_WR ILI9325_GPIO1DATA_RD_WR = (ILI9325_RD_WR_PINS)
#define CLR_WR ILI9325_GPIO1DATA_WR = (0)
#define SET_WR ILI9325_GPIO1DATA_WR = (1 << ILI9325_WR_PIN)
#define SET_WR_CS ILI9325_GPIO1DATA_WR_CS = (ILI9325_WR_CS_PINS)

----- C -----

void ili9325WriteCmd(uint16_t command)
{
CLR_CS_CD; // Saves 7 commands compared to "CLR_CS; CLR_CD;"
SET_RD_WR; // Saves 7 commands compared to "SET_RD; SET_WR;"
ILI9325_GPIO2DATA_DATA = (command >> (8 - ILI9325_DATA_OFFSET));
CLR_WR;
SET_WR;
ILI9325_GPIO2DATA_DATA = command << ILI9325_DATA_OFFSET;
CLR_WR;
SET_WR_CS; // Saves 7 commands compared to "SET_WR; SET_CS;"
}

----- GENERATED ASSEMBLY (GCC 4.4, no optimisations enabled) -----

void ili9325WriteCmd(uint16_t command)
{
B480 push {r7}
B081 sub sp, #4
AF00 add r7, sp, #0
4603 mov r3, r0
803B strh r3, [r7, #0]

CLR_CS_CD; // Saves 7 commands compared to "CLR_CS; CLR_CD;"
F6404300 movw r3, #0xC00
F2C50301 movt r3, #0x5001
F04F0200 mov.w r2, #0
601A str r2, [r3, #0]

SET_RD_WR; // Saves 7 commands compared to "SET_RD; SET_WR;"
F2430300 movw r3, #0x3000
F2C50301 movt r3, #0x5001
F44F6240 mov.w r2, #0xC00
601A str r2, [r3, #0]

ILI9325_GPIO2DATA_DATA = (command >> (8 - ILI9325_DATA_OFFSET));
F24073F8 movw r3, #0x7F8
F2C50302 movt r3, #0x5002
883A ldrh r2, [r7, #0]
EA4F12D2 mov.w r2, r2, lsr #7
B292 uxth r2, r2
601A str r2, [r3, #0]

CLR_WR;
F2410300 movw r3, #0x1000
F2C50301 movt r3, #0x5001
F04F0200 mov.w r2, #0
601A str r2, [r3, #0]

SET_WR;
F2410300 movw r3, #0x1000
F2C50301 movt r3, #0x5001
F44F6280 mov.w r2, #0x400
601A str r2, [r3, #0]

ILI9325_GPIO2DATA_DATA = command << ILI9325_DATA_OFFSET;
F24073F8 movw r3, #0x7F8
F2C50302 movt r3, #0x5002
883A ldrh r2, [r7, #0]
EA4F0242 mov.w r2, r2, lsl #1
601A str r2, [r3, #0]

CLR_WR;
F2410300 movw r3, #0x1000
F2C50301 movt r3, #0x5001
F04F0200 mov.w r2, #0
601A str r2, [r3, #0]

SET_WR_CS; // Saves 7 commands compared to "SET_WR; SET_CS;"
F2414300 movw r3, #0x1400
F2C50301 movt r3, #0x5001
F44F62A0 mov.w r2, #0x500
601A str r2, [r3, #0]
}

---------- END CODE ----------
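The `<< 2` in the register macros comes from how the LPC1343 maps masked access: address bits [13:2] of a GPIOnDATA access select which of the port's 12 pins the write may touch. A minimal sketch of that address arithmetic follows. The pin numbers are assumptions inferred from the constants in the disassembly above (they are not given in the post), and `masked_addr` is a hypothetical helper used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Port base addresses as they appear in the disassembly
   (movt r3, #0x5001 and movt r3, #0x5002). */
#define GPIO1_BASE 0x50010000u
#define GPIO2_BASE 0x50020000u

/* ASSUMPTION: pin numbers inferred from the listing, not from the header. */
enum { CS_PIN = 8, CD_PIN = 9, WR_PIN = 10, RD_PIN = 11 };

/* ASSUMPTION: data bus mask inferred from the 0x7F8 offset (0x7F8 >> 2). */
#define DATA_MASK 0x1FEu

/* Address of the masked-access alias for `mask` on the port at `base`.
   Address bits [13:2] carry the 12-bit pin mask, hence the shift by 2. */
static uint32_t masked_addr(uint32_t base, uint32_t mask)
{
    assert(mask <= 0xFFFu); /* a port has at most 12 pins */
    return base + (mask << 2);
}
```

With these assumed pins, the CS+CD alias lands on 0x50010C00 and the WR alias on 0x50011000, which matches the movw/movt pairs in the listing.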


Hello,

Friday, December 10, 2010, 8:09:34 PM, you wrote:

> --- In l..., "kevin_townsend2" wrote:

>> Sorry for the longish code chunk, but for completeness sake here's the C and generated code:

> In looking at the ASM code, seems like there are at least three
> address constants that are loaded multiple times. I suspect that
> the multiple loads will be swallowed up if you turn on optimization.
> Alternatively, you could declare a register variable and load the
> constant just once into the register variable.

At high optimization levels, I expect GCC to replace values
that are known to be constant with the constant itself. This is
called constant propagation. I know it happens...

And yes, the address constants may well be kept in a register,
if the compiler is astute. However, I know that GCC does some
rather strange things that are not intuitive.

-- Paul.

--- In l..., "kevin_townsend2" wrote:

> Sorry for the longish code chunk, but for completeness sake here's the C and generated code:

Looking at the ASM code, it seems there are at least three address constants that are loaded multiple times. I suspect the repeated loads will disappear if you turn on optimization. Alternatively, you could declare a register variable and load the constant just once into that variable.
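A minimal sketch of the "load the address once" idea, assuming the masks and data offset that can be inferred from the disassembly (they are not stated in the post). `write_cmd` is a hypothetical rewrite, not the driver's code: it takes the two port base pointers as parameters and keeps the reused alias addresses in locals. Whether this produces fewer instructions than the macro version depends on the compiler and optimisation level, so it is worth checking the generated code.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: masks and offset inferred from the disassembly. */
#define CS_CD_MASK  0x300u
#define RD_WR_MASK  0xC00u
#define WR_MASK     0x400u
#define WR_CS_MASK  0x500u
#define DATA_MASK   0x1FEu
#define DATA_OFFSET 1u

/* gpio1/gpio2 point at the start of each port's masked-access region.
   Indexing a uint32_t pointer by `mask` is the same as adding (mask << 2)
   to the byte address. On the target these would be
   (volatile uint32_t *)0x50010000 and (volatile uint32_t *)0x50020000. */
static void write_cmd(volatile uint32_t *gpio1, volatile uint32_t *gpio2,
                      unsigned command)
{
    assert(gpio1 && gpio2);

    /* Each reused alias address is formed once and kept in a local. */
    volatile uint32_t *const wr   = &gpio1[WR_MASK];
    volatile uint32_t *const data = &gpio2[DATA_MASK];

    gpio1[CS_CD_MASK] = 0u;                 /* CS and CD low   */
    gpio1[RD_WR_MASK] = RD_WR_MASK;         /* RD and WR high  */
    *data = command >> (8u - DATA_OFFSET);  /* high byte       */
    *wr = 0u;                               /* WR strobe       */
    *wr = WR_MASK;
    *data = command << DATA_OFFSET;         /* low byte        */
    *wr = 0u;
    gpio1[WR_CS_MASK] = WR_CS_MASK;         /* WR and CS high  */
}
```

On a host, two zeroed arrays of 0x1000 words can stand in for the ports. They only record the last value written to each alias and do not model the hardware masking.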

Paul:

I combined the first set of SET/CLR into one line (saves 4 commands), but here are the registers before that change:

http://code.google.com/p/lpc1343codebase/source/browse/branches/v0.60/drivers/lcd/tft/hw/ILI9325.h?r9

If you have any suggestions, I'd definitely appreciate them.

Kevin

Hello Kevin,

You haven't provided definitions of ILI9325_DATA_MASK or
ILI9325_WR_PIN or...

So... can't compile it. A little more, please? (One assumes pREG32
is a volatile unsigned *.)

-- Paul.

Paul:

> However, yes, the address constants may well be able to be loaded into
> a register, if the compiler is astute. However, I know that GCC does
> some rather strange things that are not intuitive.

I generally compile with -Os (32 KB part!), and it definitely improves performance a fair amount (6-8fps without optimisation vs. 15fps with). There may still be room for improvement, but it's hard to guess where. Is there a straightforward way to find a function in the optimised output and see what the compiler changed?
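One common way to do this with the GNU toolchain is to ask the compiler for its assembly output (`-S`, optionally with `-fverbose-asm`) or to disassemble the object file with `objdump -d`, then search for the function's name. A small function can be inlined into its callers at higher optimisation levels, and marking it `noinline` keeps a separate labelled copy to look at. The sketch below is only an illustration: `shift_to_bus` and the file names in the comment are hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-in for a small driver routine (not from the driver).
 *
 * noinline is a GCC/Clang extension. It keeps the function out of line,
 * so it appears under its own label in the output of, for example:
 *
 *   arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -Os -S -fverbose-asm lcd.c
 *   arm-none-eabi-objdump -d lcd.o
 *
 * lcd.c and lcd.o are placeholder file names.
 */
__attribute__((noinline))
unsigned shift_to_bus(unsigned command, unsigned offset)
{
    assert(offset < 8u); /* the data bus is 8 bits wide */
    return command << offset;
}
```

Comparing the `-S` output at `-O0` and `-Os` for the same function shows which loads and address constants the optimiser removed.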

I assume writing the best C code possible will hopefully improve the optimised code as well, but I'll need to do some benchmarks to really find out.

Kevin

Hello kevin_townsend2,

Friday, December 10, 2010, 8:17:12 PM, you wrote:

> Paul:

> I combined the first set of SET/CLR into one line (saves 4
> commands), but here are the registers before that change:

> http://code.google.com/p/lpc1343codebase/source/browse/branches/v0.60/drivers/lcd/tft/hw/ILI9325.h?r9

> If you have a suggestion, I'd definitely appreciate any suggestions.

Hi Kevin,

If you change it from "unsigned short cmd" (in effect) to "unsigned
cmd" then it improves.
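A short sketch of why the wider parameter type is safe here, under the assumption (inferred from the shifts and the 0x7F8 offset in the listings) that the data bus sits on bits 1 to 8 of its port. With a `uint16_t` parameter GCC may insert a zero-extension, as the `uxth r0, r0` in the uint16_t `-Os` listing later in this thread shows. Because the masked write ignores every bit outside the data mask, the extension does not change what reaches the pins. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: values inferred from the disassembly, not from the header. */
#define DATA_OFFSET 1u
#define DATA_MASK   0x1FEu

/* What the data pins receive: the masked write keeps only DATA_MASK bits. */
static unsigned high_byte_u16(uint16_t command)
{
    return ((unsigned)command >> (8u - DATA_OFFSET)) & DATA_MASK;
}

static unsigned high_byte_u32(unsigned command)
{
    return (command >> (8u - DATA_OFFSET)) & DATA_MASK;
}

static unsigned low_byte_u16(uint16_t command)
{
    return ((unsigned)command << DATA_OFFSET) & DATA_MASK;
}

static unsigned low_byte_u32(unsigned command)
{
    return (command << DATA_OFFSET) & DATA_MASK;
}

/* Exhaustive check over every 16-bit command value. */
static void check_equivalent(void)
{
    for (unsigned c = 0; c <= 0xFFFFu; ++c) {
        assert(high_byte_u16((uint16_t)c) == high_byte_u32(c));
        assert(low_byte_u16((uint16_t)c) == low_byte_u32(c));
    }
}
```

Bits above bit 15 of an `unsigned` argument end up outside the data mask after either shift, so they never reach the bus.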

Compiled -O3 (not brilliant, IMO):

B470 push {r4-r6}
F44F6540 mov.w r5, #0xC00
2200 movs r2, #0
F2C50501 movt r5, #0x5001
F44F5440 mov.w r4, #0x3000
602A str r2, [r5]
F2C50401 movt r4, #0x5001
F44F61FF mov.w r1, #0x7F8
F44F5380 mov.w r3, #0x1000
F44F6540 mov.w r5, #0xC00
6025 str r5, [r4]
F2C50301 movt r3, #0x5001
F2C50102 movt r1, #0x5002
09C6 lsrs r6, r0, #7
F44F5CA0 mov.w r12, #0x1400
0040 lsls r0, r0, #1
F44F6480 mov.w r4, #0x400
600E str r6, [r1]
F2C50C01 movt r12, #0x5001
601A str r2, [r3]
601C str r4, [r3]
6008 str r0, [r1]
601A str r2, [r3]
F44F63A0 mov.w r3, #0x500
F8CC3000 str.w r3, [r12, #0]
BC70 pop {r4-r6}
4770 bx lr
0x52 bytes

Compiled -O1:

F44F6340 mov.w r3, #0xC00
F2C50301 movt r3, #0x5001
F04F0200 mov.w r2, #0
601A str r2, [r3]
F44F5340 mov.w r3, #0x3000
F2C50301 movt r3, #0x5001
F44F6140 mov.w r1, #0xC00
6019 str r1, [r3]
EA4F13D0 lsr.w r3, r0, #7
F44F61FF mov.w r1, #0x7F8
F2C50102 movt r1, #0x5002
600B str r3, [r1]
F44F5380 mov.w r3, #0x1000
F2C50301 movt r3, #0x5001
601A str r2, [r3]
F44F6C80 mov.w r12, #0x400
F8C3C000 str.w r12, [r3, #0]
EA4F0040 lsl.w r0, r0, #1
6008 str r0, [r1]
601A str r2, [r3]
F44F53A0 mov.w r3, #0x1400
F2C50301 movt r3, #0x5001
F44F62A0 mov.w r2, #0x500
601A str r2, [r3]
4770 bx lr
0x54 bytes

Compiled -Os:

4B0C ldr r3, 0x00000034
2200 movs r2, #0
601A str r2, [r3]
F44F6140 mov.w r1, #0xC00
F5035310 add.w r3, r3, #0x2400
6019 str r1, [r3]
4909 ldr r1, 0x00000038
09C3 lsrs r3, r0, #7
600B str r3, [r1]
4B09 ldr r3, 0x0000003C
F44F6C80 mov.w r12, #0x400
0040 lsls r0, r0, #1
601A str r2, [r3]
F8C3C000 str.w r12, [r3, #0]
6008 str r0, [r1]
601A str r2, [r3]
F50262A0 add.w r2, r2, #0x500
4463 add r3, r12
601A str r2, [r3]
4770 bx lr
BF00 nop -- constants follow
50010C00 .word 0x50010C00
500207F8 .word 0x500207F8
50011000 .word 0x50011000
0x40 bytes

All with gcc 4.4.3. -O1 looks like a good bet.

However, if I use clang 2.9 and the wavefront LLVM with -O3 (I didn't
customize the optimization further):

F6404100 movw r1, #0xC00
2200 movs r2, #0
F44F6340 mov.w r3, #0xC00
F44F6C80 mov.w r12, #0x400
F2C50101 movt r1, #0x5001
600A str r2, [r1]
F2430100 movw r1, #0x3000
F2C50101 movt r1, #0x5001
600B str r3, [r1]
F24073F8 movw r3, #0x7F8
09C1 lsrs r1, r0, #7
F2C50302 movt r3, #0x5002
6019 str r1, [r3]
0040 lsls r0, r0, #1
F2410100 movw r1, #0x1000
F2C50101 movt r1, #0x5001
600A str r2, [r1]
F8C1C000 str.w r12, [r1, #0]
6018 str r0, [r3]
F2414000 movw r0, #0x1400
600A str r2, [r1]
F2C50001 movt r0, #0x5001
F44F61A0 mov.w r1, #0x500
6001 str r1, [r0]
4770 bx lr
0x4e bytes

Yeah, better than GCC.

-- Paul.

Paul:

Thanks for looking into it (somewhat exhaustively!). I checked myself before seeing your reply, and noticed that with -Os it went from 36 to 25 instructions. Not bad, but I'm really not familiar enough with ARM assembly to know whether there is still room to shave some extra fat off. There are necessarily 7 register writes and I don't think I can reduce that further. I have the pins grouped on single GPIO banks for efficiency as well.

I'm OK with ~15fps if this looks like the best possible, just wanted to know if this was the best I could do in these methods. It's always a useful exercise to look at things at this level anyway and I don't do it enough.

--- C --
#define ILI9325_GPIO1DATA_CS_CD_RD_WR (*(pREG32 (GPIO_GPIO1_BASE + ((ILI9325_CS_CD_RD_WR_PINS) << 2))))
#define CLR_CS_CD_SET_RD_WR ILI9325_GPIO1DATA_CS_CD_RD_WR = (ILI9325_RD_WR_PINS)

void ili9325WriteCmd(uint16_t command)
{
CLR_CS_CD_SET_RD_WR;
ILI9325_GPIO2DATA_DATA = (command >> (8 - ILI9325_DATA_OFFSET));
CLR_WR;
SET_WR;
ILI9325_GPIO2DATA_DATA = command << ILI9325_DATA_OFFSET;
CLR_WR;
SET_WR_CS;
}
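The combined macro relies on a masked write being able to clear some pins and set others in one store. The sketch below models that on a host, assuming the pin masks inferred from the disassembly (CS+CD = 0x300, RD+WR = 0xC00). `masked_write` is a hypothetical software model of the port, not a hardware access.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: masks inferred from the disassembly in this thread. */
#define CS_CD_MASK       0x300u
#define RD_WR_MASK       0xC00u
#define CS_CD_RD_WR_MASK (CS_CD_MASK | RD_WR_MASK)

/* Model of a masked write: only pins selected by `mask` take the new
   value, all other pins keep their previous state. */
static uint32_t masked_write(uint32_t pins, uint32_t mask, uint32_t value)
{
    assert(mask <= 0xFFFu); /* a port has at most 12 pins */
    return (pins & ~mask) | (value & mask);
}

/* Original sequence: CLR_CS_CD followed by SET_RD_WR (two stores). */
static uint32_t select_two_writes(uint32_t pins)
{
    pins = masked_write(pins, CS_CD_MASK, 0u);
    return masked_write(pins, RD_WR_MASK, RD_WR_MASK);
}

/* Combined sequence: one store through the four-pin mask. */
static uint32_t select_one_write(uint32_t pins)
{
    return masked_write(pins, CS_CD_RD_WR_MASK, RD_WR_MASK);
}
```

For every starting pin state the two versions leave the port in the same state. The only difference is that the two-store version passes through an intermediate state between the stores.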

--- ASM - GCC 4.4 -Os ---

4B0B ldr r3, [pc, #0x2C]
B280 uxth r0, r0
F44F6240 mov.w r2, #0xC00
490A ldr r1, [pc, #0x28]
601A str r2, [r3, #0]
09C3 lsrs r3, r0, #7
600B str r3, [r1, #0]
4B09 ldr r3, [pc, #0x24]
F5A26240 sub.w r2, r2, #0xC00
F44F6C80 mov.w r12, #0x400
0040 lsls r0, r0, #1
601A str r2, [r3, #0]
F8C3C000 str.w r12, [r3, #0]
6008 str r0, [r1, #0]
601A str r2, [r3, #0]
F50262A0 add.w r2, r2, #0x500
4463 add r3, r12
601A str r2, [r3, #0]
4770 bx lr
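50013C00 .word 0x50013C00
500207F8 .word 0x500207F8
50011000 .word 0x50011000

25 lines in the raw disassembly: 19 instructions plus three 32-bit address constants (the disassembler had split each constant across two lines and shown them as instructions)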

-----

Switching to 'unsigned int' (still -Os) as you suggested I get:


4B0B ldr r3, [pc, #0x2C]
F44F6240 mov.w r2, #0xC00
490B ldr r1, [pc, #0x2C]
601A str r2, [r3, #0]
09C3 lsrs r3, r0, #7
600B str r3, [r1, #0]
4B0A ldr r3, [pc, #0x28]
F5A26240 sub.w r2, r2, #0xC00
F44F6C80 mov.w r12, #0x400
0040 lsls r0, r0, #1
601A str r2, [r3, #0]
F8C3C000 str.w r12, [r3, #0]
6008 str r0, [r1, #0]
601A str r2, [r3, #0]
F50262A0 add.w r2, r2, #0x500
4463 add r3, r12
601A str r2, [r3, #0]
4770 bx lr
BF00 nop
50013C00 .word 0x50013C00
500207F8 .word 0x500207F8
50011000 .word 0x50011000

Also 25 lines in the raw disassembly: the uxth is gone (18 instructions), but a nop now pads the code so the same three address constants stay word-aligned

25 cycles per pixel when writing consecutively (one cmd, then endless data writes), or ~150 cycles when writing pixels randomly (set x, set y, set color). A bit more than I expected (perhaps I'm naive?), but screen updates at least feel instant. There's a bit of tearing visible on full-screen animations, but for a $7 240x320 16-bit LCD on a $3 MCU I probably can't demand too much either.
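As a rough sanity check on those figures, the arithmetic below converts cycles per pixel into a full-screen refresh rate. It assumes the whole 240x320 panel is redrawn, the core runs at 72 MHz, and the quoted cycle counts are the only cost. `full_frame_fps` is a hypothetical helper, and real throughput will be lower once loop and call overhead are included.

```c
#include <assert.h>

/* Upper bound on full-frame refresh rate for a given per-pixel cost.
   ASSUMPTION: every pixel costs exactly `cycles_per_pixel` CPU cycles
   and nothing else runs in between. */
static double full_frame_fps(double cpu_hz, unsigned width, unsigned height,
                             unsigned cycles_per_pixel)
{
    assert(width > 0u && height > 0u && cycles_per_pixel > 0u);
    return cpu_hz / ((double)width * (double)height * (double)cycles_per_pixel);
}
```

Under those assumptions, `full_frame_fps(72e6, 240, 320, 25)` gives 37.5 frames per second for consecutive writes and `full_frame_fps(72e6, 240, 320, 150)` gives 6.25 for fully random pixel writes.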

