single ARM instruction to copy C into r0 ?

Reply by Laurent ●February 15, 2007

Francois Grieu wrote:
>>When size is the issue consider using Thumb (I understand
>>that your goal is an exercise with ARM, not Thumb)
> 
> Yes, thanks. Also, in the context, since there is no TEQ in Thumb,
> I found no way to loop without interfering with the C bit, this
> in turn made some extra instructions necessary; but indeed,
> probably still a bit more compact.

Doesn't EOR fit your needs?


			Laurent

Reply by Laurent ●February 15, 2007

Laurent wrote:
>> Yes, thanks. Also, in the context, since there is no TEQ in Thumb,
>> I found no way to loop without interfering with the C bit, this
>> in turn made some extra instructions necessary; but indeed,
>> probably still a bit more compact.
> 
> 
> Doesn't EOR fit your needs?

No it doesn't, sorry :)


			Laurent

Reply by tum_ ●February 15, 2007

On Feb 15, 1:09 pm, Francois Grieu <fgr...@gmail.com> wrote:
> In article <1171539997.883911.255...@j27g2000cwj.googlegroups.com>,
>
>  "tum_" <atoumantsev_s...@mail.ru> wrote:
> > use r12 instead of r5. r12 doesn't have to be preserved.
> > This will improve speed & stack usage.
>
> Thanks, had missed that one, although it is implied by
> http://www.arm.com/miscPDFs/8031.pdf
>
> Seems like, in a piece of code with only self-references
> (no linker veneer), and calling no external code, there is a
> carved-in-a-next-as-strong-as-hardware-stone rule that
> register r12 belongs to me.

Yes, provided you're dealing with software that conforms to the
Procedure Call Standard published by ARM, where r12 (ip) is a scratch
register that a called routine need not preserve.
There may be software out there that does not conform.

> After this optimization, is it worth, neutral, counterproductive
> or impossible to reformulate STMFD  SP!,{r4} into something like
>   STR r4, [r13,#-4]
> (did I get this right?);

;) you forgot the '!' at the end (but I had to peep into the manual to
correct you, I don't know this stuff by heart).
To the best of my (limited) knowledge, the two instructions are
identical in effect/speed/size for ARM7 & 9 cores.
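
For readers who want to see what the '!' changes, here is a minimal C sketch of the three forms on a toy machine. The `Machine` struct and the function names are hypothetical, and the stack pointer is modelled as a word index rather than a byte address, so "minus one" stands for "#-4". It only models the effect on memory and SP, not timing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy machine: 8 words of stack, sp is a word index.
   The stack is "full descending": sp indexes the last pushed word. */
typedef struct {
    uint32_t mem[8];
    unsigned sp;
} Machine;

/* STMFD sp!, {r4}: decrement sp, store r4 there, keep the new sp. */
static void stmfd_one(Machine *m, uint32_t r4) {
    assert(m->sp > 0);
    m->sp -= 1;
    m->mem[m->sp] = r4;
}

/* STR r4, [sp, #-4]!: pre-indexed store WITH writeback, same effect. */
static void str_preindex_writeback(Machine *m, uint32_t r4) {
    assert(m->sp > 0);
    unsigned addr = m->sp - 1;   /* address = sp - 4 bytes */
    m->mem[addr] = r4;
    m->sp = addr;                /* the '!' writes the address back */
}

/* STR r4, [sp, #-4]: plain offset store, sp is left unchanged, so the
   stored word sits below the stack pointer and is not protected. */
static void str_offset_only(Machine *m, uint32_t r4) {
    assert(m->sp > 0);
    m->mem[m->sp - 1] = r4;
}
```

Starting from the same state, the first two functions leave identical memory and SP; the third stores the same word but leaves SP where it was.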


> that kind of thing would be wise on a
> 680x0 (assuming condition codes do not matter).

Why would it be wise? (not familiar with 68k)

> > When size is the issue consider using Thumb (I understand
> > that your goal is an exercise with ARM, not Thumb)
>
> Yes, thanks. Also, in the context, since there is no TEQ in Thumb,
> I found no way to loop without interfering with the C bit, this
> in turn made some extra instructions necessary; but indeed,
> probably still a bit more compact.
>
>   Francois Grieu

Reply by Peter Dickerson ●February 15, 2007

"Francois Grieu" <fgrieu@gmail.com> wrote in message 
news:fgrieu-DDB419.14093915022007@news-3.proxad.net...
> In article <1171539997.883911.255790@j27g2000cwj.googlegroups.com>,
> "tum_" <atoumantsev_spam@mail.ru> wrote:
>
>> use r12 instead of r5. r12 doesn't have to be preserved.
>> This will improve speed & stack usage.
>
> Thanks, had missed that one, although it is implied by
> http://www.arm.com/miscPDFs/8031.pdf
>
> Seems like, in a piece of code with only self-references
> (no linker veneer), and calling no external code, there is a
> carved-in-a-next-as-strong-as-hardware-stone rule that
> register r12 belongs to me.
>
> After this optimization, is it worth, neutral, counterproductive
> or impossible to reformulate STMFD  SP!,{r4} into something like
>  STR r4, [r13,#-4]
> (did I get this right?); that kind of thing would be wise on a
> 680x0 (assuming condition codes do not matter).
>
>
>> When size is the issue consider using Thumb (I understand
>> that your goal is an exercise with ARM, not Thumb)
>
> Yes, thanks. Also, in the context, since there is no TEQ in Thumb,
> I found no way to loop without interfering with the C bit, this
> in turn made some extra instructions necessary; but indeed,
> probably still a bit more compact.

I can make it more compact by removing two instructions from outside the 
loop and adding one inside and changing one slightly. Leave R3 as a count, 
counting down by 4. Then after the loop R3 is known to be zero so use ADC 
R0,R3,#0.

Peter

Reply by Wilco Dijkstra ●February 15, 2007

"Francois Grieu" <fgrieu@gmail.com> wrote in message 
news:fgrieu-BC8DB0.11584515022007@news-3.proxad.net...

> Optimizing this is actually not critical, but I'm compacting the code to
> the max as an intellectual exercise to deeply familiarize myself with ARM.

How about (using new UAL syntax):

        PUSH    {r14}
        ADDS    r3, r3, r1          ; r3 = r3 + r1, r3 points after end of X, C = 0
        B       loopstart
addloop
        LDR     r14, [r1], #4       ; get 32-bit from X, advance pointer
        LDR     r12, [r2], #4       ; get 32-bit from Y, advance pointer
        ADCS    r14, r14, r12       ; C:r14 = r14+r12+C  (the actual arithmetic)
        STR     r14, [r0], #4       ; store 32-bit into result, advance pointer
loopstart
        EORS    r14, r1, r3         ; Z = (r1==r3), r14 = 0 on exit, C untouched
        BNE     addloop             ; -> loop until r1 reaches r3
        ADC     r0, r14, #0         ; r0 = C
        POP     {pc}

Note this assumes r1 + r3 doesn't overflow, i.e. the array pointed to by r1
doesn't wrap around at the end of memory.
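
As a cross-check of what the routine computes, here is a minimal C sketch of the same loop. The name `mp_add` and the C signature are assumptions made for illustration; the comments map each step to the instruction it stands for. It models the arithmetic only, not the register usage or code size that the exercise is about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* result = X + Y, both little-endian arrays of 32-bit words,
   nbytes a non-negative multiple of 4. Returns the final carry (0 or 1).
   mp_add is a hypothetical name, not part of the original routine. */
static uint32_t mp_add(uint32_t *result, const uint32_t *x,
                       const uint32_t *y, size_t nbytes) {
    assert(nbytes % 4 == 0);
    const uint32_t *end = x + nbytes / 4;  /* ADDS r3, r3, r1            */
    uint32_t carry = 0;                    /* C = 0 (no pointer overflow) */
    while (x != end) {                     /* EORS r14, r1, r3 / BNE      */
        uint64_t sum = (uint64_t)*x++ + *y++ + carry; /* LDR, LDR, ADCS  */
        *result++ = (uint32_t)sum;         /* STR r14, [r0], #4           */
        carry = (uint32_t)(sum >> 32);     /* carry out of bit 31         */
    }
    return carry;                          /* ADC r0, r14, #0 with r14==0 */
}
```

For example, adding the two-word values {0xFFFFFFFF, 1} and {1, 0xFFFFFFFE} gives {0, 0} with a returned carry of 1.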

Wilco

Reply by tum_ ●February 15, 2007

On Feb 15, 3:13 pm, "Peter Dickerson"
<firstname.lastn...@REMOVE.tesco.net> wrote:
> "Francois Grieu" <fgr...@gmail.com> wrote in message
>
> news:fgrieu-DDB419.14093915022007@news-3.proxad.net...
>
>
>
> > In article <1171539997.883911.255...@j27g2000cwj.googlegroups.com>,
> > "tum_" <atoumantsev_s...@mail.ru> wrote:
>
> >> use r12 instead of r5. r12 doesn't have to be preserved.
> >> This will improve speed & stack usage.
>
> > Thanks, had missed that one, although it is implied by
> > http://www.arm.com/miscPDFs/8031.pdf
>
> > Seems like, in a piece of code with only self-references
> > (no linker veneer), and calling no external code, there is a
> > carved-in-a-next-as-strong-as-hardware-stone rule that
> > register r12 belongs to me.
>
> > After this optimization, is it worth, neutral, counterproductive
> > or impossible to reformulate STMFD  SP!,{r4} into something like
> >  STR r4, [r13,#-4]
> > (did I get this right?); that kind of thing would be wise on a
> > 680x0 (assuming condition codes do not matter).
>
> >> When size is the issue consider using Thumb (I understand
> >> that your goal is an exercise with ARM, not Thumb)
>
> > Yes, thanks. Also, in the context, since there is no TEQ in Thumb,
> > I found no way to loop without interfering with the C bit, this
> > in turn made some extra instructions necessary; but indeed,
> > probably still a bit more compact.
>
> I can make it more compact by removing two instructions from outside the
> loop and adding one inside and changing one slightly. Leave R3 as a count,
> counting down by 4. Then after the loop R3 is known to be zero so use ADC
> R0,R3,#0.
>
> Peter

SUBS r3,r3,#4 ?

But this will kill the carry... or am I missing something? Sorry, I'm
in a bit of a hurry at the moment.
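
To make the concern concrete, here is a small C sketch of how SUBS updates the flags. The `Flags` struct and the `subs` function are hypothetical names for a simplified model. On ARM, C after a subtraction is "NOT borrow", so it is rewritten on every SUBS regardless of what ADCS left there.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the four ARM condition flags. */
typedef struct { unsigned n, z, c, v; } Flags;

/* SUBS rd, rn, op2: all four flags are rewritten.
   C is set when no borrow occurs, i.e. when rn >= op2 (unsigned). */
static uint32_t subs(uint32_t rn, uint32_t op2, Flags *f) {
    uint32_t rd = rn - op2;
    f->n = rd >> 31;
    f->z = (rd == 0);
    f->c = (rn >= op2);                            /* C = NOT borrow  */
    f->v = (((rn ^ op2) & (rn ^ rd)) >> 31) & 1u;  /* signed overflow */
    return rd;
}
```

So if the addition left C = 0 and the count in r3 is still 8, `SUBS r3, r3, #4` leaves C = 1, and the carry of the multi-word addition is lost.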

Reply by tum_ ●February 15, 2007

On Feb 15, 3:19 pm, "Wilco Dijkstra" <Wilco_dot_Dijks...@ntlworld.com>
wrote:
> "Francois Grieu" <fgr...@gmail.com> wrote in message
>
> news:fgrieu-BC8DB0.11584515022007@news-3.proxad.net...
>
> > Optimizing this is actually not critical, but I'm compacting the code to
> > the max as an intellectual exercise to deeply familiarize myself with ARM.
>
> How about (using new UAL syntax):
>
>         PUSH    {r14}
>         ADDS    r3, r3, r1          ; r3 = r3 + r1, r3 points after end of X, C = 0
>         B       loopstart
> addloop
>         LDR     r14, [r1], #4       ; get 32-bit from X, advance pointer
>         LDR     r12, [r2], #4       ; get 32-bit from Y, advance pointer
>         ADCS    r14, r14, r12       ; C:r14 = r14+r12+C  (the actual arithmetic)
>         STR     r14, [r0], #4       ; store 32-bit into result, advance pointer
> loopstart
>         EORS    r14, r1, r3         ; Z = (r1==r3), r14 = 0
>         BNE     addloop             ; -> loop until r1 reaches r3
>         ADC     r0, r14, #0         ; r0 = C
>         POP     {pc}
>
> Note this assumes r1 + r3 doesn't overflow, i.e. the array pointed to by r1
> doesn't wrap around at the end of memory.
>
> Wilco

Nice.
Just another example of a solution that appears obvious after someone
has shown it to you ;)))
Nice. EORS doesn't touch the C if there are no shifts involved.
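
A simplified C model of the flag behaviour may help here. The `Flags` struct and the functions `eors_reg` and `adcs` are hypothetical names; the model assumes the second operand of EORS is a plain register with no shift, in which case shifter_carry_out is just the old C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the four ARM condition flags. */
typedef struct { unsigned n, z, c, v; } Flags;

/* EORS rd, rn, rm with an unshifted register operand:
   N and Z come from the result; C = shifter_carry_out, which is the
   existing C when no shift is applied; V is not affected. */
static uint32_t eors_reg(uint32_t rn, uint32_t rm, Flags *f) {
    uint32_t rd = rn ^ rm;
    f->n = rd >> 31;
    f->z = (rd == 0);
    return rd;                  /* f->c and f->v deliberately untouched */
}

/* ADCS rd, rn, rm: adds with carry in and rewrites all four flags. */
static uint32_t adcs(uint32_t rn, uint32_t rm, Flags *f) {
    uint64_t wide = (uint64_t)rn + rm + f->c;
    uint32_t rd = (uint32_t)wide;
    f->n = rd >> 31;
    f->z = (rd == 0);
    f->c = (unsigned)(wide >> 32);                  /* carry out      */
    f->v = ((~(rn ^ rm) & (rn ^ rd)) >> 31) & 1u;   /* signed overflow */
    return rd;
}
```

In this model the carry produced by `adcs` survives any number of `eors_reg` calls, and when the two pointers are equal the result is 0 with Z set, which is what lets the final `ADC r0, r14, #0` return just C.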

Reply by Wilco Dijkstra ●February 15, 2007

"tum_" <atoumantsev_spam@mail.ru> wrote in message 
news:1171550594.264411.38720@j27g2000cwj.googlegroups.com...
> On Feb 15, 1:09 pm, Francois Grieu <fgr...@gmail.com> wrote:

>> After this optimization, is it worth, neutral, counterproductive
>> or impossible to reformulate STMFD  SP!,{r4} into something like
>>   STR r4, [r13,#-4]
>> (did I get this right?);
>
> ;) you forgot the '!' at the end (but I had to peep into the manual to
> correct you, I don't know this stuff by heart).
> To the best of my (limited) knowledge, the two instructions are
> identical in effect/speed/size for ARM7 & 9 cores.

No, an LDM/STM of one register takes 2 cycles on ARM9 while an
LDR takes just 1, so it is best to avoid single register LDMs on ARM9.
Thumb-2 doesn't support single register LDM/STM although Thumb-1
supports single register PUSH/POP. They are useful for codesize.

Wilco

Reply by tum_ ●February 15, 2007

On Feb 15, 3:35 pm, "Wilco Dijkstra" <Wilco_dot_Dijks...@ntlworld.com>
wrote:
> "tum_" <atoumantsev_s...@mail.ru> wrote in message
>
> news:1171550594.264411.38720@j27g2000cwj.googlegroups.com...
>
> > On Feb 15, 1:09 pm, Francois Grieu <fgr...@gmail.com> wrote:
> >> After this optimization, is it worth, neutral, counterproductive
> >> or impossible to reformulate STMFD  SP!,{r4} into something like
> >>   STR r4, [r13,#-4]
> >> (did I get this right?);
>
> > ;) you forgot the '!' at the end (but I had to peep into the manual to
> > correct you, I don't know this stuff by heart).
> > To the best of my (limited) knowledge, the two instructions are
> > identical in effect/speed/size for ARM7 & 9 cores.
>
> No, an LDM/STM of one register takes 2 cycles on ARM9 while an
> LDR takes just 1, so it is best to avoid single register LDMs on ARM9.
> Thumb-2 doesn't support single register LDM/STM although Thumb-1
> supports single register PUSH/POP. They are useful for codesize.
>
> Wilco

Thanks. ARM9 is still new to me.

Reply by Francois Grieu ●February 15, 2007

In article <4S_Ah.11285$Zl6.274@newsfe3-win.ntli.net>,
 "Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> proposed:

> (using new UAL syntax):
> 
>; perform result = X+Y (expressed as little-endian radix 2^32)
>; on entry:
>;   r0 points to result
>;   r1 and r2 point to sources X and Y
>;   r3 length in bytes of X, Y and result, a non-negative multiple of 4
>; on exit:
>;   r0 is 1 or 0 depending on whether the result overflows or not
>     PUSH    {r14}
>     ADDS    r3, r3, r1          ; r3 = r3 + r1, r3 points after end of X, C = 0
>     B       loopstart
> addloop
>     LDR     r14, [r1], #4       ; get 32-bit from X, advance pointer
>     LDR     r12, [r2], #4       ; get 32-bit from Y, advance pointer
>     ADCS    r14, r14, r12       ; C:r14 = r14+r12+C  (the actual arithmetic)
>     STR     r14, [r0], #4       ; store 32-bit into result, advance pointer
> loopstart
>     EORS    r14, r1, r3         ; Z = (r1==r3), r14 = 0
>     BNE     addloop             ; -> loop until r1 reaches r3
>     ADC     r0, r14, #0         ; r0 = C
>     POP     {pc}
> 
> Note this assumes r1 + r3 doesn't overflow, i.e. the array pointed to by r1
> doesn't wrap around at the end of memory.

Actually the assumption is stronger, and quite a bit less safe: it is that
the array pointed to by r1 does not REACH 0xFFFFFFFF.   ADDS r3, r3, r1
is still a nice trick, if not one that I would dare to promote heavily.
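
The difference between "wraps" and "reaches" can be seen in a small C sketch of that first instruction. The function name `adds_end_pointer` is hypothetical; it returns the carry that ADDS would leave in C and writes the computed end pointer through `end`.

```c
#include <assert.h>
#include <stdint.h>

/* Models ADDS r3, r3, r1: start is the address of X (r1), nbytes its
   length (r3). Writes the end pointer and returns the resulting C flag.
   adds_end_pointer is a hypothetical name used for illustration. */
static unsigned adds_end_pointer(uint32_t start, uint32_t nbytes,
                                 uint32_t *end) {
    uint64_t wide = (uint64_t)start + nbytes;
    *end = (uint32_t)wide;           /* r3 = r3 + r1, truncated to 32 bits */
    return (unsigned)(wide >> 32);   /* C = carry out of bit 31            */
}
```

An 8-byte array at 0xFFFFFFF8 occupies bytes up to 0xFFFFFFFF without wrapping, yet the end pointer comes out as 0 and C is set to 1, so the loop would start with a spurious carry in. The same array with only 4 bytes ends at 0xFFFFFFFC and leaves C = 0.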

The real gem is   EORS r14, r1, r3  and how it leaves R14 zeroed.
I had wrongly concluded that "C Flag = shifter_carry_out" meant that C
was destroyed by EORS, and now realize it is not, which opens a whole
new universe of possibilities. Thanks a lot.

Also I like PUSH {r14} / POP {pc}
After a considerable hunt in ARM DDI 0100E (2000-06-23), I conclude that
"On architecture version 5 and above" (my target), it is a perfectly
legitimate idiom to preserve a working register, and return, including
switching back to Thumb mode as needed.
This can be put to excellent use in a lot of code; looks like if a
terminal subroutine needs to preserve some registers for temp usage,
it pays to make r14 part of the temporary registers pool, and return
by restoring the saved r14/LR into r15/PC, leaving r14 indeterminate,
which is allowed by the usual calling conventions.


Thanks a lot, Wilco Dijkstra. BTW it was fun to see someone with
your name use   B loopstart ;-)
Is it a FAQ to ask the relationship with Edsger W. Dijkstra?


   Francois Grieu
