
single ARM instruction to copy C into r0 ?

Started by Francois Grieu February 14, 2007
Hello,

I'm trying to replace the ARM sequence
    MOV  R0, #0
    ADC  R0, R0, #0

with a single ARM instruction that copies C into R0,
clearing bits 1..31. I do not care about the status bits
afterwards, and have no register with a known value.

Any ideas?

  Francois Grieu
On Feb 14, 5:26 pm, Francois Grieu <fgr...@gmail.com> wrote:
> Hello,
>
> I'm trying to replace the ARM sequence
>     MOV  R0, #0
>     ADC  R0, R0, #0
>
> with a single ARM instruction that copies C into R0,
> clearing bits 1..31. I do not care about the status bits
> afterwards, and have no register with a known value.
>
> Any ideas?
>
>   Francois Grieu

Hi Francois,

I gave myself about an hour to think about your riddle and I don't think it's possible. I'd be very glad to hear the opposite. (Still thinking about it, although it's really time to have a nap.)

PS. Why comp.arch.embedded? Try comp.sys.arm....
On Wed, 14 Feb 2007 18:26:08 +0100, Francois Grieu
<fgrieu@gmail.com> wrote:
> I'm trying to replace the ARM sequence
>     MOV  R0, #0
>     ADC  R0, R0, #0
>
> with a single ARM instruction that copies C into R0,
> clearing bits 1..31. I do not care about the status bits
> afterwards, and have no register with a known value.
>
> Any ideas?

Only one of these instructions will be executed:

    MOVCC R0, #0
    MOVCS R0, #1

Does that count?
"Boudewijn Dijkstra" <boudewijn@indes.com> wrote in message 
news:op.tnr849dcy6p7a2@ragnarok.lan...
> On Wed, 14 Feb 2007 18:26:08 +0100, Francois Grieu
> <fgrieu@gmail.com> wrote:
>> I'm trying to replace the ARM sequence
>>     MOV  R0, #0
>>     ADC  R0, R0, #0
>>
>> with a single ARM instruction that copies C into R0,
>> clearing bits 1..31. I do not care about the status bits
>> afterwards, and have no register with a known value.
>>
>> Any ideas?
>
> Only one of these instructions will be executed:
>
>     MOVCC R0, #0
>     MOVCS R0, #1
>
> Does that count?

Surely that depends on your view of what 'executed' means in this context. Is there an ARM processor where the unexecuted instruction takes no (extra) time? I tend to think of these conditional instructions as going down the pipeline, having (some of) their results calculated, but then the writeback inhibited. To be not executed at all, not taking up any execution resources (saving power or time), would require the knowledge of the carry flag setting during decode, stalling decode if there is a carry-changing instruction ahead in the pipeline (or carry prediction).

Of course, if 'not executed' means having no architectural side effects then presumably NOP is also not executed. OK, I'm being a pedant.

Anyway, the best I can do in one instruction is SBC R0,Rn,Rn but this sets R0 to -1 if no carry and 0 if carry - one too low in either case. Perhaps this can be compensated for later in the instruction stream but we have no info on how the result is to be used.

Peter
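For readers following the arithmetic, here is a minimal C sketch of the two candidates discussed so far. It assumes the usual ARM definitions (ADC computes Rn + Op2 + C, SBC computes Rn - Op2 - NOT(C)); the function names `arm_adc` and `arm_sbc` are made up for illustration, and only the result register is modelled, not the flags.

```c
#include <stdint.h>

/* Hypothetical model of ARM "ADC Rd, Rn, Op2": Rd = Rn + Op2 + C.
   Only the 32-bit result is modelled, not the updated flags. */
uint32_t arm_adc(uint32_t rn, uint32_t op2, unsigned carry)
{
    return rn + op2 + (carry & 1u);
}

/* Hypothetical model of ARM "SBC Rd, Rn, Op2": Rd = Rn - Op2 - NOT(C).
   With Rn == Op2 this yields C - 1, i.e. 0xFFFFFFFF or 0. */
uint32_t arm_sbc(uint32_t rn, uint32_t op2, unsigned carry)
{
    return rn - op2 - (1u - (carry & 1u));
}
```

Under these assumptions, `arm_adc(0, 0, c)` models the original two-instruction sequence and gives exactly c, while `arm_sbc(rn, rn, c)` gives c - 1 for any `rn`, which is why the single SBC comes out one too low.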
On Feb 15, 9:53 am, "Peter Dickerson"
<firstname.lastn...@REMOVE.tesco.net> wrote:
> "Boudewijn Dijkstra" <boudew...@indes.com> wrote in message
> news:op.tnr849dcy6p7a2@ragnarok.lan...
>
> > On Wed, 14 Feb 2007 18:26:08 +0100, Francois Grieu
> > <fgr...@gmail.com> wrote:
> >> I'm trying to replace the ARM sequence
> >>     MOV  R0, #0
> >>     ADC  R0, R0, #0
> >>
> >> with a single ARM instruction that copies C into R0,
> >> clearing bits 1..31. I do not care about the status bits
> >> afterwards, and have no register with a known value.
> >>
> >> Any ideas?
> >
> > Only one of these instructions will be executed:
> >
> >     MOVCC R0, #0
> >     MOVCS R0, #1
> >
> > Does that count?

;) I'm quite sure it doesn't.
1. This is too obvious.
2. Francois specified 'single instruction'.

> Surely that depends on your view of what 'executed' means in this context.
> Is there an ARM processor where the unexecuted instruction takes no (extra)
> time? I tend to think of these conditional instructions as going down the
In a few datasheets that I've read it was explicitly stated that 'unexecuted' conditional instructions take 1 cycle.
> pipeline, having (some of) their results calculated, but then the writeback
> inhibited. To be not executed at all, not taking up any execution resources
> (saving power or time), would require the knowledge of the carry flag
> setting during decode, stalling decode if there is a carry-changing
> instruction ahead in the pipeline (or carry prediction).
>
> Of course, if 'not executed' means having no architectural side effects then
> presumably NOP is also not executed. OK, I'm being a pedant.
AFAIK, there's no NOP in ARM, "mov r0,r0" is used instead (being pedantic too).
> Anyway, the best I can do in one instruction is SBC R0,Rn,Rn but this sets
> R0 to -1 if no carry and 0 if carry - one too low in either case. Perhaps this
> can be compensated for later in the instruction stream but we have no info
> on how the result is to be used.
>
> Peter

I've been racking my brains for a way to use the RRX shift somehow, but so far no luck.

I agree that if the OP gave us a slightly larger picture we could be more productive with proposals, but I guess he doesn't want to.

Disclaimer: I'm not familiar with all ARM architectures & variants, so some of my statements may be wrong.
In article <I4WAh.10129$tz6.6642@newsfe2-gui.ntli.net>,
 "Peter Dickerson" <firstname.lastname@REMOVE.tesco.net> wrote:

> Anyway, the best I can do in one instruction is SBC R0,Rn,Rn but this sets
> R0 to -1 if no carry and 0 if carry - one too low in either case. Perhaps this
> can be compensated for later in the instruction stream but we have no info
> on how the result is to be used.

The context is producing, then immediately storing, the last word of the result (of m+1 words) of the addition of two m-word integers in radix-2^32 representation; thus SBC R0,R0,R0 won't do without a major functional change.

The feedback seems to confirm my impression that there is no way to pervert the ARM instruction set into doing what I want.

I wish I knew a source with examples of useful ARM idioms; my current bible is the ARM Architecture Reference Manual (ARM DDI 0100E, 2000-06-23) and it is a bit shy on examples.

  Francois Grieu
In article <1171535295.984758.30100@m58g2000cwm.googlegroups.com>,
 "tum_" <atoumantsev_spam@mail.ru> wrote:

> if the OP gave us a slightly larger picture we could be
> more productive with proposals but I guess he doesn't want to.

I can tell without needing legal advice that
- CPU core is ARM922T
- context is this routine (not tested), performing addition of
  two m-word integers in radix-2^32 representation
- caller will immediately store the returned r0, and
  I do not want to change calling convention

; perform result = X+Y (expressed as little-endian radix 2^32)
; on entry:
;   r0 points to result
;   r1 and r2 point to sources X and Y
;   r3 length in bytes of X, Y and result, a non-negative multiple of 4
; on exit:
;   r0 is 1 or 0 depending on if result overflows or not
        STMFD   SP!,{r4-r5}     ; save temp registers used

        ADDS    r3, r3, #0      ; Z = (r3==0), C=0
        ADD     r3, r3, r1      ; r3 = r3 + r1, r3 points after end of X
        BEQ     adddone         ; -> early abort if Z is set (zero length)
addloop
        LDR     r4, [r1], #4    ; get 32-bit from X, advance pointer
        LDR     r5, [r2], #4    ; get 32-bit from Y, advance pointer
        ADCS    r4, r4, r5      ; C:r4 = r4+r5+C (the actual arithmetic)
        STR     r4, [r0], #4    ; store 32-bit into result, advance pointer
        TEQ     r1, r3          ; Z = (r1==r3)
        BNE     addloop         ; -> loop until r1 reaches r3
adddone
        MOV     R0, #0
        ADC     R0, R0, #0      ; r0 = C (could we save one instruction ?)
        LDMIA   SP!,{r4-r5}     ; restore temp registers used
        BX      LR              ; return to caller

Optimizing this is actually not critical, but I'm compacting the code to the max as an intellectual exercise to deeply familiarize myself with ARM.

  Francois Grieu
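As a cross-check for the (untested) routine above, here is a small C reference model of the same operation. It is only a sketch: the name `mp_add` is hypothetical, and it takes the length in 32-bit words rather than in bytes as the assembly does. It adds two little-endian radix-2^32 integers and returns the final carry, which is the value the assembly leaves in r0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model: result = x + y over `words` 32-bit limbs,
   least significant limb first. Returns the carry out of the top limb
   (0 or 1). Assumption: `words` is a word count, not the byte length
   that the assembly routine receives in r3. */
uint32_t mp_add(uint32_t *result, const uint32_t *x, const uint32_t *y,
                size_t words)
{
    uint32_t carry = 0; /* plays the role of the C flag, cleared on entry */

    for (size_t i = 0; i < words; i++) {
        /* 64-bit sum so the carry out of bit 31 is visible, like ADCS */
        uint64_t sum = (uint64_t)x[i] + y[i] + carry;
        result[i] = (uint32_t)sum;
        carry = (uint32_t)(sum >> 32);
    }
    return carry; /* corresponds to MOV R0,#0 / ADC R0,R0,#0 */
}
```

A zero-length call returns 0 without touching `result`, matching the early exit through `adddone` with C cleared.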
On Thu, 15 Feb 2007 11:28:16 +0100, tum_ <atoumantsev_spam@mail.ru> wrote:
> On Feb 15, 9:53 am, "Peter Dickerson"
> <firstname.lastn...@REMOVE.tesco.net> wrote:
>> "Boudewijn Dijkstra" <boudew...@indes.com> wrote in message
>> news:op.tnr849dcy6p7a2@ragnarok.lan...
>>
>> > On Wed, 14 Feb 2007 18:26:08 +0100, Francois Grieu
>> > <fgr...@gmail.com> wrote:
>> >> I'm trying to replace the ARM sequence
>> >>     MOV  R0, #0
>> >>     ADC  R0, R0, #0
>> >>
>> >> with a single ARM instruction that copies C into R0,
>> >> clearing bits 1..31. I do not care about the status bits
>> >> afterwards, and have no register with a known value.
>> >>
>> >> Any ideas?
>> >
>> > Only one of these instructions will be executed:
>> >
>> >     MOVCC R0, #0
>> >     MOVCS R0, #1
>> >
>> > Does that count?
>
> ;) I'm quite sure it doesn't.
> 1. This is too obvious

Often the obvious solution is accompanied by: "Why didn't I think of this before?"
> 2. Francois specified 'single instruction'.
He didn't specify whether it was supposed to be a stored instruction or an executed instruction.
>> Surely that depends on your view of what 'executed' means in this context.
>> Is there an ARM processor where the unexecuted instruction takes no (extra)
>> time? I tend to think of these conditional instructions as going down the
>
> In a few datasheets that I've read it was explicitly stated that
> 'unexecuted' conditional instructions take 1 cycle.

Yes. The execution stage of the pipeline just waits for the next instruction to ripple through.
On Feb 15, 10:58 am, Francois Grieu <fgr...@gmail.com> wrote:
> In article <1171535295.984758.30...@m58g2000cwm.googlegroups.com>,
> "tum_" <atoumantsev_s...@mail.ru> wrote:
> > if the OP gave us a slightly larger picture we could be
> > more productive with proposals but I guess he doesn't want to.
>
> I can tell without needing legal advice that
> - CPU core is ARM922T
> - context is this routine (not tested), performing addition of
>   two m-word integers in radix-2^32 representation
> - caller will immediately store the returned r0, and
>   I do not want to change calling convention
>
> ; perform result = X+Y (expressed as little-endian radix 2^32)
> ; on entry:
> ;   r0 points to result
> ;   r1 and r2 point to sources X and Y
> ;   r3 length in bytes of X, Y and result, a non-negative multiple of 4
> ; on exit:
> ;   r0 is 1 or 0 depending on if result overflows or not
>         STMFD   SP!,{r4-r5}     ; save temp registers used
>
>         ADDS    r3, r3, #0      ; Z = (r3==0), C=0
>         ADD     r3, r3, r1      ; r3 = r3 + r1, r3 points after end of X
>         BEQ     adddone         ; -> early abort if Z is set (zero length)
> addloop
>         LDR     r4, [r1], #4    ; get 32-bit from X, advance pointer
>         LDR     r5, [r2], #4    ; get 32-bit from Y, advance pointer
>         ADCS    r4, r4, r5      ; C:r4 = r4+r5+C (the actual arithmetic)
>         STR     r4, [r0], #4    ; store 32-bit into result, advance pointer
>         TEQ     r1, r3          ; Z = (r1==r3)
>         BNE     addloop         ; -> loop until r1 reaches r3
> adddone
>         MOV     R0, #0
>         ADC     R0, R0, #0      ; r0 = C (could we save one instruction ?)
>         LDMIA   SP!,{r4-r5}     ; restore temp registers used
>         BX      LR              ; return to caller
>
> Optimizing this is actually not critical, but I'm compacting the code to
> the max as an intellectual exercise to deeply familiarize myself with ARM.
>
> Francois Grieu

After 20 minutes of thinking: I can't squeeze it any further; let's see what others say. All that I can propose is:

1) Use r12 instead of r5. r12 doesn't have to be preserved. This will improve speed & stack usage.
2) Swap the BEQ and ADD instructions; this will improve speed in the case of zero length ;-).
3) When size is the issue, consider using Thumb (I understand that your goal is an exercise with ARM, not Thumb).

ps. My previous post still hasn't appeared in the thread (I use Google Groups); hopefully it will appear later, but I'll paste the link here just in case:
http://www.ee.ic.ac.uk/pcheung/teaching/ee2_computing/arm/Progtech.pdf
In article <1171539997.883911.255790@j27g2000cwj.googlegroups.com>,
 "tum_" <atoumantsev_spam@mail.ru> wrote:

> use r12 instead of r5. r12 doesn't have to be preserved.
> This will improve speed & stack usage.

Thanks, had missed that one, although it is implied by http://www.arm.com/miscPDFs/8031.pdf

Seems like, in a piece of code with only self-references (no linker veneer) and calling no external code, there is a rule, carved in stone nearly as strong as hardware, that register r12 belongs to me.

After this optimization, is it worthwhile, neutral, counterproductive or impossible to reformulate
    STMFD SP!,{r4}
into something like
    STR r4, [r13,#-4]
(did I get this right?); that kind of thing would be wise on a 680x0 (assuming condition codes do not matter).

> When size is the issue consider using Thumb (I understand
> that your goal is an exercise with ARM, not Thumb)

Yes, thanks. Also, in this context, since there is no TEQ in Thumb, I found no way to loop without interfering with the C bit, and this in turn made some extra instructions necessary; but indeed, it is probably still a bit more compact.

  Francois Grieu
