arm11/armv6 right shift signed packed values

Started by Unknown August 12, 2008
I'm attempting to pack some numbers for output after doing some work.
They're currently r7 = v0|v4 and r10 = v1|v5. They all need to be >>3,
before or after repacking. Output will be v1|v0 and v5|v4 (little
endian architecture). I managed to get the v1|v0 written reasonably
efficiently:
mov		r8, r10, asr #3		; 1>>3|xxx
pkhtb	r8, r8, r7, asr #19	; 1>>3|,0>>3
str		r8, [r0], r2			; o1|o0, post inc

But v5|v4 is a little ugly because I'm starting with the least
significant bits, so right shifting is going to drag in the bottom of
the upper word (right?). Right now I'm sign extending, then writing
individual shorts.
mov		r8, r10, asr #3	; 5 >> 3
strh		r8, [r0, #2]		; o5
sxth		r1, r1			;

sxth		r7, r7			;
mov		r8, r7, asr #3	; 4 >> 3
strh		r8, [r0], r2		; o4, post inc
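
The "drag in" concern can be checked off-target with a rough C model. This is only a sketch: asr, sxth, shift_low_naive and shift_low_sxth are names made up for this illustration, and they mimic the ARM operations on plain uint32_t values rather than reproducing the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Model of an ARM arithmetic shift right by n (1..31) on a 32-bit register. */
uint32_t asr(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    /* Replicate the sign bit into the n vacated top bits. */
    uint32_t sign_fill = (x & 0x80000000u) ? ~(0xFFFFFFFFu >> n) : 0u;
    return (x >> n) | sign_fill;
}

/* Model of SXTH: sign-extend the low halfword to 32 bits. */
uint32_t sxth(uint32_t x)
{
    x &= 0x0000FFFFu;
    return (x & 0x8000u) ? (x | 0xFFFF0000u) : x;
}

/* Shift the whole packed word, then keep the low halfword.
   The bottom 3 bits of the upper halfword leak into the result. */
uint32_t shift_low_naive(uint32_t packed)
{
    return asr(packed, 3) & 0x0000FFFFu;
}

/* Sign-extend the low halfword first, then shift, as in the posted code. */
uint32_t shift_low_sxth(uint32_t packed)
{
    return asr(sxth(packed), 3) & 0x0000FFFFu;
}
```

With upper = 1 and lower = 8 (0x00010008), the naive version gives 0x2001 instead of 0x0001. With upper = 0 and lower = -8 (0x0000FFF8), it gives 0x1FFF instead of 0xFFFF, so the sign is lost as well.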

I found http://forums.arm.com/index.php?showtopic=12823&pid=30383&st=0&#entry30383
PKHBT R3, R1, R2, LSL #15 ; R3 = [R2>>1, R1]
PKHTB R3, R3, R1, ASR #1 ; R3 = [R2>>1, R1>>1]
However, that seems to rely on the input being full words.

Is there a better way to do this?
On Aug 12, 10:22 am, johann.koe...@gmail.com wrote:
> mov		r8, r10, asr #3	; 5 >> 3
> strh		r8, [r0, #2]		; o5
> sxth		r1, r1			;
>
> sxth		r7, r7			;
> mov		r8, r7, asr #3	; 4 >> 3
> strh		r8, [r0], r2		; o4, post inc

Bit lazy with the copy/paste. Should be:
sxth		r10, r10		; sign extend 5
sxth		r7, r7			; sign extend 4
mov		r8, r10, asr #3	; 5 >> 3
strh		r8, [r0, #2]		; o5
mov		r8, r7, asr #3	; 4 >> 3
strh		r8, [r0], r2		; o4, post inc
<johann.koenig@gmail.com> wrote in message news:53ea9945-3838-40b2-836d-2c8f08c30efa@t54g2000hsg.googlegroups.com...
> I'm attempting to pack some numbers for output after doing some work.
> They're currently r7 = v0|v4 and r10 = v1|v5. They all need to be >>3,
> before or after repacking. Output will be v1|v0 and v5|v4 (little
> endian architecture). I managed to get the v1|v0 written reasonably
> efficiently:
> mov		r8, r10, asr #3		; 1>>3|xxx
> pkhtb	r8, r8, r7, asr #19	; 1>>3|,0>>3
> str		r8, [r0], r2			; o1|o0, post inc
>
> But v5|v4 is a little ugly because I'm starting with the least
> significant bits, so right shifting is going to drag in the bottom of
> the upper word (right?). Right now I'm sign extending, then writing
> individual shorts.
> mov		r8, r10, asr #3	; 5 >> 3
> strh		r8, [r0, #2]		; o5
> sxth		r1, r1			;
>
> sxth		r7, r7			;
> mov		r8, r7, asr #3	; 4 >> 3
> strh		r8, [r0], r2		; o4, post inc
>
> I found http://forums.arm.com/index.php?showtopic=12823&pid=30383&st=0&#entry30383
> PKHBT R3, R1, R2, LSL #15 ; R3 = [R2>>1, R1]
> PKHTB R3, R3, R1, ASR #1 ; R3 = [R2>>1, R1>>1]
> However, that seems to rely on the input being full words.
>
> Is there a better way to do this?
An easy alternative would be to shift r10 and r7 left by 16 and then apply
your first sequence. This way you save an instruction and can use a single str.

However, the best option would be to avoid shifting at this stage. Unless it
is the final result, delaying the shift until the next processing step might be
cheaper. Another possibility is to use halving additions if you do any, so
that the result is already shifted.

Wilco
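
For readers unfamiliar with halving additions: ARMv6 has packed instructions such as SHADD16, which adds the two signed halfword pairs and keeps bits [16:1] of each sum, so the result is already shifted right by one and cannot overflow. Below is a rough C model of that behaviour. It is a sketch for illustration only, and s16, floor_half and shadd16 are helper names invented here. Note that it only builds in a shift by one.

```c
#include <stdint.h>

/* Interpret the low 16 bits of h as a signed halfword. */
int32_t s16(uint32_t h)
{
    h &= 0x0000FFFFu;
    return (h & 0x8000u) ? (int32_t)h - 0x10000 : (int32_t)h;
}

/* Floor division by 2, written without relying on >> of negative values. */
int32_t floor_half(int32_t s)
{
    return (s >= 0) ? s / 2 : -((1 - s) / 2);
}

/* Model of SHADD16: per-halfword signed (a + b) >> 1. */
uint32_t shadd16(uint32_t a, uint32_t b)
{
    uint32_t lo = (uint32_t)floor_half(s16(a) + s16(b)) & 0x0000FFFFu;
    uint32_t hi = (uint32_t)floor_half(s16(a >> 16) + s16(b >> 16)) & 0x0000FFFFu;
    return (hi << 16) | lo;
}
```

For example, halfwords 6|-7 plus 4|2 (0x0006FFF9 and 0x00040002) give 5|-3, i.e. 0x0005FFFD, and 0x7FFF7FFF added to itself stays 0x7FFF7FFF instead of wrapping.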
On Aug 13, 5:07 am, "Wilco Dijkstra"
<Wilco.removethisDijks...@ntlworld.com> wrote:
> An easy alternative would be to shift r10 and r7 left by 16 and then apply
> your first sequence. This way you save an instruction and use str.
>
> However the best option would be to avoid shifting at this stage. Unless it
> is the final result, delaying the shift until the next processing step might be
> cheaper. Another possibility is to use halving additions if you do any, so
> that the result is already shifted.
>
> Wilco
Thanks for the tip. At first I thought it would use extra instructions to do
the shift, but then I realized that would just replace the sign extends.
Unfortunately, this is the only way I can do the operation. The shift has to
be the last thing, and can't be pre-processed at the receiving end.

New code saves 1 store per loop:
mov		r10, r10, lsl #16		; 5|x
mov		r7, r7, lsl #16		; 4|x
mov		r10, r10, asr #3		; 5>>3|xxx
pkhtb	r10, r10, r7, asr #19	; 5>>3|4>>3
str		r10, [r0], r2			; o5|o4, post inc

You mentioned halving addition, but I can't find anything about that. It
probably wouldn't help in this case, since the math goes like (x+y+4)>>3 or
(x-y+4)>>3. I can't add 4 first because subtraction isn't associative.

--
-Johann
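
The lsl #16 / asr #3 / pkhtb sequence can be sanity-checked against the sign-extend version with a small C model. This is a sketch under the assumption that registers are modelled as uint32_t. The function names (asr, sxth, pkhtb, pack_low_halves_shifted, pack_low_halves_reference) are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right by n (1..31), modelling "asr #n". */
uint32_t asr(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    uint32_t sign_fill = (x & 0x80000000u) ? ~(0xFFFFFFFFu >> n) : 0u;
    return (x >> n) | sign_fill;
}

/* Model of SXTH: sign-extend the low halfword. */
uint32_t sxth(uint32_t x)
{
    x &= 0x0000FFFFu;
    return (x & 0x8000u) ? (x | 0xFFFF0000u) : x;
}

/* Model of "pkhtb rd, rn, rm, asr #n": top half from rn,
   bottom half from (rm asr n). */
uint32_t pkhtb(uint32_t rn, uint32_t rm, unsigned n)
{
    return (rn & 0xFFFF0000u) | (asr(rm, n) & 0x0000FFFFu);
}

/* The new sequence: r10 = v1|v5, r7 = v0|v4, result = v5>>3 | v4>>3. */
uint32_t pack_low_halves_shifted(uint32_t r10, uint32_t r7)
{
    r10 <<= 16;               /* mov r10, r10, lsl #16 ; 5|x        */
    r7 <<= 16;                /* mov r7, r7, lsl #16   ; 4|x        */
    r10 = asr(r10, 3);        /* mov r10, r10, asr #3  ; 5>>3|xxx   */
    return pkhtb(r10, r7, 19);/* pkhtb ..., asr #19    ; 5>>3|4>>3  */
}

/* The earlier approach: sign-extend each low half, shift, then combine. */
uint32_t pack_low_halves_reference(uint32_t r10, uint32_t r7)
{
    uint32_t hi = asr(sxth(r10), 3) & 0x0000FFFFu;
    uint32_t lo = asr(sxth(r7), 3) & 0x0000FFFFu;
    return (hi << 16) | lo;
}
```

With r10 = 2|24 (0x00020018) and r7 = 1|-8 (0x0001FFF8), both functions return 0x0003FFFF, i.e. 3|-1, and the upper halfwords of the inputs have no effect on the result.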