
Fast bit-reverse on an x86?

Started by Tim Wescott November 2, 2010
On Nov 5, 10:54 am, "Harold Aptroot" <harold.aptr...@gmail.com> wrote:
> See here for a "similar" thing for popcnt (as it turns out, the PSHUFB
> solution is faster than using the SSE4 POPCNT instruction)
> http://wm.ite.pl/articles/sse-popcount.html
>
> The same trick can be used here, without even the "gather" stage that makes
> the popcnt slightly awkward.
>
> I might write it like this (untested and not tweaked, and I'm not fully
> awake yet)
>
> ; xmm0 = input (16 bytes to reverse)
> ; xmm5 = magic lookup table Low (0, 8, 4 etc)
> ; xmm6 = magic lookup table High (0, 80h, 40h, etc)
> ; xmm7 = packed(0Fh)
> ; the last two are input because this is expected to be the body of a loop
>
> movdqa xmm1, xmm0
> psrlw xmm0, 4
>
> pand xmm1, xmm7
> pand xmm0, xmm7
>
> movdqa xmm2, xmm6
> movdqa xmm3, xmm5
>
> pshufb xmm2, xmm0
> pshufb xmm3, xmm1
>
> por xmm2, xmm3
>
> ; output = xmm2
>
> Especially on an i7, that would be very fast. On a Core2 it'd be nice too.
> That gives this routine plenty of potential users.
Well, that's very good. Without checking I can see it has the potential to
be a lot faster, at least under certain conditions: the bytes to be
reversed have to be stored next to each other, you have to wait for 16
values, there are possible end cases if the count is not a multiple of 16,
and the pshufb instruction requires SSSE3, which is far from universal
according to http://en.wikipedia.org/wiki/SSSE3

Given those issues and the OP's problem description, his own suggestion of
a 256-byte lookup table (or a 16-byte one used twice) still seems best,
but he may want to consider the above, so I'm copying the reply back to
the original groups.

BTW, I'm not sure if you are aware, but both your replies have dropped the
original groups. Why choose to do that? Or is it a newsreader setting or
restriction?

James
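(For readers without an assembler handy, here is the same nibble-table
trick as a minimal, untested C sketch using SSSE3 intrinsics; the function
and variable names are illustrative, not from Harold's post.)

#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* Reverse the bits of each byte in a 16-byte vector using two
   16-entry nibble lookup tables, as in the PSHUFB code above. */
static __m128i reverse_bits_16bytes(__m128i v)
{
    const __m128i lut_lo = _mm_setr_epi8(     /* nibble n, bit-reversed */
        0x00, 0x08, 0x04, 0x0C, 0x02, 0x0A, 0x06, 0x0E,
        0x01, 0x09, 0x05, 0x0D, 0x03, 0x0B, 0x07, 0x0F);
    const __m128i lut_hi = _mm_setr_epi8(     /* same, shifted to the high nibble */
        0x00, (char)0x80, 0x40, (char)0xC0, 0x20, (char)0xA0, 0x60, (char)0xE0,
        0x10, (char)0x90, 0x50, (char)0xD0, 0x30, (char)0xB0, 0x70, (char)0xF0);
    const __m128i mask = _mm_set1_epi8(0x0F);

    __m128i lo = _mm_and_si128(v, mask);                     /* low nibbles  */
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);  /* high nibbles */
    /* reversed byte = (reversed low nibble << 4) | reversed high nibble */
    return _mm_or_si128(_mm_shuffle_epi8(lut_hi, lo),
                        _mm_shuffle_epi8(lut_lo, hi));
}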
"James Van Buskirk" <not_valid@comcast.net> wrote in message 
news:ib18lc$fm4$1@news.eternal-september.org...

> I think that the pseudocode:
> mov byte(al)
> is intended to be the super-slow XLAT instruction, and the:
> mov byte(ah)
> is intended to be the nonexistent XLATH instruction. Then:
> bswap dword (eax)
> is intended to cause a partial register stall via the instruction:
> bswap eax
> So he would need another bswap instruction (and another PRS)
> to get the bytes back in order. Overall, perhaps an order of
> magnitude slower than the SSE2 code I posted, once he gets
> working code. Get the partial register stalls and other issues
> worked out and still significantly slower.
I'm kind of rusty at FASM so I thought I might as well time a couple of
algorithms as an exercise. In the following, reverse_SSSE3 is the code
using PSHUFB (my version, not Harold's), reverse_SSE2 is the
divide-and-conquer code using shifting and masking, reverse_xlatb is the
code using xlatb as proposed by Rod Pemberton, and reverse_bytes and
reverse_store are two codes intended to be similar to the code of aleksa.
reverse_bytes accumulates results in a register and then stores them en
masse, whereas reverse_store stores each reversed byte as it is generated.

OS, assembler and compiler:

C:\gfortran\clf\reverse>ver
Microsoft Windows [Version 5.2.3790]

C:\gfortran\clf\reverse>fasm
flat assembler  version 1.67.18
usage: fasm <source> [output]
optional settings:
 -m <limit>   set the limit in kilobytes for the memory available to assembler
 -p <limit>   set the maximum allowed number of passes

C:\gfortran\clf\reverse>gfortran -v
Built by Equation Solution <http://www.Equation.com>.
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=c:/gcc_equation/bin/../libexec/gcc/x86_64-pc-mingw32/4.6.0/lto-wrapper.exe
Target: x86_64-pc-mingw32
Configured with: ../gcc-4.6-20100626-mingw/configure CFLAGS=-O0
  --host=x86_64-pc-mingw32 --build=x86_64-unknown-linux-gnu
  --target=x86_64-pc-mingw32
  --prefix=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/gcc/4.6-20100626
  --with-gmp=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/gmp
  --with-mpfr=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/mpfr
  --with-mpc=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/mpc
  --with-sysroot=/home/gfortran/gcc-home/binary/mingw32/cross/x86_64/gcc/4.6-20100626
  --with-gcc --with-gnu-ld --with-gnu-as --disable-shared --disable-nls
  --disable-tls --enable-libgomp --enable-languages=c,fortran,c++
  --enable-threads=win32 --disable-win32-registry
Thread model: win32
gcc version 4.6.0 20100626 (experimental) (GCC)

C:\gfortran\clf\reverse>type reverse.asm
format MS64 coff

; File: reverse.asm
; Assembled with: fasm reverse.asm

section 'CODE' code readable executable align 16

align 16
public reverse_SSSE3
reverse_SSSE3:
   mov r10, rdx
   rdtsc
   shl rdx, 32
   lea r9, [rdx+rax]      ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea rax, [r10+rdx]
   neg rdx
   jns loop1_end
loop1:
   movdqa xmm0, [rcx+rdx]
   movdqa xmm1, [magic1]  ; Assume 16 bytes to be reversed are in xmm0
   pand xmm1, xmm0        ; xmm1 has high nibbles
   pxor xmm0, xmm1        ; xmm0 has low nibbles
   psrld xmm1, 4          ; Shift high nibbles to low position
   movdqa xmm2, [magic2]  ; High reversed LUT
   pshufb xmm2, xmm0      ; xmm2 has high nibbles of reversed bytes
   movdqa xmm0, [magic3]  ; Low reversed LUT
   pshufb xmm0, xmm1      ; xmm0 has low nibbles of reversed bytes
   por xmm0, xmm2         ; Now xmm0 has the 16 reversed bytes
   movdqa [rax+rdx], xmm0
   add rdx, 16
   js loop1
loop1_end:
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_SSE2
reverse_SSE2:
   mov r10, rdx
   rdtsc
   shl rdx, 32
   lea r9, [rdx+rax]      ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea rax, [r10+rdx]
   neg rdx
   jns loop2_end
loop2:
   movdqa xmm0, [rcx+rdx]
   movdqa xmm1, [magic1]  ; Assume 16 bytes to be reversed are in xmm0
   pand xmm1, xmm0
   pxor xmm0, xmm1
   psrld xmm1, 4
   pslld xmm0, 4
   por xmm0, xmm1         ; Groups of 4 bits have been swapped
   movdqa xmm1, [magic4]
   pand xmm1, xmm0
   pxor xmm0, xmm1
   psrld xmm1, 2
   pslld xmm0, 2
   por xmm0, xmm1         ; Groups of 2 bits have been swapped
   movdqa xmm1, [magic5]
   pand xmm1, xmm0
   pxor xmm0, xmm1
   psrld xmm1, 1
   paddd xmm0, xmm0
   por xmm0, xmm1         ; Now xmm0 has the 16 reversed bytes
   movdqa [rax+rdx], xmm0
   add rdx, 16
   js loop2
loop2_end:
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_xlatb
reverse_xlatb:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   lea rbx, [LUT8]
   shl rdx, 32
   lea r9, [rdx+rax]      ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea rsi, [r10+rdx]
   neg rdx
   jns loop3_end
loop3:
   mov eax, [rcx+rdx]
   db 48h
   xlatb
   rol eax, 8
   db 48h
   xlatb
   rol eax, 8
   db 48h
   xlatb
   rol eax, 8
   db 48h
   xlatb
   rol eax, 8
   mov [rsi+rdx], eax
   add rdx, 4
   js loop3
loop3_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_bytes
reverse_bytes:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   lea rbx, [LUT8]
   shl rdx, 32
   lea r9, [rdx+rax]      ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea r8, [r10+rdx]
   neg rdx
   jns loop4_end
loop4:
   mov rax, [rcx+rdx]
   rol rax, 16
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   shl esi, 8
   movzx r10d, al
   movzx r10d, byte [rbx+r10]
   or r10d, esi
   rol rax, 16
   shl r10d, 8
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   or r10d, esi
   shl r10d, 8
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   or r10d, esi
   rol rax, 16
   shl r10, 8
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   or r10, rsi
   shl r10, 8
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   or r10, rsi
   rol rax, 16
   shl r10, 8
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   or r10, rsi
   shl r10, 8
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   lea rax, [rsi+r10]
   mov [r8+rdx], rax
   add rdx, 8
   js loop4
loop4_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_store
reverse_store:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   lea rbx, [LUT8]
   shl rdx, 32
   lea r9, [rdx+rax]      ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea r8, [r10+rdx]
   neg rdx
   jns loop5_end
loop5:
   mov rax, [rcx+rdx]
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+1], sil
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx], sil
   shr rax, 16
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+3], sil
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+2], sil
   shr rax, 16
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+5], sil
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+4], sil
   shr rax, 16
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+7], sil
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+6], sil
   add rdx, 8
   js loop5
loop5_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

section 'DATA' data readable writeable align 16

align 16
label magic1 dqword
   db 0f0h,0f0h,0f0h,0f0h,0f0h,0f0h,0f0h,0f0h
   db 0f0h,0f0h,0f0h,0f0h,0f0h,0f0h,0f0h,0f0h
label magic2 dqword
   db 0h, 80h, 40h,0c0h, 20h,0a0h, 60h,0e0h
   db 10h, 90h, 50h,0d0h, 30h,0b0h, 70h,0f0h
label magic3 dqword
   db 0h, 8h, 4h, 0ch, 2h, 0ah, 6h, 0eh
   db 1h, 9h, 5h, 0dh, 3h, 0bh, 7h, 0fh
label magic4 dqword
   db 0cch,0cch,0cch,0cch,0cch,0cch,0cch,0cch
   db 0cch,0cch,0cch,0cch,0cch,0cch,0cch,0cch
label magic5 dqword
   db 0aah,0aah,0aah,0aah,0aah,0aah,0aah,0aah
   db 0aah,0aah,0aah,0aah,0aah,0aah,0aah,0aah
LUT8 db 0h, 80h, 40h,0C0h, 20h,0A0h, 60h,0E0h
   db 10h, 90h, 50h,0D0h, 30h,0B0h, 70h,0F0h
   db 8h, 88h, 48h,0C8h, 28h,0A8h, 68h,0E8h
   db 18h, 98h, 58h,0D8h, 38h,0B8h, 78h,0F8h
   db 4h, 84h, 44h,0C4h, 24h,0A4h, 64h,0E4h
   db 14h, 94h, 54h,0D4h, 34h,0B4h, 74h,0F4h
   db 0Ch, 8Ch, 4Ch,0CCh, 2Ch,0ACh, 6Ch,0ECh
   db 1Ch, 9Ch, 5Ch,0DCh, 3Ch,0BCh, 7Ch,0FCh
   db 2h, 82h, 42h,0C2h, 22h,0A2h, 62h,0E2h
   db 12h, 92h, 52h,0D2h, 32h,0B2h, 72h,0F2h
   db 0Ah, 8Ah, 4Ah,0CAh, 2Ah,0AAh, 6Ah,0EAh
   db 1Ah, 9Ah, 5Ah,0DAh, 3Ah,0BAh, 7Ah,0FAh
   db 6h, 86h, 46h,0C6h, 26h,0A6h, 66h,0E6h
   db 16h, 96h, 56h,0D6h, 36h,0B6h, 76h,0F6h
   db 0Eh, 8Eh, 4Eh,0CEh, 2Eh,0AEh, 6Eh,0EEh
   db 1Eh, 9Eh, 5Eh,0DEh, 3Eh,0BEh, 7Eh,0FEh
   db 1h, 81h, 41h,0C1h, 21h,0A1h, 61h,0E1h
   db 11h, 91h, 51h,0D1h, 31h,0B1h, 71h,0F1h
   db 9h, 89h, 49h,0C9h, 29h,0A9h, 69h,0E9h
   db 19h, 99h, 59h,0D9h, 39h,0B9h, 79h,0F9h
   db 5h, 85h, 45h,0C5h, 25h,0A5h, 65h,0E5h
   db 15h, 95h, 55h,0D5h, 35h,0B5h, 75h,0F5h
   db 0Dh, 8Dh, 4Dh,0CDh, 2Dh,0ADh, 6Dh,0EDh
   db 1Dh, 9Dh, 5Dh,0DDh, 3Dh,0BDh, 7Dh,0FDh
   db 3h, 83h, 43h,0C3h, 23h,0A3h, 63h,0E3h
   db 13h, 93h, 53h,0D3h, 33h,0B3h, 73h,0F3h
   db 0Bh, 8Bh, 4Bh,0CBh, 2Bh,0ABh, 6Bh,0EBh
   db 1Bh, 9Bh, 5Bh,0DBh, 3Bh,0BBh, 7Bh,0FBh
   db 7h, 87h, 47h,0C7h, 27h,0A7h, 67h,0E7h
   db 17h, 97h, 57h,0D7h, 37h,0B7h, 77h,0F7h
   db 0Fh, 8Fh, 4Fh,0CFh, 2Fh,0AFh, 6Fh,0EFh
   db 1Fh, 9Fh, 5Fh,0DFh, 3Fh,0BFh, 7Fh,0FFh

C:\gfortran\clf\reverse>type driver.f90
program main
   use ISO_C_BINDING
   implicit none
   interface
      function reverse_SSSE3(source, dest, size) &
            bind(C,name='reverse_SSSE3')
         use ISO_C_BINDING
         implicit none
         integer(C_INT64_T), value :: size
         integer(C_INT8_T), intent(in) :: source(16*size)
         integer(C_INT8_T), intent(out) :: dest(16*size)
         integer(C_INT64_T) reverse_SSSE3
      end function reverse_SSSE3
      function reverse_SSE2(source, dest, size) &
            bind(C,name='reverse_SSE2')
         use ISO_C_BINDING
         implicit none
         integer(C_INT64_T), value :: size
         integer(C_INT8_T), intent(in) :: source(16*size)
         integer(C_INT8_T), intent(out) :: dest(16*size)
         integer(C_INT64_T) reverse_SSE2
      end function reverse_SSE2
      function reverse_xlatb(source, dest, size) &
            bind(C,name='reverse_xlatb')
         use ISO_C_BINDING
         implicit none
         integer(C_INT64_T), value :: size
         integer(C_INT8_T), intent(in) :: source(16*size)
         integer(C_INT8_T), intent(out) :: dest(16*size)
         integer(C_INT64_T) reverse_xlatb
      end function reverse_xlatb
      function reverse_bytes(source, dest, size) &
            bind(C,name='reverse_bytes')
         use ISO_C_BINDING
         implicit none
         integer(C_INT64_T), value :: size
         integer(C_INT8_T), intent(in) :: source(16*size)
         integer(C_INT8_T), intent(out) :: dest(16*size)
         integer(C_INT64_T) reverse_bytes
      end function reverse_bytes
      function reverse_store(source, dest, size) &
            bind(C,name='reverse_store')
         use ISO_C_BINDING
         implicit none
         integer(C_INT64_T), value :: size
         integer(C_INT8_T), intent(in) :: source(16*size)
         integer(C_INT8_T), intent(out) :: dest(16*size)
         integer(C_INT64_T) reverse_store
      end function reverse_store
      function rev(x)
         use ISO_C_BINDING
         implicit none
         integer(C_INT8_T), value :: x
         integer(C_INT8_T) rev
      end function rev
   end interface
   integer(C_INT64_T), parameter :: blocksize = 4096
   integer(C_INT8_T) buff(blocksize,4)
   integer i
   integer(C_INT64_T) time

   do i = 1, blocksize
      buff(i,1) = i-1
      buff(i,2) = rev(buff(i,1))
   end do
   write(*,'(a,i0,a)') 'Testing SSSE3, 2*',blocksize,' bytes'
   do i = 1, 10
      buff(:,3) = buff(:,1)
      time = reverse_SSSE3(buff(1,3), buff(1,4), blocksize/16)
      if(any(buff(:,4) /= buff(:,2))) then
         write(*,'(a)') 'Algorithm failed'
      else
         buff(:,3) = buff(:,2)
         time = time+reverse_SSSE3(buff(1,3), buff(1,4), blocksize/16)
         if(any(buff(:,4) /= buff(:,1))) then
            write(*,'(a)') 'Algorithm failed'
         else
            write(*,'(a,i0)') 'time = ', time
         end if
      end if
   end do
   write(*,'(/a,i0,a)') 'Testing SSE2, 2*',blocksize,' bytes'
   do i = 1, 10
      buff(:,3) = buff(:,1)
      time = reverse_SSE2(buff(1,3), buff(1,4), blocksize/16)
      if(any(buff(:,4) /= buff(:,2))) then
         write(*,'(a)') 'Algorithm failed'
      else
         buff(:,3) = buff(:,2)
         time = time+reverse_SSE2(buff(1,3), buff(1,4), blocksize/16)
         if(any(buff(:,4) /= buff(:,1))) then
            write(*,'(a)') 'Algorithm failed'
         else
            write(*,'(a,i0)') 'time = ', time
         end if
      end if
   end do
   write(*,'(/a,i0,a)') 'Testing xlatb, 2*',blocksize,' bytes'
   do i = 1, 10
      buff(:,3) = buff(:,1)
      time = reverse_xlatb(buff(1,3), buff(1,4), blocksize/16)
      if(any(buff(:,4) /= buff(:,2))) then
         write(*,'(a)') 'Algorithm failed'
      else
         buff(:,3) = buff(:,2)
         time = time+reverse_xlatb(buff(1,3), buff(1,4), blocksize/16)
         if(any(buff(:,4) /= buff(:,1))) then
            write(*,'(a)') 'Algorithm failed'
         else
            write(*,'(a,i0)') 'time = ', time
         end if
      end if
   end do
   write(*,'(/a,i0,a)') 'Testing bytes, 2*',blocksize,' bytes'
   do i = 1, 10
      buff(:,3) = buff(:,1)
      time = reverse_bytes(buff(1,3), buff(1,4), blocksize/16)
      if(any(buff(:,4) /= buff(:,2))) then
         write(*,'(a)') 'Algorithm failed'
      else
         buff(:,3) = buff(:,2)
         time = time+reverse_bytes(buff(1,3), buff(1,4), blocksize/16)
         if(any(buff(:,4) /= buff(:,1))) then
            write(*,'(a)') 'Algorithm failed'
         else
            write(*,'(a,i0)') 'time = ', time
         end if
      end if
   end do
   write(*,'(/a,i0,a)') 'Testing store, 2*',blocksize,' bytes'
   do i = 1, 10
      buff(:,3) = buff(:,1)
      time = reverse_store(buff(1,3), buff(1,4), blocksize/16)
      if(any(buff(:,4) /= buff(:,2))) then
         write(*,'(a)') 'Algorithm failed'
      else
         buff(:,3) = buff(:,2)
         time = time+reverse_store(buff(1,3), buff(1,4), blocksize/16)
         if(any(buff(:,4) /= buff(:,1))) then
            write(*,'(a)') 'Algorithm failed'
         else
            write(*,'(a,i0)') 'time = ', time
         end if
      end if
   end do
end program main

function rev(x)
   use ISO_C_BINDING
   implicit none
   integer(C_INT8_T), value :: x
   integer(C_INT8_T) rev
   integer i
   rev = 0
   do i = 0, bit_size(x)-1
      rev = ishft(rev, 1)
      rev = IOR(rev,IAND(x,1_C_INT8_T))
      x = ishft(x,-1)
   end do
end function rev

C:\gfortran\clf\reverse>fasm reverse.asm
flat assembler  version 1.67.18  (1297254 kilobytes memory)
3 passes, 1449 bytes.

C:\gfortran\clf\reverse>gfortran driver.f90 reverse.obj -oreverse

C:\gfortran\clf\reverse>reverse
Testing SSSE3, 2*4096 bytes
time = 6680
time = 3350
time = 3340
time = 3400
time = 3390
time = 3390
time = 3390
time = 3390
time = 3390
time = 3400

Testing SSE2, 2*4096 bytes
time = 4530
time = 4770
time = 4770
time = 4760
time = 4760
time = 4760
time = 4770
time = 4760
time = 4760
time = 4760

Testing xlatb, 2*4096 bytes
time = 33100
time = 32950
time = 32950
time = 32970
time = 32970
time = 32970
time = 32960
time = 32970
time = 32960
time = 32970

Testing bytes, 2*4096 bytes
time = 15680
time = 15620
time = 15610
time = 15610
time = 15610
time = 15610
time = 15620
time = 15620
time = 15620
time = 15620

Testing store, 2*4096 bytes
time = 13180
time = 12490
time = 12510
time = 12490
time = 12510
time = 12510
time = 12510
time = 12510
time = 12510
time = 12510

So reverse_SSSE3 takes 3390/(2*4096) = 0.41 clocks/byte,
reverse_SSE2 takes 4760/(2*4096) = 0.58 clocks/byte,
reverse_xlatb takes 32960/(2*4096) = 4.02 clocks/byte,
reverse_bytes takes 15620/(2*4096) = 1.91 clocks/byte, and
reverse_store takes 12510/(2*4096) = 1.53 clocks/byte.

Even if the speed of the LUT method could be doubled by going to a
131072-byte table, which it can't, it would still be slower than the
SSE2 divide-and-conquer method on my Core 2 Duo. Rod Pemberton's xlatb
method is about an order of magnitude slower than the SSSE3 PSHUFB
method.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
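(The three shift-and-mask stages of reverse_SSE2 above are the standard
divide-and-conquer bit swap. The same idea in scalar C, processing eight
bytes packed in a 64-bit word; a sketch for illustration, not the posted
routine:)

#include <stdint.h>

/* Reverse the bits within each byte of x: swap nibbles,
   then bit pairs, then adjacent bits. */
static uint64_t reverse_bits_8bytes(uint64_t x)
{
    x = ((x & 0xF0F0F0F0F0F0F0F0ull) >> 4) | ((x & 0x0F0F0F0F0F0F0F0Full) << 4);
    x = ((x & 0xCCCCCCCCCCCCCCCCull) >> 2) | ((x & 0x3333333333333333ull) << 2);
    x = ((x & 0xAAAAAAAAAAAAAAAAull) >> 1) | ((x & 0x5555555555555555ull) << 1);
    return x;
}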
In article <f79c61e2-acf7-45a3-b27e-a61827d5029e@g25g2000yqn.googlegroups.com>,
James Harris  <james.harris.1@googlemail.com> wrote:
>On 3 Nov, 05:51, Tim Wescott <t...@seemywebsite.com> wrote:
>> On 11/02/2010 06:00 PM, Vladimir Vassilevsky wrote:
>>
>> > Tim Wescott wrote:
>> >> Doing some work on a PC. Customer says it's not real time, but real
>> >> fast is still better than real slow. I need to bit-reverse a slew of
>> >> bytes, so the faster the better.
>>
>> > What do you do? Mirroring some bitmaps?
>>
>> It's for a bunch of data that's recorded off of a 1-bit ADC -- they're
>> changing their frequency plan and need things converted, and by the way
>> they're changing from lsb first to msb first.
>
>On x86-32 your first idea - the table lookup - should compile to
>something like
>
> and eax, 0xff
> mov al, [bit_reverse_table + eax]
Come on!

   MOV BX, bit_reverse_table
   XLAT

Intel didn't spend a whole single byte instruction for nothing.
(That is .4% of the instruction space!)
>
>James
Groetjes Albert

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
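(Either way, the 256-byte bit_reverse_table has to come from somewhere.
A C sketch of building and using it once at startup; the names here are
hypothetical, not from either post:)

#include <stdint.h>

uint8_t bit_reverse_table[256];

/* Fill the table once: entry i is i with its 8 bits reversed. */
void init_bit_reverse_table(void)
{
    for (int i = 0; i < 256; i++) {
        uint8_t r = 0;
        for (int bit = 0; bit < 8; bit++)
            if (i & (1 << bit))
                r |= (uint8_t)(0x80 >> bit);
        bit_reverse_table[i] = r;
    }
}

/* The lookup itself is then a single load, whether spelled MOV or XLAT. */
static inline uint8_t reverse_byte(uint8_t b)
{
    return bit_reverse_table[b];
}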
Albert van der Horst answered James Harris:
...
>> and eax, 0xff
>> mov al, [bit_reverse_table + eax]
> Come on!
>
> MOV BX, bit_reverse_table
> XLAT
> Intel didn't spend a whole single byte instruction for nothing.
> (That is .4% of the instruction space!)
You're right in terms of code size. But the OP asked for speed ...
XLAT is an awfully slow relic from 8080 days (vector path, latency = 5).

__ wolfgang
On Nov 7, 6:00 am, "James Van Buskirk" <not_va...@comcast.net> wrote:
> "James Van Buskirk" <not_va...@comcast.net> wrote in messagenews:ib18lc$f=
m4$1@news.eternal-september.org...
>
...
> So reverse_SSSE3 takes 3390/(2*4096) = 0.41 clocks/byte,
> reverse_SSE2 takes 4760/(2*4096) = 0.58 clocks/byte,
> reverse_xlatb takes 32960/(2*4096) = 4.02 clocks/byte,
> reverse_bytes takes 15620/(2*4096) = 1.91 clocks/byte, and
> reverse_store takes 12510/(2*4096) = 1.53 clocks/byte.
>
> Even if the speed of the LUT method could be doubled by going
> to a 131072-byte table, which it can't, it would still be slower
> than the SSE2 divide-and-conquer method on my Core 2 Duo. Rod
> Pemberton's xlatb method is about an order of magnitude slower
> than the SSSE3 PSHUFB method.
This is not comparing like with like if the conditions of the tests are
set up to favour SSE. I agree there are speed advantages to SSE code, but
for a real comparison, while the translation tables can be pre-cached (to
simulate prior iterations of the loop), values to be converted should be
picked up from memory. The OP didn't specify that bytes to be converted
were neatly adjacent and in groups of 16. A better test would be to pick
them up individually from, say, 53 bytes apart.

Are you sure Rod specified xlat? AFAICS he suggested a 256-byte
translation table and move instructions. He used bswap so he could
effectively operate on the upper half of EAX. He could have used
rol eax, 16 and ror eax, 16 instead.

I ran a few tests of xlat on a Pentium M. It was slowish at 4 cycles per
use when the CPU had to wait for the result of the translation (which
goes back into AL) before it could initiate the next xlat. Timing of
xlat was similar to running the operation explicitly as in:

   mov al, [table + eax]   ;Avg of 4 cycles per use

However, it was faster to repeat the following.

   mov bl, [table + eax]   ;Fairly consistent 1 cycle per use

In other words, as long as xlat didn't have a dependency chain it was
fast. To confirm I tried

   mov eax, <something>
   xlat

These ran repeatedly at 1 cycle for each pair. So xlat is *not*
inherently slow on the Pentium M. This is possibly true for any P6
machine. I would normally still use the explicit instruction in code,
though, because: xlat may be slower on older machines such as the
Pentium 1, explicit allows the dependency to be broken, and explicit
makes the code clearer.

James
In comp.dsp James Harris <james.harris.1@googlemail.com> wrote:
(snip)

> Are you sure Rod specified xlat? AFAICS he suggested a 256-byte
> translation table and move instructions. He used bswap so he could
> effectively operate on the upper half of EAX. He could have used
> rol eax, 16 and ror eax, 16 instead.
> I ran a few tests of xlat on a Pentium M. It was slowish at 4 cycles
> per use when the CPU had to wait for the result of the translation
> (which goes back into AL) before it could initiate the next xlat.
> Timing of xlat was similar to running the operation explicitly as in:
(snip)

I remember stories from when the PPro came out that it was slower on
code mixing register sizes. That was most noticeable at the time,
running 16 bit MS-DOS code.
> These ran repeatedly at 1 cycle for each pair. So xlat is *not*
> inherently slow on the Pentium M. This is possibly true for any P6
> machine. I would normally still use the explicit instruction in code,
> though, because: xlat may be slower on older machines such as the
> Pentium 1, explicit allows the dependency to be broken, and explicit
> makes the code clearer.
I don't remember the rules now for what is fast and what isn't, but much
of the P6 core was kept in later processors. It seems that Intel was
expecting full conversion to 32 bit code by the time the PPro became
popular.

-- glen
"James Harris" <james.harris.1@googlemail.com> wrote in message 
news:e28cae45-52d2-4e1e-a361-01eb61f7731f@q18g2000vbm.googlegroups.com...

> On Nov 7, 6:00 am, "James Van Buskirk" <not_va...@comcast.net> wrote:
> > So reverse_SSSE3 takes 3390/(2*4096) = 0.41 clocks/byte,
> > reverse_SSE2 takes 4760/(2*4096) = 0.58 clocks/byte,
> > reverse_xlatb takes 32960/(2*4096) = 4.02 clocks/byte,
> > reverse_bytes takes 15620/(2*4096) = 1.91 clocks/byte, and
> > reverse_store takes 12510/(2*4096) = 1.53 clocks/byte.
> > Even if the speed of the LUT method could be doubled by going
> > to a 131072-byte table, which it can't, it would still be slower
> > than the SSE2 divide-and-conquer method on my Core 2 Duo. Rod
> > Pemberton's xlatb method is about an order of magnitude slower
> > than the SSSE3 PSHUFB method.
> This is not comparing like with like if the conditions of the tests
> are set up to favour SSE. I agree there are speed advantages to SSE
> code but for a real comparison, while the translation tables can be
> pre-cached (to simulate prior iterations of the loop), values to be
> converted should be picked up from memory. The OP didn't specify that
> bytes to be converted were neatly adjacent and in groups of 16. A
> better test would be to pick them up individually from, say, 53 bytes
> apart.
> Are you sure Rod specified xlat? AFAICS he suggested a 256-byte
> translation table and move instructions. He used bswap so he could
> effectively operate on the upper half of EAX. He could have used
> rol eax, 16 and ror eax, 16 instead.
Go back and look at his opcounts. He had pseudocode that addressed off of
a byte register, stored the result back in the same byte register, and
did so in a single instruction. If he wants to prove otherwise he should
man up and rig his own benchmarks to do so. Until then, as author of the
test bench I get to set the conditions of the problem.

Thinking about it since then, it seems to me that it's more realistic,
since the problem arose in comp.arch.embedded, to assume that the data is
coming in maybe one byte at a time and being acquired in interrupt-driven
code, perhaps via:

   in al, dx

so that you only have the opportunity to process one byte at a time and
none of the code or data that the code will use is in cache. In that case
any LUT method will require loading an extra cache line, making it slow
pretty much no matter what.

In the following, reverse_small is the divide-and-conquer method applied
to a single byte, and reverse_mul is the multiplication method from:

http://www-graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith64Bits

where the scatter mask and bit select mask have been shifted right by one
to avoid a 64-bit data load. reverse_loop is the very low code length
method that shifts one bit at a time, reverse_loop1 is the same method
but shifting the bit through a register instead of CF, reverse_SSE2a is
the same method again but processing 16 bytes at once (I know...), and
reverse_mul1 is the multiplication method revisited with the gather mask
also shifted left by two to avoid another 64-bit load.

New code (see
http://groups.google.com/group/comp.dsp/msg/ae7846176eda39fc?hl=en
for old code and driver program):

public reverse_small   ; 29 bytes to bit-reverse al
reverse_small:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   shl rdx, 32
   lea r9, [rdx+rax]      ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea rbx, [r10+rdx]
   neg rdx
   jns loop6_end
loop6:
   mov al, [rcx+rdx]
   movzx esi, al
   and eax, 55h
   xor esi, eax
   shr esi, 1
   lea eax, [2*rax+rsi]
   mov esi, eax
   and eax, 0ffffffcch
   xor esi, eax
   shr eax, 2
   lea eax, [4*rsi+rax]
   ror al, 4
   mov [rbx+rdx], al
   add rdx, 1
   js loop6
loop6_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_mul   ; 38 bytes to bit-reverse al (high bits of rax precleared)
reverse_mul:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   shl rdx, 32
   lea r9, [rdx+rax]      ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea rbx, [r10+rdx]
   neg rdx
   jns loop7_end
loop7:
   movzx eax, byte [rcx+rdx]
   imul rax, 40100401h
   mov rsi, 442211088h
   and rax, rsi
   mov rsi, 202020202h
   imul rax, rsi
   shr rax, 32
   mov [rbx+rdx], al
   add rdx, 1
   js loop7
loop7_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_loop   ; 11 bytes to bit-reverse al
reverse_loop:
   mov r10, rdx
   rdtsc
   push rbx
   shl rdx, 32
   lea r9, [rdx+rax]      ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   lea r10, [rcx+rdx]
   lea r8, [r10+rdx]
   neg rdx
   jns loop8_end
loop8:
   mov al, [r10+rdx]
   mov cl, 8
loop8_inner:
   shr al, 1
   adc bl, bl
   sub cl, 1
   jnz loop8_inner
   mov [r8+rdx], bl
   add rdx, 1
   js loop8
loop8_end:
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_loop1   ; 17 bytes to bit-reverse al to bl (high bits of rax precleared)
reverse_loop1:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   shl rdx, 32
   lea r9, [rdx+rax]      ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   lea r10, [rcx+rdx]
   lea r8, [r10+rdx]
   neg rdx
   jns loop9_end
loop9:
   movzx eax, byte [r10+rdx]
   mov cl, 8
loop9_inner:
   mov esi, eax
   shr eax, 1
   and esi, 1h
   lea ebx, [2*rbx+rsi]
   sub cl, 1
   jnz loop9_inner
   mov [r8+rdx], bl
   add rdx, 1
   js loop9
loop9_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_SSE2a
reverse_SSE2a:
   mov r10, rdx
   rdtsc
   shl rdx, 32
   lea r9, [rdx+rax]      ; r9 has initial time stamp
   push rbx
   mov rdx, r8
   shl rdx, 4
   lea rbx, [rcx+rdx]
   lea rax, [r10+rdx]
   neg rdx
   jns loop10_end
loop10:
   movdqa xmm0, [rbx+rdx]
   mov cl, 8
loop10_inner:
   movdqa xmm2, [magic6]
   pand xmm2, xmm0
   psrld xmm0, 1
   paddb xmm1, xmm1
   por xmm1, xmm2
   sub cl, 1
   jnz loop10_inner
   movdqa [rax+rdx], xmm1
   add rdx, 16
   js loop10
loop10_end:
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_mul1   ; 34 bytes to bit-reverse al (high bits of rax precleared)
reverse_mul1:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   shl rdx, 32
   lea r9, [rdx+rax]      ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea rbx, [r10+rdx]
   neg rdx
   jns loop11_end
loop11:
   movzx eax, byte [rcx+rdx]
   imul rsi, rax, 40100401h
   mov rax, 442211088h
   and rsi, rax
   sub rax, 3e1d0c84h
   imul rax, rsi
   shr rax, 33
   mov [rbx+rdx], al
   add rdx, 1
   js loop11
loop11_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

section 'DATA' data readable writeable align 16

align 16
label magic6 dqword
   db 1h, 1h, 1h, 1h, 1h, 1h, 1h, 1h
   db 1h, 1h, 1h, 1h, 1h, 1h, 1h, 1h

with results:

C:\gfortran\clf\reverse>reverse
Testing SSSE3, 2*4096 bytes
time = 6800
time = 3390
time = 3380
time = 3360
time = 3340
time = 3340
time = 3340
time = 3360
time = 3390
time = 3360

Testing SSE2, 2*4096 bytes
time = 4540
time = 4610
time = 4780
time = 4800
time = 4790
time = 4770
time = 4790
time = 4770
time = 4780
time = 4760

Testing xlatb, 2*4096 bytes
time = 33020
time = 33020
time = 32980
time = 32980
time = 33010
time = 32990
time = 32990
time = 32980
time = 32980
time = 32980

Testing bytes, 2*4096 bytes
time = 15640
time = 15690
time = 15690
time = 15650
time = 15680
time = 15660
time = 15700
time = 15670
time = 15660
time = 15710

Testing store, 2*4096 bytes
time = 20990
time = 21280
time = 12650
time = 12640
time = 12620
time = 12680
time = 12590
time = 12660
time = 12770
time = 12640

Testing small, 2*4096 bytes
time = 78320
time = 78330
time = 78290
time = 78290
time = 78320
time = 78270
time = 78230
time = 78310
time = 78230
time = 78290

Testing mul, 2*4096 bytes
time = 55650
time = 33360
time = 33310
time = 33320
time = 33340
time = 33350
time = 33290
time = 33290
time = 33330
time = 33390

Testing loop, 2*4096 bytes
time = 196910
time = 196810
time = 196850
time = 196820
time = 196820
time = 209150
time = 196910
time = 196850
time = 196840
time = 196800

Testing loop1, 2*4096 bytes
time = 219790
time = 219790
time = 219580
time = 219770
time = 366270
time = 219620
time = 219760
time = 219790
time = 219580
time = 219720

Testing SSE2a, 2*4096 bytes
time = 13140
time = 13130
time = 13130
time = 13080
time = 13020
time = 13020
time = 13020
time = 13020
time = 13050
time = 13130

Testing mul1, 2*4096 bytes
time = 33210
time = 33200
time = 33070
time = 33050
time = 33050
time = 33080
time = 33060
time = 33030
time = 33060
time = 33060

So reverse_small uses 29 bytes to process bytes in 78300/8192 = 9.56 clocks/byte,
reverse_mul uses 38 bytes for 33300/8192 = 4.06 clocks/byte,
reverse_loop uses 11 bytes for 197000/8192 = 24.04 clocks/byte,
reverse_loop1 uses 17 bytes for 220000/8192 = 26.86 clocks/byte, and
reverse_mul1 uses 34 bytes for 33100/8192 = 4.04 clocks/byte.

reverse_SSE2a really doesn't belong with this group because it processes
multiple bytes and uses memory, but it takes 13100/8192 = 1.60 clocks/byte.
That reverse_mul1 algorithm looks pretty hot, using only about a half cache line of code and no data and crunching out the reversal in 4 clocks (although this is reciprocal throughput, not latency) on my Core 2 Duo.
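(For comparison, the multiply method from the Bit Twiddling Hacks page
cited above reads like this in C, without the mask shifts James applies
to dodge the 64-bit immediate loads; a sketch with a hypothetical name:)

#include <stdint.h>

/* Bit-reverse one byte with two 64-bit multiplies and a mask, no table:
   the first multiply spreads copies of the byte, the AND selects the
   bits in reversed order, and the second multiply gathers them into
   the result byte. */
static uint8_t reverse_byte_mul(uint8_t b)
{
    return (uint8_t)((((uint64_t)b * 0x80200802ull) & 0x0884422110ull)
                     * 0x0101010101ull >> 32);
}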
> I ran a few tests of xlat on a Pentium M. It was slowish at 4 cycles
> per use when the CPU had to wait for the result of the translation
> (which goes back into AL) before it could initiate the next xlat.
> Timing of xlat was similar to running the operation explicitly as in:
> mov al, [table + eax] ;Avg of 4 cycles per use
> However, it was faster to repeat the following.
> mov bl, [table + eax] ;Fairly consistent 1 cycle per use
> In other words, as long as xlat didn't have a dependency chain it was
> fast. To confirm I tried
> mov eax, <something>
> xlat
> These ran repeatedly at 1 cycle for each pair. So xlat is *not*
> inherently slow on the Pentium M. This is possibly true for any P6
> machine. I would normally still use the explicit instruction in code,
> though, because: xlat may be slower on older machines such as the
> Pentium 1, explicit allows the dependency to be broken and explicit
> makes the code clearer.
You can also look in http://www.agner.org/optimize/instruction_tables.pdf
to see what timings Agner Fog has measured for the various processors.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
On 11/08/2010 02:42 PM, Albert van der Horst wrote:

> MOV BX, bit_reverse_table
> XLAT
>
> Intel didn't spend a whole single byte instruction for nothing.
> (That is .4% of the instruction space
James, wait wait..

Intel invented that instruction back in the 8086 days.. That's a
loooong time ago.

It was a *killer* instruction back then. We didn't have much (if any)
instruction cache, so instruction size had a direct effect on
performance.

Nowadays the instruction still exists, but no modern compiler that I'm
aware of uses it. It exists for backward compatibility. That instruction
is a candidate for slow microcode.

Cheers,
  Nils
I don't understand something about this thread: why not use
FFTW's implementation and be done with it? It may be fun to
reinvent the wheel, but it's not profitable.

--Randy


Nils <n.pipenbrinck@cubic.org> writes:

> On 11/08/2010 02:42 PM, Albert van der Horst wrote:
>
>> MOV BX, bit_reverse_table
>> XLAT
>>
>> Intel didn't spend a whole single byte instruction for nothing.
>> (That is .4% of the instruction space
>
> James, wait wait..
>
> Intel invented that instruction back in the 8086 days.. That's a
> loooong time ago.
>
> It was a *killer* instruction back then. We didn't have much (if any)
> instruction cache, so instruction size had a direct effect on
> performance.
>
> Nowadays the instruction still exists, but no modern compiler that I'm
> aware of uses it. It exists for backward compatibility. That instruction
> is a candidate for slow microcode.
>
> Cheers,
>   Nils
--
Randy Yates                      % "How's life on earth?
Digital Signal Labs              %  ... What is it worth?"
mailto://yates@ieee.org          % 'Mission (A World Record)',
http://www.digitalsignallabs.com % *A New World Record*, ELO
On Nov 11, 7:19 pm, Nils <n.pipenbri...@cubic.org> wrote:
> On 11/08/2010 02:42 PM, Albert van der Horst wrote:
>
> >    MOV BX, bit_reverse_table
> >    XLAT
>
> > Intel didn't spend a whole single byte instruction for nothing.
> > (That is .4% of the instruction space
>
> James, wait wait..
What for? Did you intend to address the poster, Albert, who suggested xlat?
> Intel invented that instruction back in the 8086 days.. That's a
> loooong time ago.
Albert's reference to BX suggests he may be still there :-)
> It was a *killer* instruction back then. We didn't have much (if any)
> instruction cache, so instruction size had a direct effect on
> performance.
>
> Nowadays the instruction still exists, but no modern compiler that I'm
> aware of uses it. It exists for backward compatibility. That instruction
> is a candidate for slow microcode.
Based on tests I made some relevant comments on this in my prior post on this thread (q.v.). James