Reply by Don Y November 29, 2011
On 11/28/2011 10:47 AM, Don Y wrote:
> On 10/25/2011 3:06 AM, Noob wrote:
>> Then, all that's left to do is to add the final carry bit, and
>> "fold" r0 down to 16 bits.
>
> .... assuming, of course, that you are operating over
> a buffer having a size that is an *exact* multiple
> of 4.
Grrrr... s.b. "4 octets"! :<
Reply by Don Y November 28, 2011
On 10/25/2011 3:06 AM, Noob wrote:

> I'm working with a 450-MHz SH-4 core. The architecture reminds me
> of the original Pentium in several ways, in that it is dual-pipeline
> with some restrictions on which types of instructions can be paired.
>
> When the system is receiving a 100-Mbps TCP stream (thus, at the same
> time also sending roughly 2 Mbps of ACKs) approximately 13-15% of the
> CPU is busy computing the internet checksum (C code compiled with
> gcc-4.3.4 -Os)
>
> This algorithm has several nice properties, e.g. it can be computed
> ignoring endian-ness, and if one is careful with the carry, it can be
> computed 32-bits (or 64-bits on a 64-bitter) at a time.
>
> So, I was trying to write some Q&D assembly to see if I could do better
> than gcc, using addc (add with carry).
>
> The gist of my pseudo-code was
>
> r4 = address of the data to checksum
> r5 = count of words in the data (assume > 0)
> t is the carry bit (don't know why it's called "t")
>
>   clrt            ;  t <- 0
>   mov #0, r0      ; r0 <- 0
> loop:
>   mov.l @r4+, r1  ; r1 <- [r4] and post-increment r4
>   addc  r1, r0    ; r0 <- r0 + r1 + t and t <- carry
>   if --r5 > 0 goto loop
>
> Then, all that's left to do is to add the final carry bit, and
> "fold" r0 down to 16 bits.
... assuming, of course, that you are operating over a buffer having a size that is an *exact* multiple of 4.
> The major problem with my pseudo-implementation is the looping part,
> because conditional branching is based on the carry bit, but I need
> the carry bit preserved from one iteration to the next :-(
Are you sure the decrement operation touches the carry? Often, CY is
unaffected and only Z/NZ is signaled instead (by contrast with
SUBTRACTING 1, for when you *really* want to affect CY).
> Typical looping code would be
>   dt r5           ; decrement r5 and set t if r5 == 0
>   bf loop         ; branch to loop if t == 0
>
> I don't see a way out of this jam, do you?
>
> I'm stuck with something along these lines
>
>   clrt            ;  t <- 0
>   mov #0, r0      ; r0 <- 0
> loop:
>   mov.l @r4+, r1  ; r1 <- [r4] and post-increment r4
>   addc  r1, r0    ; r0 <- r0 + r1 + 0 and t <- carry
>   addc  #0, r0    ; r0 <- r0 + t and t <- 0
>   dt r5           ; decrement r5 and set t if r5 == 0
>   bf loop         ; branch to loop if t == 0
>
> I could unroll the loop a few times (most of the packets are
> 1480-bytes long) to save as many "premature" carry adds, but
> I don't really like this solution.
Why? Are you that pressed for space? Checksums are a *huge* bit of
overhead *if* (!) you need to support them! Think hard on whether
you want to skimp on space here instead of someplace less critical
(i.e., that isn't being executed as frequently)

Depending on the cache available in the processor, I typically
unroll the loop into different sized pieces and process N, 2N, 4N,
... "units" at a time (this also takes care of the *actual* size of
the dataset instead of requiring it to be an exact multiple of 32
bits)

E.g., (incredibly bogus pseudo-code):

  while (remaining >= 4N) {
      ADD *ptr++      // 1
      ADC *ptr++      // 2
      ...
      ADC *ptr++      // 4N
      ADC 0           // fold in final CY
      remaining -= 4N
  }

  while (remaining >= 2N) {
      ADD *ptr++      // 1
      ADC *ptr++      // 2
      ...
      ADC *ptr++      // 2N
      ADC 0           // fold in final CY
      remaining -= 2N
  }

  while (remaining > N) {
      ADD *ptr++      // 1
      ADC *ptr++      // 2
      ...
      ADC *ptr++      // N
      ADC 0           // fold in final CY
      remaining -= N
  }

If the CPU has alignment restrictions, you need to handle those up
front to avoid faults, etc. If you exercise some care in how you
fabricate your mbufs, you can sometimes skip this.

If you really want to cut down on TEXT, you can represent "N" as
"chunksize" and halve it each time a "while" fails (then use a
switch to skip over a certain number of ADC's based on the value of
chunksize). Something like:

  chunksize = 4N
  while (chunksize > 1) {
      ADD *ptr
      while (remaining > chunksize) {
          switch (chunksize) {
          case 4N:
              ADC *ptr++   // 4N
              ...
              ADC *ptr++   // 2N+1
              // fall thru
          case 2N:
              ADC *ptr++   // 2N
              ...
              ADC *ptr++   // N+1
              // fall thru
          case N:
              ADC *ptr++   // N
              ...
              ADC *ptr++   // 1
          }
          ADC 0            // fold in final CY
      }
      remaining -= chunksize
      chunksize /= 2
  }

[ignore the discrepancies in my counting, I'm just trying to
illustrate a general approach :> ]
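For reference, the same deferred-carry idea can be written in portable
C: sum the 16-bit words into a 32-bit accumulator (so carries simply
pile up in the high half instead of needing an add-with-carry after
every word), with the main loop unrolled 8 words at a time. This is
only a sketch; the function name is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the unrolled approach (illustrative name): 16-bit
 * big-endian words are summed into a 32-bit accumulator so carries
 * collect in the high half; the main loop handles 8 words (16
 * octets) per iteration, a word loop and odd-octet fixup handle the
 * tail, then the sum is folded to 16 bits and complemented. */
static uint16_t inet_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;

    while (len >= 16) {                  /* 8 words per iteration */
        sum += (uint32_t)(p[0]  << 8) | p[1];
        sum += (uint32_t)(p[2]  << 8) | p[3];
        sum += (uint32_t)(p[4]  << 8) | p[5];
        sum += (uint32_t)(p[6]  << 8) | p[7];
        sum += (uint32_t)(p[8]  << 8) | p[9];
        sum += (uint32_t)(p[10] << 8) | p[11];
        sum += (uint32_t)(p[12] << 8) | p[13];
        sum += (uint32_t)(p[14] << 8) | p[15];
        p += 16;
        len -= 16;
    }
    while (len >= 2) {                   /* leftover whole words */
        sum += (uint32_t)(p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len != 0)                        /* odd trailing octet */
        sum += (uint32_t)p[0] << 8;

    while (sum >> 16)                    /* fold carries into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

The tail handling also answers the "exact multiple of 4" caveat above:
only the unrolled body assumes full blocks.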
> Am I missing something obvious, or something not so obvious, > that would get me a nice, compact, and fast solution? :-)
Make sure you consider when and why you are computing the checksums!

Depending on your application and the facilities that your stack
provides, you might consider a more "undisciplined" API for all of
this. I.e., if you can *omit* the computation, you save even bigger!
[Layering can be A Bad Thing]

E.g., you might want to restrict the portion of the packet/buffer on
which you perform this computation. I.e., you may opt NOT to check
some of the payload (depends on your application) -- this is
becoming common in UPnP type applications, for example.

You *also* might want to make some of this data *visible* to your
application! E.g., if your application is going to checksum some
larger "transmission unit", then it could possibly benefit from the
computations that you are performing here. This is the flipside of
letting each encapsulation layer IN THE STACK benefit from
computations made in other layers (i.e., incrementally updating the
sums).

Alternatively, if your application is going to make some particular
change to the packet and then return/forward it, you may be able to
benefit from these previous computations.

Reconsider whether a strict TCP implementation is necessary or if
you can get what you want from other protocols -- even if you have
to add some bit of "acknowledgement" in the application layer.

Lastly, look at what you are going to *do* in the event of a failed
checksum -- if *all* you are EFFECTIVELY doing is updating some
counters/statistics, then REconsider whether you actually need to
invest this effort.
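On the "incrementally updating the sums" point: when a single 16-bit
field of a packet changes, the checksum can be patched rather than
recomputed over the whole packet. A minimal C sketch of that update
(essentially the HC' = ~(~HC + ~m + m') formula later standardized in
RFC 1624; the function name is illustrative):

```c
#include <stdint.h>

/* Patch a 16-bit one's-complement checksum when one 16-bit field
 * changes from old_word to new_word, without re-summing the packet.
 * Implements HC' = ~(~HC + ~m + m') with end-around carry folds. */
static uint16_t cksum_update(uint16_t cksum, uint16_t old_word,
                             uint16_t new_word)
{
    uint32_t sum = (uint16_t)~cksum;     /* ~HC */
    sum += (uint16_t)~old_word;          /* ~m  */
    sum += new_word;                     /* m'  */
    while (sum >> 16)                    /* fold carries */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```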
Reply by segher October 26, 2011
On Oct 25, 3:26 pm, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
> One alternative approach is to use shift & mask to split each 16-byte
> block into two halves and aggregate them separately, this will work for
> up to 128K packet sizes, which is well above what TCP will allow:
>
> next16:
>     movdqa xmm7,[esi]
>     lea esi,[esi+16]
>     movdqa xmm6,xmm2    ; 4 x 0x0000ffff masks
>     pand xmm6,xmm7
>     psrld xmm7,16
>     paddd xmm0,xmm6
>     paddd xmm0,xmm7
>     sub ecx,1
>     jnz next16
>
> Notice that this approach needs one additional SIMD instruction per
> 16-byte block.
Another way is to accumulate both the data, and separately the data
shifted right by some bits (say 16 in your case). Since you add at
most 65536 items, that second accumulator differs from the top 32
bits of a true 48-bit accumulator by less than 65536, so you can
reconstruct it. In pseudo-C:

  u32 acc_hi = 0;
  u32 acc_lo = 0;
  for (j = 0; j < N; j++) {
      acc_lo += data[j];
      acc_hi += data[j] >> 16;
  }
  acc_hi += ((acc_lo >> 16) - acc_hi) & 0xffff;
  // now acc_hi is the top 32 bits of the 48-bit accumulator, and acc_lo
  // is the low 32 bits

This saves two instructions from your SSE loop, but it's much better
on most RISC.
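Fleshed out as runnable C (sum48 is an illustrative name), the trick
builds a 48-bit total out of plain 32-bit adds, with no carry flag
needed anywhere:

```c
#include <stddef.h>
#include <stdint.h>

/* The two-accumulator trick as runnable C: sum up to 65536 32-bit
 * words into a 48-bit total using only 32-bit additions. acc_hi
 * accumulates the data shifted right 16; afterwards it is within
 * 65536 of the true top 32 bits, whose low 16 bits must equal
 * (acc_lo >> 16), so a masked correction reconstructs it exactly. */
static uint64_t sum48(const uint32_t *data, size_t n)
{
    uint32_t acc_hi = 0, acc_lo = 0;

    for (size_t j = 0; j < n; j++) {
        acc_lo += data[j];
        acc_hi += data[j] >> 16;
    }
    acc_hi += ((acc_lo >> 16) - acc_hi) & 0xffff;

    /* acc_hi now holds bits 16..47 of the true sum; acc_lo holds
     * bits 0..31 (its top half duplicates acc_hi's bottom half) */
    return ((uint64_t)acc_hi << 16) | (acc_lo & 0xffff);
}
```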
Reply by dp October 25, 2011
On Oct 25, 4:26 pm, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
> ...
> On any architecture without this particular bug/feature/glitch the best
> you can do is almost certainly to unroll by a small amount and then keep
> a separate carry accumulator like your r0 above for use at the end.
Here is how I did it on power 6-7 years ago:
http://tgi-sci.com/vpaex/cksm3456.sa (or cksm3456.txt for the native
code list). Here is the critical part:

 * deal with the bulk of the data using .l accesses
        lsr.l+    #2,d2,d3             number of .l words
        move.l-   d3,ctr               setup counter
        rlwinm-   d2,#0,#32-2,#31,d2   number of bytes after last .l
        neg.l-    d2,d1                residue in next segment
        rlwinm-   d1,#0,#32-2,#31,d1   must be 0-3
        beq       nulllngc             if none .l to process
        lea       (-4,a0),a0           to use preincrement
 set    loop,*
        adde.l-   =(4,a0),d0           collect checksum like a champion
        brc       ?unc.dbnz,loop       all of the .l
        lea       (4,a0),a0            will use postincrement now
 nulllngc

Never thought of this as something consuming much resources, I don't
think it does really. I have never measured it (not seeing an issue
there); in a loop that tight the only stall can come from data
dependencies. And they probably do kick in, but wrestling them would
be too major and with doubtful results for 1500 bytes
(setup/registers save/restore etc).

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
Reply by Terje Mathisen October 25, 2011
Arlet Ottens wrote:
> On 10/25/2011 12:06 PM, Noob wrote:
>
>> I'm stuck with something along these lines
>>
>>   clrt            ;  t <- 0
>>   mov #0, r0      ; r0 <- 0
>> loop:
>>   mov.l @r4+, r1  ; r1 <- [r4] and post-increment r4
>>   addc  r1, r0    ; r0 <- r0 + r1 + 0 and t <- carry
>>   addc  #0, r0    ; r0 <- r0 + t and t <- 0
>>   dt r5           ; decrement r5 and set t if r5 == 0
>>   bf loop         ; branch to loop if t == 0
>>
>> I could unroll the loop a few times (most of the packets are
>> 1480-bytes long) to save as many "premature" carry adds, but
>> I don't really like this solution.
>
> Why not? Unrolling this loop will give the best results. Even unrolling
> by a factor of 8 should make the carry overhead insignificant, and only
> increase code size by a small amount.
Arlet, you're absolutely correct here.

This particular algorithm matches up perfectly with the original x86
instruction set, where INC and DEC avoid updating the carry flag, so
that you can use DEC / JNZ or even JCXNZ to do loop control while
the carry flag survives from round to round.

On any architecture without this particular bug/feature/glitch the
best you can do is almost certainly to unroll by a small amount and
then keep a separate carry accumulator like your r0 above for use at
the end.

On machines without carry flags, or when working with SIMD
registers, you can do the same using explicit parallel compare
instructions, then aggregating the flag values resulting from those
compares.

I.e. the following loop should handle 32 bytes per iteration, using
10 SSE instructions, so 3+ bytes/cycle without having to dual-issue
the SIMD opcodes. (The main tweak is the need to bias all input
values, from [0..64K> to [-32K..32K>, so that signed compares will
work.)

next32:
    pxor xmm7,[esi]      ; Grab next 8 words, invert sign bits
    pxor xmm6,[esi+16]   ; and another 16
    lea esi,[esi+32]
    paddw xmm0,xmm7
    pcmpgtw xmm7,xmm0    ; Set flag if src > sum
    paddw xmm0,xmm6
    pcmpgtw xmm6,xmm0    ; ditto
    psubw xmm1,xmm7      ; Subtract -1 flag -> increment
    movdqa xmm7,xmm2     ; Load sign bits
    psubw xmm1,xmm6
    movdqa xmm6,xmm2
    sub ecx,1
    jnz next32

One alternative approach is to use shift & mask to split each
16-byte block into two halves and aggregate them separately; this
will work for up to 128K packet sizes, which is well above what TCP
will allow:

next16:
    movdqa xmm7,[esi]
    lea esi,[esi+16]
    movdqa xmm6,xmm2     ; 4 x 0x0000ffff masks
    pand xmm6,xmm7
    psrld xmm7,16
    paddd xmm0,xmm6
    paddd xmm0,xmm7
    sub ecx,1
    jnz next16

Notice that this approach needs one additional SIMD instruction per
16-byte block.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
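The scalar analogue of that compare trick can be sketched in portable
C (add_words is an illustrative name): each carry is detected with an
unsigned compare instead of a flag, counted separately, and folded
back in at the end, one's-complement style:

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of 32-bit words without touching a carry
 * flag: after each wrapping add, (sum < p[i]) is 1 exactly when the
 * add carried out, so carries are counted in a second accumulator
 * and folded back in at the end (end-around carry). */
static uint32_t add_words(const uint32_t *p, size_t n)
{
    uint32_t sum = 0, carries = 0;

    for (size_t i = 0; i < n; i++) {
        sum += p[i];                /* may wrap mod 2^32 */
        carries += (sum < p[i]);    /* 1 iff the add wrapped */
    }
    sum += carries;                 /* end-around fold */
    if (sum < carries)              /* folding can wrap once more */
        sum += 1;
    return sum;
}
```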
Reply by Arlet Ottens October 25, 2011
On 10/25/2011 12:06 PM, Noob wrote:

> I'm stuck with something along these lines
>
>   clrt            ;  t <- 0
>   mov #0, r0      ; r0 <- 0
> loop:
>   mov.l @r4+, r1  ; r1 <- [r4] and post-increment r4
>   addc  r1, r0    ; r0 <- r0 + r1 + 0 and t <- carry
>   addc  #0, r0    ; r0 <- r0 + t and t <- 0
>   dt r5           ; decrement r5 and set t if r5 == 0
>   bf loop         ; branch to loop if t == 0
>
> I could unroll the loop a few times (most of the packets are
> 1480-bytes long) to save as many "premature" carry adds, but
> I don't really like this solution.
Why not? Unrolling this loop will give the best results. Even unrolling
by a factor of 8 should make the carry overhead insignificant, and only
increase code size by a small amount.
Reply by Noob October 25, 2011
Hello,

I'm working with a 450-MHz SH-4 core. The architecture reminds me
of the original Pentium in several ways, in that it is dual-pipeline
with some restrictions on which types of instructions can be paired.

When the system is receiving a 100-Mbps TCP stream (thus, at the same
time also sending roughly 2 Mbps of ACKs) approximately 13-15% of the
CPU is busy computing the internet checksum (C code compiled with
gcc-4.3.4 -Os)

This algorithm has several nice properties, e.g. it can be computed
ignoring endian-ness, and if one is careful with the carry, it can be
computed 32-bits (or 64-bits on a 64-bitter) at a time.

So, I was trying to write some Q&D assembly to see if I could do better
than gcc, using addc (add with carry).

The gist of my pseudo-code was

r4 = address of the data to checksum
r5 = count of words in the data (assume > 0)
t is the carry bit (don't know why it's called "t")

  clrt            ;  t <- 0
  mov #0, r0      ; r0 <- 0
loop:
  mov.l @r4+, r1  ; r1 <- [r4] and post-increment r4
  addc  r1, r0    ; r0 <- r0 + r1 + t and t <- carry
  if --r5 > 0 goto loop

Then, all that's left to do is to add the final carry bit, and
"fold" r0 down to 16 bits.

The major problem with my pseudo-implementation is the looping part,
because conditional branching is based on the carry bit, but I need
the carry bit preserved from one iteration to the next :-(

Typical looping code would be
  dt r5           ; decrement r5 and set t if r5 == 0
  bf loop         ; branch to loop if t == 0

I don't see a way out of this jam, do you?

I'm stuck with something along these lines

  clrt            ;  t <- 0
  mov #0, r0      ; r0 <- 0
loop:
  mov.l @r4+, r1  ; r1 <- [r4] and post-increment r4
  addc  r1, r0    ; r0 <- r0 + r1 + 0 and t <- carry
  addc  #0, r0    ; r0 <- r0 + t and t <- 0
  dt r5           ; decrement r5 and set t if r5 == 0
  bf loop         ; branch to loop if t == 0

I could unroll the loop a few times (most of the packets are
1480-bytes long) to save as many "premature" carry adds, but
I don't really like this solution.

Am I missing something obvious, or something not so obvious,
that would get me a nice, compact, and fast solution? :-)

Regards.