Reply by Don Y November 29, 2011
On 11/28/2011 10:47 AM, Don Y wrote:
> On 10/25/2011 3:06 AM, Noob wrote:
>> Then, all that's left to do is to add the final carry bit, and
>> "fold" r0 down to 16 bits.
>
> .... assuming, of course, that you are operating over
> a buffer having a size that is an *exact* multiple
> of 4.
Grrrr... s.b. "4 octets"! :<
Reply by Don Y November 28, 2011
On 10/25/2011 3:06 AM, Noob wrote:

> I'm working with a 450-MHz SH-4 core. The architecture reminds me
> of the original Pentium in several ways, in that it is dual-pipeline
> with some restrictions on which types of instructions can be paired.
>
> When the system is receiving a 100-Mbps TCP stream (thus, at the same
> time also sending roughly 2 Mbps of ACKs) approximately 13-15% of the
> CPU is busy computing the internet checksum (C code compiled with
> gcc-4.3.4 -Os)
>
> This algorithm has several nice properties, e.g. it can be computed
> ignoring endian-ness, and if one is careful with the carry, it can be
> computed 32-bits (or 64-bits on a 64-bitter) at a time.
>
> So, I was trying to write some Q&D assembly to see if I could do better
> than gcc, using addc (add with carry).
>
> The gist of my pseudo-code was
>
> r4 = address of the data to checksum
> r5 = count of words in the data (assume > 0)
> t is the carry bit (don't know why it's called "t")
>
>   clrt            ;  t <- 0
>   mov #0, r0      ; r0 <- 0
> loop:
>   mov.l @r4+, r1  ; r1 <- [r4] and post-increment r4
>   addc  r1, r0    ; r0 <- r0 + r1 + t and t <- carry
>   if --r5 > 0 goto loop
>
> Then, all that's left to do is to add the final carry bit, and
> "fold" r0 down to 16 bits.
... assuming, of course, that you are operating over a buffer having a size that is an *exact* multiple of 4.
> The major problem with my pseudo-implementation is the looping part,
> because conditional branching is based on the carry bit, but I need
> the carry bit preserved from one iteration to the next :-(
Are you sure the decrement operation touches the carry? Often, CY is
unaffected and only Z/NZ is signaled instead (by contrast with
SUBTRACTING 1, for when you *really* want to affect CY).
> Typical looping code would be
>   dt r5           ; decrement r5 and set t if r5 == 0
>   bf loop         ; branch to loop if t == 0
>
> I don't see a way out of this jam, do you?
>
> I'm stuck with something along these lines
>
>   clrt            ;  t <- 0
>   mov #0, r0      ; r0 <- 0
> loop:
>   mov.l @r4+, r1  ; r1 <- [r4] and post-increment r4
>   addc  r1, r0    ; r0 <- r0 + r1 + 0 and t <- carry
>   addc  #0, r0    ; r0 <- r0 + t and t <- 0
>   dt r5           ; decrement r5 and set t if r5 == 0
>   bf loop         ; branch to loop if t == 0
>
> I could unroll the loop a few times (most of the packets are
> 1480-bytes long) to save as many "premature" carry adds, but
> I don't really like this solution.
Why? Are you that pressed for space? Checksums are a *huge* bit of
overhead *if* (!) you need to support them! Think hard on whether
you want to skimp on space here instead of someplace less critical
(i.e., that isn't being executed as frequently)

Depending on the cache available in the processor, I typically
unroll the loop into different sized pieces and process N, 2N, 4N,
... "units" at a time (this also takes care of the *actual* size of
the dataset instead of requiring it to be an exact multiple of 32
bits)

E.g., (incredibly bogus pseudo-code):

  while (remaining >= 4N) {
      ADD *ptr++      // 1
      ADC *ptr++      // 2
      ...
      ADC *ptr++      // 4N
      ADC 0           // fold in final CY
      remaining -= 4N
  }

  while (remaining >= 2N) {
      ADD *ptr++      // 1
      ADC *ptr++      // 2
      ...
      ADC *ptr++      // 2N
      ADC 0           // fold in final CY
      remaining -= 2N
  }

  while (remaining > N) {
      ADD *ptr++      // 1
      ADC *ptr++      // 2
      ...
      ADC *ptr++      // N
      ADC 0           // fold in final CY
      remaining -= N
  }

If the CPU has alignment restrictions, you need to handle those up
front to avoid faults, etc. If you exercise some care in how you
fabricate your mbufs, you can sometimes skip this.

If you really want to cut down on TEXT, you can represent "N" as
"chunksize" and halve it each time a "while" fails (then use a
switch to skip over a certain number of ADC's based on the value of
chunksize). Something like:

  chunksize = 4N
  while (chunksize > 1) {
      ADD *ptr
      while (remaining > chunksize) {
          switch (chunksize) {
          case 4N:
              ADC *ptr++   // 4N
              ...
              ADC *ptr++   // 2N+1
              // fall thru
          case 2N:
              ADC *ptr++   // 2N
              ...
              ADC *ptr++   // N+1
              // fall thru
          case N:
              ADC *ptr++   // N
              ...
              ADC *ptr++   // 1
          }
          ADC 0            // fold in final CY
      }
      remaining -= chunksize
      chunksize /= 2
  }

[ignore the discrepancies in my counting, I'm just trying to
illustrate a general approach :> ]
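For reference, the same deferred-carry idea can be written in portable
C: sum the 16-bit words into a 32-bit accumulator (so carries simply
pile up in the high half instead of needing an add-with-carry after
every word), with the main loop unrolled 8 words at a time. This is
only a sketch; the function name is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the unrolled approach (illustrative name): 16-bit
 * big-endian words are summed into a 32-bit accumulator so carries
 * collect in the high half; the main loop handles 8 words (16
 * octets) per iteration, a word loop and odd-octet fixup handle the
 * tail, then the sum is folded to 16 bits and complemented. */
static uint16_t inet_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;

    while (len >= 16) {                  /* 8 words per iteration */
        sum += (uint32_t)(p[0]  << 8) | p[1];
        sum += (uint32_t)(p[2]  << 8) | p[3];
        sum += (uint32_t)(p[4]  << 8) | p[5];
        sum += (uint32_t)(p[6]  << 8) | p[7];
        sum += (uint32_t)(p[8]  << 8) | p[9];
        sum += (uint32_t)(p[10] << 8) | p[11];
        sum += (uint32_t)(p[12] << 8) | p[13];
        sum += (uint32_t)(p[14] << 8) | p[15];
        p += 16;
        len -= 16;
    }
    while (len >= 2) {                   /* leftover whole words */
        sum += (uint32_t)(p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len != 0)                        /* odd trailing octet */
        sum += (uint32_t)p[0] << 8;

    while (sum >> 16)                    /* fold carries into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

The tail handling also answers the "exact multiple of 4" caveat above:
only the unrolled body assumes full blocks.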
> Am I missing something obvious, or something not so obvious, > that would get me a nice, compact, and fast solution? :-)
Make sure you consider when and why you are computing the checksums!

Depending on your application and the facilities that your stack
provides, you might consider a more "undisciplined" API for all of
this. I.e., if you can *omit* the computation, you save even bigger!
[Layering can be A Bad Thing]

E.g., you might want to restrict the portion of the packet/buffer on
which you perform this computation. I.e., you may opt NOT to check
some of the payload (depends on your application) -- this is
becoming common in UPnP type applications, for example.

You *also* might want to make some of this data *visible* to your
application! E.g., if your application is going to checksum some
larger "transmission unit", then it could possibly benefit from the
computations that you are performing here. This is the flipside of
letting each encapsulation layer IN THE STACK benefit from
computations made in other layers (i.e., incrementally updating the
sums).

Alternatively, if your application is going to make some particular
change to the packet and then return/forward it, you may be able to
benefit from these previous computations.

Reconsider whether a strict TCP implementation is necessary or if
you can get what you want from other protocols -- even if you have
to add some bit of "acknowledgement" in the application layer.

Lastly, look at what you are going to *do* in the event of a failed
checksum -- if *all* you are EFFECTIVELY doing is updating some
counters/statistics, then REconsider whether you actually need to
invest this effort.
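On the "incrementally updating the sums" point: when a single 16-bit
field of a packet changes, the checksum can be patched rather than
recomputed over the whole packet. A minimal C sketch of that update
(essentially the HC' = ~(~HC + ~m + m') formula later standardized in
RFC 1624; the function name is illustrative):

```c
#include <stdint.h>

/* Patch a 16-bit one's-complement checksum when one 16-bit field
 * changes from old_word to new_word, without re-summing the packet.
 * Implements HC' = ~(~HC + ~m + m') with end-around carry folds. */
static uint16_t cksum_update(uint16_t cksum, uint16_t old_word,
                             uint16_t new_word)
{
    uint32_t sum = (uint16_t)~cksum;     /* ~HC */
    sum += (uint16_t)~old_word;          /* ~m  */
    sum += new_word;                     /* m'  */
    while (sum >> 16)                    /* fold carries */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```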
Reply by segher October 26, 2011
On Oct 25, 3:26 pm, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
> One alternative approach is to use shift & mask to split each 16-byte
> block into two halves and aggregate them separately, this will work for
> up to 128K packet sizes, which is well above what TCP will allow:
>
> next16:
>     movdqa xmm7,[esi]
>     lea esi,[esi+16]
>     movdqa xmm6,xmm2    ; 4 x 0x0000ffff masks
>     pand xmm6,xmm7
>     psrld xmm7,16
>     paddd xmm0,xmm6
>     paddd xmm0,xmm7
>     sub ecx,1
>     jnz next16
>
> Notice that this approach needs one additional SIMD instruction per
> 16-byte block.
Another way is to accumulate both the data, and separately the data
shifted right by some bits (say 16 in your case). Since you add at
most 65536 items, that second accumulator differs from the top 32
bits of a true 48-bit accumulator by less than 65536, so you can
reconstruct it. In pseudo-C:

  u32 acc_hi = 0;
  u32 acc_lo = 0;
  for (j = 0; j < N; j++) {
      acc_lo += data[j];
      acc_hi += data[j] >> 16;
  }
  acc_hi += ((acc_lo >> 16) - acc_hi) & 0xffff;
  // now acc_hi is the top 32 bits of the 48-bit accumulator, and acc_lo
  // is the low 32 bits

This saves two instructions from your SSE loop, but it's much better
on most RISC.
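Fleshed out as runnable C (sum48 is an illustrative name), the trick
builds a 48-bit total out of plain 32-bit adds, with no carry flag
needed anywhere:

```c
#include <stddef.h>
#include <stdint.h>

/* The two-accumulator trick as runnable C: sum up to 65536 32-bit
 * words into a 48-bit total using only 32-bit additions. acc_hi
 * accumulates the data shifted right 16; afterwards it is within
 * 65536 of the true top 32 bits, whose low 16 bits must equal
 * (acc_lo >> 16), so a masked correction reconstructs it exactly. */
static uint64_t sum48(const uint32_t *data, size_t n)
{
    uint32_t acc_hi = 0, acc_lo = 0;

    for (size_t j = 0; j < n; j++) {
        acc_lo += data[j];
        acc_hi += data[j] >> 16;
    }
    acc_hi += ((acc_lo >> 16) - acc_hi) & 0xffff;

    /* acc_hi now holds bits 16..47 of the true sum; acc_lo holds
     * bits 0..31 (its top half duplicates acc_hi's bottom half) */
    return ((uint64_t)acc_hi << 16) | (acc_lo & 0xffff);
}
```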
Reply by dp October 25, 2011
On Oct 25, 4:26 pm, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
> ...
> On any architecture without this particular bug/feature/glitch the best
> you can do is almost certainly to unroll by a small amount and then keep
> a separate carry accumulator like your r0 above for use at the end.
Here is how I did it on power 6-7 years ago:
http://tgi-sci.com/vpaex/cksm3456.sa (or cksm3456.txt for the native
code list). Here is the critical part:

 * deal with the bulk of the data using .l accesses
        lsr.l+    #2,d2,d3             number of .l words
        move.l-   d3,ctr               setup counter
        rlwinm-   d2,#0,#32-2,#31,d2   number of bytes after last .l
        neg.l-    d2,d1                residue in next segment
        rlwinm-   d1,#0,#32-2,#31,d1   must be 0-3
        beq       nulllngc             if none .l to process
        lea       (-4,a0),a0           to use preincrement
 set    loop,*
        adde.l-   =(4,a0),d0           collect checksum like a champion
        brc       ?unc.dbnz,loop       all of the .l
        lea       (4,a0),a0            will use postincrement now
 nulllngc

Never thought of this as something consuming much resources, I don't
think it does really. I have never measured it (not seeing an issue
there); in a loop that tight the only stall can come from data
dependencies. And they probably do kick in, but wrestling them would
be too major and with doubtful results for 1500 bytes
(setup/registers save/restore etc).

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
Reply by Terje Mathisen October 25, 2011
Arlet Ottens wrote:
> On 10/25/2011 12:06 PM, Noob wrote:
>
>> I'm stuck with something along these lines
>>
>>   clrt            ;  t <- 0
>>   mov #0, r0      ; r0 <- 0
>> loop:
>>   mov.l @r4+, r1  ; r1 <- [r4] and post-increment r4
>>   addc  r1, r0    ; r0 <- r0 + r1 + 0 and t <- carry
>>   addc  #0, r0    ; r0 <- r0 + t and t <- 0
>>   dt r5           ; decrement r5 and set t if r5 == 0
>>   bf loop         ; branch to loop if t == 0
>>
>> I could unroll the loop a few times (most of the packets are
>> 1480-bytes long) to save as many "premature" carry adds, but
>> I don't really like this solution.
>
> Why not? Unrolling this loop will give the best results. Even unrolling
> by a factor of 8 should make the carry overhead insignificant, and only
> increase code size by a small amount.
Arlet, you're absolutely correct here.

This particular algorithm matches up perfectly with the original x86
instruction set, where INC and DEC avoid updating the carry flag, so
that you can use DEC / JNZ or even JCXNZ to do loop control while
the carry flag survives from round to round.

On any architecture without this particular bug/feature/glitch the
best you can do is almost certainly to unroll by a small amount and
then keep a separate carry accumulator like your r0 above for use at
the end.

On machines without carry flags, or when working with SIMD
registers, you can do the same using explicit parallel compare
instructions, then aggregating the flag values resulting from those
compares.

I.e. the following loop should handle 32 bytes per iteration, using
10 SSE instructions, so 3+ bytes/cycle without having to dual-issue
the SIMD opcodes. (The main tweak is the need to bias all input
values, from [0..64K> to [-32K..32K>, so that signed compares will
work.)

next32:
    pxor xmm7,[esi]      ; Grab next 8 words, invert sign bits
    pxor xmm6,[esi+16]   ; and another 16
    lea esi,[esi+32]
    paddw xmm0,xmm7
    pcmpgtw xmm7,xmm0    ; Set flag if src > sum
    paddw xmm0,xmm6
    pcmpgtw xmm6,xmm0    ; ditto
    psubw xmm1,xmm7      ; Subtract -1 flag -> increment
    movdqa xmm7,xmm2     ; Load sign bits
    psubw xmm1,xmm6
    movdqa xmm6,xmm2
    sub ecx,1
    jnz next32

One alternative approach is to use shift & mask to split each
16-byte block into two halves and aggregate them separately; this
will work for up to 128K packet sizes, which is well above what TCP
will allow:

next16:
    movdqa xmm7,[esi]
    lea esi,[esi+16]
    movdqa xmm6,xmm2     ; 4 x 0x0000ffff masks
    pand xmm6,xmm7
    psrld xmm7,16
    paddd xmm0,xmm6
    paddd xmm0,xmm7
    sub ecx,1
    jnz next16

Notice that this approach needs one additional SIMD instruction per
16-byte block.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
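The scalar analogue of that compare trick can be sketched in portable
C (add_words is an illustrative name): each carry is detected with an
unsigned compare instead of a flag, counted separately, and folded
back in at the end, one's-complement style:

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of 32-bit words without touching a carry
 * flag: after each wrapping add, (sum < p[i]) is 1 exactly when the
 * add carried out, so carries are counted in a second accumulator
 * and folded back in at the end (end-around carry). */
static uint32_t add_words(const uint32_t *p, size_t n)
{
    uint32_t sum = 0, carries = 0;

    for (size_t i = 0; i < n; i++) {
        sum += p[i];                /* may wrap mod 2^32 */
        carries += (sum < p[i]);    /* 1 iff the add wrapped */
    }
    sum += carries;                 /* end-around fold */
    if (sum < carries)              /* folding can wrap once more */
        sum += 1;
    return sum;
}
```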
Reply by Arlet Ottens October 25, 2011
On 10/25/2011 12:06 PM, Noob wrote:

> I'm stuck with something along these lines
>
>   clrt            ;  t <- 0
>   mov #0, r0      ; r0 <- 0
> loop:
>   mov.l @r4+, r1  ; r1 <- [r4] and post-increment r4
>   addc  r1, r0    ; r0 <- r0 + r1 + 0 and t <- carry
>   addc  #0, r0    ; r0 <- r0 + t and t <- 0
>   dt r5           ; decrement r5 and set t if r5 == 0
>   bf loop         ; branch to loop if t == 0
>
> I could unroll the loop a few times (most of the packets are
> 1480-bytes long) to save as many "premature" carry adds, but
> I don't really like this solution.
Why not? Unrolling this loop will give the best results. Even unrolling
by a factor of 8 should make the carry overhead insignificant, and only
increase code size by a small amount.
Reply by Noob October 25, 2011
Hello,

I'm working with a 450-MHz SH-4 core. The architecture reminds me
of the original Pentium in several ways, in that it is dual-pipeline
with some restrictions on which types of instructions can be paired.

When the system is receiving a 100-Mbps TCP stream (thus, at the same
time also sending roughly 2 Mbps of ACKs) approximately 13-15% of the
CPU is busy computing the internet checksum (C code compiled with
gcc-4.3.4 -Os)

This algorithm has several nice properties, e.g. it can be computed
ignoring endian-ness, and if one is careful with the carry, it can be
computed 32-bits (or 64-bits on a 64-bitter) at a time.

So, I was trying to write some Q&D assembly to see if I could do better
than gcc, using addc (add with carry).

The gist of my pseudo-code was

r4 = address of the data to checksum
r5 = count of words in the data (assume > 0)
t is the carry bit (don't know why it's called "t")

  clrt            ;  t <- 0
  mov #0, r0      ; r0 <- 0
loop:
  mov.l @r4+, r1  ; r1 <- [r4] and post-increment r4
  addc  r1, r0    ; r0 <- r0 + r1 + t and t <- carry
  if --r5 > 0 goto loop

Then, all that's left to do is to add the final carry bit, and
"fold" r0 down to 16 bits.

The major problem with my pseudo-implementation is the looping part,
because conditional branching is based on the carry bit, but I need
the carry bit preserved from one iteration to the next :-(

Typical looping code would be
  dt r5           ; decrement r5 and set t if r5 == 0
  bf loop         ; branch to loop if t == 0

I don't see a way out of this jam, do you?

I'm stuck with something along these lines

  clrt            ;  t <- 0
  mov #0, r0      ; r0 <- 0
loop:
  mov.l @r4+, r1  ; r1 <- [r4] and post-increment r4
  addc  r1, r0    ; r0 <- r0 + r1 + 0 and t <- carry
  addc  #0, r0    ; r0 <- r0 + t and t <- 0
  dt r5           ; decrement r5 and set t if r5 == 0
  bf loop         ; branch to loop if t == 0

I could unroll the loop a few times (most of the packets are
1480-bytes long) to save as many "premature" carry adds, but
I don't really like this solution.

Am I missing something obvious, or something not so obvious,
that would get me a nice, compact, and fast solution? :-)

Regards.