On 10/25/2011 3:06 AM, Noob wrote:
> I'm working with a 450-MHz SH-4 core. The architecture reminds me
> of the original Pentium in several ways, in that it is dual-pipeline
> with some restrictions on which types of instructions can be paired.
>
> When the system is receiving a 100-Mbps TCP stream (thus, at the same
> time also sending roughly 2 Mbps of ACKs) approximately 13-15% of the
> CPU is busy computing the internet checksum (C code compiled with
> gcc-4.3.4 -Os)
>
> This algorithm has several nice properties, e.g. it can be computed
> ignoring endian-ness, and if one is careful with the carry, it can be
> computed 32 bits (or 64 bits on a 64-bitter) at a time.
>
> So, I was trying to write some Q&D assembly to see if I could do better
> than gcc, using addc (add with carry).
>
> The gist of my pseudo-code was
>
> r4 = address of the data to checksum
> r5 = count of words in the data (assume > 0)
> t is the carry bit (don't know why it's called "t")
>
> clrt ; t <- 0
> mov #0, r0 ; r0 <- 0
> loop:
> mov.l @r4+, r1 ; r1 <- [r4] and post-increment r4
> addc r1, r0 ; r0 <- r0 + r1 + t and t <- carry
> if --r5 > 0 goto loop
>
> Then, all that's left to do is to add the final carry bit, and
> "fold" r0 down to 16 bits.
... assuming, of course, that you are operating over
a buffer having a size that is an *exact* multiple
of 4.
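For reference, the 32-bits-at-a-time accumulation and the final fold can be sketched in portable C. This is a hypothetical helper (names are mine, not from any particular stack), assuming the buffer size is an exact multiple of 4; it is paired with the plain 16-bit-at-a-time version, which must produce the same result because ones-complement addition is associative and commutative:

```c
#include <stddef.h>
#include <stdint.h>

/* Plain 16-bit-at-a-time reference sum (len must be a multiple of 2). */
static uint16_t csum16_ref(const void *buf, size_t len)
{
    const uint16_t *p = buf;
    uint32_t sum = 0;

    for (size_t i = 0; i < len / 2; i++)
        sum += p[i];
    while (sum >> 16)                        /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* 32-bit-at-a-time version: accumulate into 64 bits, then fold
 * 64 -> 32 -> 16, adding the carries back in at each step.
 * len must be an exact multiple of 4. */
static uint16_t csum32(const void *buf, size_t len)
{
    const uint32_t *p = buf;
    uint64_t sum = 0;

    for (size_t i = 0; i < len / 4; i++)
        sum += p[i];
    while (sum >> 32)
        sum = (sum & 0xffffffff) + (sum >> 32);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The result comes out in the buffer's own byte order, which is the endian-independence the quoted text mentions: fold and complement, then store without swapping.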
> The major problem with my pseudo-implementation is the looping part,
> because conditional branching is based on the carry bit, but I need
> the carry bit preserved from one iteration to the next :-(
Are you sure the decrement operation touches the carry?
Often, CY is unaffected but Z/NZ is signaled, instead
(by contrast with SUBTRACTING 1, which is the route to
take when you *really* want to affect CY).
> Typical looping code would be
> dt r5 ; decrement r5 and set t if r5 == 0
> bf loop ; branch to loop if t == 0
>
> I don't see a way out of this jam, do you?
>
> I'm stuck with something along these lines
>
> clrt ; t <- 0
> mov #0, r0 ; r0 <- 0
> loop:
> mov.l @r4+, r1 ; r1 <- [r4] and post-increment r4
> addc r1, r0 ; r0 <- r0 + r1 + 0 and t <- carry
> addc #0, r0 ; r0 <- r0 + t and t <- 0
> dt r5 ; decrement r5 and set t if r5 == 0
> bf loop ; branch to loop if t == 0
>
> I could unroll the loop a few times (most of the packets are
> 1480 bytes long) to save as many "premature" carry adds, but
> I don't really like this solution.
Why? Are you that pressed for space? Checksums are
a *huge* bit of overhead *if* (!) you need to support
them! Think hard on whether you want to skimp on space
here instead of someplace less critical (i.e., that
isn't being executed as frequently)
Depending on the cache available in the processor,
I typically unroll the loop into different sized
pieces and process N, 2N, 4N, ... "units" at a time
(this also takes care of the *actual* size of
the dataset instead of requiring it to be an
exact multiple of 32 bits)
E.g., (incredibly bogus pseudo-code):
while (remaining >= 4N) {
    ADD *ptr++    // 1
    ADC *ptr++    // 2
    ...
    ADC *ptr++    // 4N
    ADC 0         // fold in final CY
    remaining -= 4N
}
while (remaining >= 2N) {
    ADD *ptr++    // 1
    ADC *ptr++    // 2
    ...
    ADC *ptr++    // 2N
    ADC 0         // fold in final CY
    remaining -= 2N
}
while (remaining >= N) {
    ADD *ptr++    // 1
    ADC *ptr++    // 2
    ...
    ADC *ptr++    // N
    ADC 0         // fold in final CY
    remaining -= N
}
If the CPU has alignment restrictions, you need to
handle those up front to avoid faults, etc. If
you exercise some care in how you fabricate your
mbufs, you can sometimes skip this.
If you really want to cut down on TEXT, you can
represent "N" as "chunksize" and halve it each
time a "while" fails (then use a switch to
skip over a certain number of ADC's based on
the value of chunksize). Something like:
chunksize = 4N
while (chunksize >= N) {
    while (remaining >= chunksize) {
        switch (chunksize) {
        case 4N:
            ADC *ptr++   // 4N
            ...
            ADC *ptr++   // 2N+1
            // fall thru
        case 2N:
            ADC *ptr++   // 2N
            ...
            ADC *ptr++   // N+1
            // fall thru
        case N:
            ADC *ptr++   // N
            ...
            ADC *ptr++   // 1
        }
        ADC 0            // fold in final CY
        remaining -= chunksize
    }
    chunksize /= 2
}
[ignore the discrepancies in my counting, I'm just
trying to illustrate a general approach :> ]
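For what it's worth, C's switch can express that jump-into-the-chain directly (Duff's device); a sketch with hypothetical names, again using a wide accumulator in place of real ADC:

```c
#include <stddef.h>
#include <stdint.h>

/* Duff's-device entry: jump into the middle of an unrolled chain so
 * the leftover (nwords % 4) words are handled by the same code as the
 * full groups of 4. */
static uint32_t csum_duff(const uint32_t *p, size_t nwords)
{
    uint64_t sum = 0;
    size_t n = (nwords + 3) / 4;   /* passes through the chain */

    if (nwords == 0)
        return 0;
    switch (nwords % 4) {          /* enter mid-chain for the remainder */
    case 0: do { sum += *p++;      /* each case falls through */
    case 3:      sum += *p++;
    case 2:      sum += *p++;
    case 1:      sum += *p++;
            } while (--n > 0);
    }
    while (sum >> 32)              /* fold deferred carries */
        sum = (sum & 0xffffffff) + (sum >> 32);
    return (uint32_t)sum;
}
```

Whether this beats the plain staged loops depends on the branch predictor and I-cache; on a small core the simpler structure is often faster in practice.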
> Am I missing something obvious, or something not so obvious,
> that would get me a nice, compact, and fast solution? :-)
Make sure you consider when and why you are computing
the checksums! Depending on your application and the
facilities that your stack provides, you might
consider a more "undisciplined" API for all of this.
I.e., if you can *omit* the computation, you save even
bigger!
[Layering can be A Bad Thing]
E.g., you might want to restrict the portion of the
packet/buffer on which you perform this computation.
I.e., you may opt NOT to check some of the payload
(depends on your application) -- this is becoming common
in UPnP type applications, for example.
You *also* might want to make some of this data *visible*
to your application! E.g., if your application is going
to checksum some larger "transmission unit", then it
could possibly benefit from the computations that you are
performing, here. This is the flipside of letting each
encapsulation layer IN THE STACK benefit from computations
made in other layers (i.e., incrementally updating the
sums)
Alternatively, if your application is going to make some
particular change to the packet and then return/forward
it, you may be able to benefit from these previous
computations.
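On that note, RFC 1624 gives the incremental-update identity HC' = ~(~HC + ~m + m') for a single 16-bit field changing from m to m', so a forwarder can patch the checksum without rescanning the packet. A quick C sketch (hypothetical names):

```c
#include <stdint.h>

/* Incremental checksum update, per RFC 1624 (eqn. 3):
 *   HC' = ~(~HC + ~m + m')
 * hc is the old checksum, m the old 16-bit field value, mp the new
 * one; all values in the same byte order as the packet. */
static uint16_t csum_update(uint16_t hc, uint16_t m, uint16_t mp)
{
    uint32_t sum = (uint16_t)~hc;
    sum += (uint16_t)~m;
    sum += mp;
    while (sum >> 16)                        /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

(The usual ones-complement ±0 caveat applies near hc == 0xFFFF; RFC 1624 discusses why this formulation is preferred over the RFC 1141 one.)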
Reconsider whether a strict TCP implementation is
necessary or if you can get what you want from other
protocols -- even if you have to add some bit of
"acknowledgement" in the application layer.
Lastly, look at what you are going to *do* in the event
of a failed checksum -- if *all* you are EFFECTIVELY doing
is updating some counters/statistics, then REconsider whether
you actually need to invest this effort.