EmbeddedRelated.com
Forums

RSA on 8bit micro

Started by W Trembyle January 24, 2007
On Jan 30, 10:51 am, "tum_" <atoumantsev_s...@mail.ru> wrote:
> There's only partial code in the article, not the full implementation
> and I'm a bit puzzled with the sentence "Division takes only a little
> longer than multiplication..." - I'm a bit in doubt about that but
> maybe I'm wrong. IIRC, in my experience long division took at least
> twice as long as multiplication.
You're right. Division takes at least 2-3 times as long as multiplication using the fastest general-purpose long-division algorithm, described in Section 4.3.1 of Knuth's The Art of Computer Programming, Vol. 2 (Algorithm D). Many programmers are surprised to discover how complicated division is relative to multiplication when they implement a big-integer library.

-Le Chaud Lapin-
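The asymmetry is easy to see from the structure of the two algorithms. Here is an illustrative Python sketch (a real 8-bit implementation would of course be C or assembly with machine-word limbs) of the schoolbook multiplication inner loop; Knuth's Algorithm D for division wraps essentially this same carry-propagating loop in quotient-digit estimation plus an occasional correction step, which is where the extra cost lives:

```python
# Schoolbook O(n^2) multiplication on little-endian base-2^16 limb
# arrays.  Helper names are mine, for illustration only.

BASE = 1 << 16

def mul(a, b):
    """Product of two little-endian limb arrays."""
    r = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = r[i + j] + ai * bj + carry
            r[i + j] = t % BASE        # low word stays in place
            carry = t // BASE          # high word propagates
        r[i + len(b)] = carry
    return r

def to_limbs(x, n):
    return [(x >> (16 * i)) & 0xFFFF for i in range(n)]

def from_limbs(l):
    return sum(v << (16 * i) for i, v in enumerate(l))
```

Division needs this multiply-subtract work per quotient digit *plus* the estimation logic, so a 2-3x ratio is plausible.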
On Jan 30, 7:27 pm, "Le Chaud Lapin" <jaibudu...@gmail.com> wrote:
> You're right. Division takes at least 2-3 times as long as
> multiplication using the fastest general-purpose long-division
> algorithm, described in Section 4.3.1 of one of Knuth's books. Many
> programmers are surprised to discover how complicated division is
> relative to multiplication when they implement a big integer library.
>
> -Le Chaud Lapin-
I posted this message through Google Groups about 12 hours ago and it still hasn't appeared. Luckily, I always copy the text to the clipboard before hitting Send, and luckily it was still there. Second attempt:

That's my experience too, although I have to admit I've never implemented long division myself; I used an existing implementation for my RSA (and I remember it was based on Knuth). I wouldn't expect a person like Burt Kaliski to publish a statement that far wrong, though.

I'd be interested in Francois's opinion on this matter, as I remember one of his posts claiming that the classical ModExp algorithm (the one using long division) can be as good as the Montgomery approach (where no classical division is used). (Francois, I'm 90% sure it was your post, but if I mistook you for someone else - sorry.)

I also wonder whether a hardware DIV instruction affects long-division performance - has anyone had any experience with such platforms?
In article <1170230024.658846.177070@v45g2000cwv.googlegroups.com>,
 "tum_" <atoumantsev_spam@mail.ru> wrote:

> On Jan 30, 7:27 pm, "Le Chaud Lapin" <jaibudu...@gmail.com> wrote:
> > Division takes at least 2-3 times as long as multiplication using
> > the fastest general-purpose long-division algorithm, described
> > in Section 4.3.1 of one of Knuth's books. Many programmers are
> > surprised to discover how complicated division is relative to
> > multiplication when they implement a big integer library.
>
> That's my experience too, although I have to admit I've never
> implemented long division myself, I used an existing implementation
> for my RSA (and I remember it was based on Knuth).
> I wouldn't expect a person like Burt Kaliski to publish the statement
> that is that wrong, though.
> I'd be interested in Francois's opinion on this matter as I remember
> one of his posts claiming that classical ModExp algorithm (the one
> using long division) can be as good as Montgomery approach (where no
> classical division is used).
My two cents' worth:

In the context of RSA implementation in software, with a k-bit modulus n and a w*w->2w-bit hardware multiplication, use of asymptotically optimal algorithms (FFT, even Toom-Cook) is never advocated for realistic parameters (w>=8, k<=2048).

Some advocate the Karatsuba method. Indeed it does pay, slightly, for multiplication, above some threshold (k/w > ~50), but:
- efficient implementations of Karatsuba multiplication for variable-size arguments are complex and hard to check; simple ones are less efficient.
- use of Karatsuba multiplication largely increases both RAM and code footprint.
- use of Karatsuba multiplication in modular reduction does NOT pay at all for practical k/w ratios, thus modular reduction (whose cost, in any case, exceeds that of multiplication) limits any overall gain from using Karatsuba.
- Karatsuba breaks simultaneous multiplication and modular reduction, a technique with many advantages, including reduced RAM footprint and simpler argument indexing. [The sequential approach to computing x*y mod n is to compute the 2k-bit x*y, then reduce it modulo n to k bits; the simultaneous approach performs modular reduction on partial results, which remain about k bits at all times.]

Thus practical implementations (that I know of) all use methods whose asymptotic cost is O(k^3) for the private-key RSA modexp, and O(k^2) for public-key with a fixed exponent (e=3, 65537..). They can be classified according to at least two criteria:
- Montgomery (vs natural) arithmetic
- simultaneous (vs sequential) modular reduction.

Montgomery arithmetic is popular for high-performance private-key modexp: the difficult quotient estimation of natural-arithmetic modular reduction becomes simple; off-by-one (or two) quotient estimation (rare but inevitable in natural arithmetic) is plain gone; simultaneous modular reduction is quite easy. The main drawback is extra work at the beginning and end of the computation, but that cost is marginal for a private-key operation.
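The "quotient estimation becomes simple" point can be sketched at the integer level in Python (function names are mine, purely illustrative; a production library interleaves this limb by limb, e.g. the CIOS variant):

```python
# Montgomery multiplication sketch.  R = 2**k with R > n, n odd;
# n_neg_inv = -n^-1 mod R.  Requires Python 3.8+ for pow(n, -1, R).

def mont_setup(n, k):
    R = 1 << k
    n_neg_inv = (-pow(n, -1, R)) % R
    return R, n_neg_inv

def mont_mul(a, b, n, k, R, n_neg_inv):
    """Return a*b*R^-1 mod n, for a and b in Montgomery form.
    The 'quotient estimation' is just a multiplication mod R, and the
    division by R is an exact shift -- no trial quotients at all."""
    t = a * b
    m = (t * n_neg_inv) % R       # chosen so that t + m*n = 0 (mod R)
    u = (t + m * n) >> k          # exact division by R
    return u - n if u >= n else u # at most one conditional subtract

def to_mont(x, n, R):
    """The 'extra work at the beginning': convert into Montgomery form."""
    return (x * R) % n
```

Converting back out (the "extra work at the end") is just `mont_mul(x_bar, 1, ...)`, which is why the overhead is marginal for a long private-key exponentiation but noticeable for a single e=3 operation.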
Thus many high-performance crypto libraries use Montgomery with simultaneous multiplication and modular reduction (most hardware crypto engines do that, too).

I often recommend natural arithmetic with simultaneous multiplication and modular reduction. Yes, it is possible, and when done properly, the constant c in the asymptotic cost O(c.k^2) of modular multiplication and reduction ends up the same as in Montgomery with simultaneous multiplication and modular reduction; there are only linearly more multiplications/adds. It does beat Montgomery (by saving on pre/post computations) in at least two cases:
- low fixed exponent (public-key only, e.g. as in checking an RSA signature)
- tight memory
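The natural-arithmetic simultaneous (interleaved) reduction can be sketched like this (illustrative Python, names mine; the reduction is written as a plain `% n` where a real implementation would use an estimated-quotient multiply-and-subtract, occasionally off by one):

```python
# x*y mod n with the reduction interleaved per base-2^w digit of y,
# most-significant digit first.  The accumulator stays near k bits
# instead of growing to the full 2k-bit product.

def modmul_interleaved(x, y, n, w=8):
    digits = (y.bit_length() + w - 1) // w
    acc = 0
    for shift in range(digits * w - w, -1, -w):
        d = (y >> shift) & ((1 << w) - 1)   # next digit of y
        acc = ((acc << w) + x * d) % n      # shift, accumulate, reduce
    return acc
```

No Montgomery-form conversion in or out, which is the saving claimed above for low fixed exponents and tight memory.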
> (Francois, I'm 90% sure it was your post but if I mistook you
> for someone else - sorry).
That most likely was a post of mine.

Francois Grieu
On Jan 31, 11:02 am, Francois Grieu <fgr...@gmail.com> wrote:
> In article <1170230024.658846.177...@v45g2000cwv.googlegroups.com>,
> "tum_" <atoumantsev_s...@mail.ru> wrote: (..)
>
> My two cents worth: (..)
>
> Some advocate the Karatsuba method. Indeed it does pay, slightly,
> for multiplication, above some threshold (k/w > ~50), but:
I'd say that the threshold is very platform-dependent. By using Karatsuba we reduce the number of multiplications but increase the number of additions/subtractions (and probably reads/writes), so the threshold depends on the MUL vs ADD performance ratio of a given system. When I estimated this threshold for the ARM7TDMI, where MUL is pretty fast, I came to the conclusion that Karatsuba starts to pay off at around 1800-1900-bit lengths (k/w = almost 60, with w=32). Whereas when a friend of mine worked on the same task for the Z80 (no hardware MUL at all), he reported a very noticeable speed-up on 256-bit numbers, if not shorter.
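For reference, one recursion level of the method under discussion looks like this (Python sketch; the 50-word threshold is the thread's ballpark figure, not a measured constant):

```python
# Karatsuba: three half-size multiplies replace four, at the cost of
# extra additions/subtractions -- which is exactly why the crossover
# point tracks a platform's MUL/ADD cost ratio.

THRESHOLD_BITS = 50 * 32   # ~50 words of w = 32 (assumed, per thread)

def karatsuba(x, y):
    nbits = max(x.bit_length(), y.bit_length())
    if nbits <= THRESHOLD_BITS:
        return x * y       # below the crossover, schoolbook wins
    h = nbits // 2
    x1, x0 = x >> h, x & ((1 << h) - 1)
    y1, y0 = y >> h, y & ((1 << h) - 1)
    z2 = karatsuba(x1, y1)
    z0 = karatsuba(x0, y0)
    z1 = karatsuba(x0 + x1, y0 + y1) - z2 - z0   # the extra adds live here
    return (z2 << (2 * h)) + (z1 << h) + z0
```

On a machine with no hardware MUL, the "multiplies" being saved are themselves loops of shifts and adds, so the effective MUL/ADD ratio is enormous and the crossover drops, consistent with the Z80 observation.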
> - efficient implementations of Karatsuba multiplication
> for variable size arguments are complex and hard to check;
> simple ones are less efficient. (..)
>
> They can be classified according to at least two criteria:
> - Montgomery (vs natural) arithmetic
> - simultaneous (vs sequential) modular reduction. (..)
>
> I often recommend natural arithmetic with simultaneous
> multiplication and modular reduction. (..)
>
> Francois Grieu
Thanks for keeping up the (interesting) thread.

You reminded me of a problem that puzzled me a while ago and which I failed to solve for myself. Here's my post in sci.crypt from July 2005 that remained unanswered:
http://groups.google.com/group/sci.crypt/browse_thread/thread/2d8157967b3806a/e72cc68404f51033?lnk=gst&q=tum_&rnum=6#e72cc68404f51033

Basically, for the methods that you call 'simultaneous/sequential', Koc in his paper uses the terms 'integrated/separated' correspondingly. At the time I was studying the paper, I came to the conclusion that it is not possible to combine any 'integrated' (simultaneous) method with the optimised squaring algorithm. And if this is true, then we are really losing a lot of performance on exponents such as 65537, where squaring is the dominant operation. On exponent 3 this of course is not an issue.

Hmm, it's just occurred to me that I'm talking about Montgomery arithmetic, not the natural kind that you "advocate" ;) Anyway, do you have experience implementing the simultaneous approach with optimised squaring?

Alexei.
"tum_" <atoumantsev_spam@mail.ru> wrote:

> [..] I remember
> searching for it about 3 years ago and, iirc, there was no _free_
> access to this particular article on the Dobbs site and I didn't
> bother to subscribe and pay at that time.

I agree that a few years ago WWW.DDJ.com did not offer the CDs'
copies of the articles gratis.

P.S. I am sorry for screwing up the References field in news:epn9c7$88f$1@newsserver.cilea.it
In article <1170246700.858450.170720@h3g2000cwc.googlegroups.com>,
 "tum_" <atoumantsev_spam@mail.ru> wrote:

> On Jan 31, 11:02 am, Francois Grieu <fgr...@gmail.com> wrote:
> > Some advocate the Karatsuba method. Indeed it does pay, slightly,
> > for multiplication, above some threshold (k/w > ~50) (..)
>
> I'd say that the threshold is very platform dependent. (..) When I
> estimated this threshold for ARM7TDMI, where MUL is pretty fast, I
> came up to the conclusion that Karatsuba will start to pay off at
> around 1800-1900 bit lengths (k/w = almost 60, where w=32). While
> when a friend of mine worked at the same task for Z80 (no hardware
> mul at all) he reported very noticeable speed up on 256 bit numbers
> if not shorter.
Indeed, the threshold is very platform dependent, for the most part because w (word size at input of the hardware multiplier) is platform independent. And on the Z80 there is no hardware multiplier, so w maybe should be considered about 3 (a compromise between 1, for a 1-bit processor, and 8, for an 8-bit processor with MUL, e.g. the 8051). This brings the above numbers more into balance.
> You reminded me of a problem that puzzled me a while ago and which I
> failed to solve for myself. (..) I came to conclusion that it is not
> possible to combine any 'integrated' (simultaneous) method with the
> optimised squaring algorithm.
My experience and your conclusion are in full agreement.
> And if this is true then we are really losing a lot of performance
> on exponents such as 65537, where squaring is the dominant operation.
The squaring trick saves 1/4 of the elementary multiplications in a mulmod that is actually a sqrmod; about 2/3 to, say, 7/8 of the mulmods are sqrmods (depending on memory constraints / outer exponentiation technique), lowering that potential gain to say 20%. **But** the squaring trick does not save on additions, and on the contrary makes half of them more complex to implement. This considerably lowers the gain (on machines with hardware multiplication). The worst, IMHO, is that the squaring trick collides with simultaneous/integrated modular reduction, which does save considerably on memory accesses and index computations.
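The squaring trick itself, for readers who haven't implemented it, looks like this on limb arrays (illustrative Python, helper names mine): each cross product a[i]*a[j] with i < j is computed once and doubled instead of computed twice, and the doubling is exactly the added addition-side complexity mentioned above.

```python
# Squaring a little-endian base-2^16 limb array.

BASE = 1 << 16

def square(a):
    n = len(a)
    r = [0] * (2 * n)
    for i in range(n):
        t = r[2 * i] + a[i] * a[i]          # diagonal term, once
        r[2 * i] = t % BASE
        carry = t // BASE
        for j in range(i + 1, n):           # off-diagonal terms, doubled
            t = r[i + j] + 2 * a[i] * a[j] + carry
            r[i + j] = t % BASE
            carry = t // BASE
        k = i + n
        while carry:                        # propagate remaining carry
            t = r[k] + carry
            r[k] = t % BASE
            carry = t // BASE
            k += 1
    return r

def to_limbs(x, n):
    return [(x >> (16 * i)) & 0xFFFF for i in range(n)]

def from_limbs(l):
    return sum(v << (16 * i) for i, v in enumerate(l))
```

Note the doubled term 2*a[i]*a[j] no longer fits in a double-width word on real hardware, which is one concrete form of the "half of the additions become more complex" cost.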
> Hmm, it's just occurred to me that I'm talking about Mongomery
> arithmetics, not the natural one that you "advocate" ;)
My discussion applies to both.
> Anyway, do you have the experience of implementing the simultaneous
> approach for optimised squaring?
No. I won't bet that it can't be done, but I fail to see how.

Francois Grieu
In article <fgrieu-46BDB7.14353731012007@news-4.proxad.net>,
I mistakenly wrote:

> w (word size at input of hardware multiplier) is platform
> independent.
That should be platform-dependent, of course.

Francois Grieu
On Jan 31, 1:52 pm, Francois Grieu <fgr...@gmail.com> wrote:
> In article <fgrieu-46BDB7.14353731012...@news-4.proxad.net>,
> I mistakenly wrote:
>
> > w (word size at input of hardware multiplier) is platform
> > independent.
>
> That should be platform-dependent, of course.
>
> Francois Grieu
I hadn't noticed the typo since it was obvious what you meant. I'll probably add some comments a bit later; I need to read your post again more carefully (no time at the moment).
On Jan 31, 1:35 pm, Francois Grieu <fgr...@gmail.com> wrote:
> In article <1170246700.858450.170...@h3g2000cwc.googlegroups.com>,
> "tum_" <atoumantsev_s...@mail.ru> wrote: (..)
>
> Indeed, the threshold is very platform dependent, for the most part
> because w (word size at input of the hardware multiplier) is platform
> independent. And on the Z80 there is no hardware multiplier, so w
> maybe should be considered about 3 (a compromise between 1, for a
> 1-bit processor, and 8, for an 8-bit processor with MUL, e.g. the
> 8051). This brings the above numbers more into balance.
I'm ready to agree that your 'k/w = ~50' can be used as a rough estimate and is probably applicable to many (or most) real-life platforms. But I still see it as governed by the MUL vs ADD instruction performance ratio. Consider, for example, two otherwise identical platforms where the only difference is that on platform 1 the MUL instruction takes 3 cycles, while on platform 2 it takes 20 cycles. Your formula won't work :-)

(In general, it's hard to produce any simple & precise formula, as there are quite a lot of other factors involved - memory speed, for instance. Hence, if you want to be precise, the only way to go is to implement and measure.)
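The dependence on the MUL/ADD ratio can be made quantitative with a toy cost model (all constants here are invented for illustration, not measured): one Karatsuba level turns n^2 word-multiplies into (3/4)n^2 of them, paying roughly c*n extra word-adds, so the crossover n shrinks as MUL gets more expensive relative to ADD.

```python
# Toy crossover estimator.  Schoolbook cost ~ n^2*(M+A);
# one Karatsuba level ~ (3/4)*n^2*(M+A) + c*n*A, with c an assumed
# constant for the extra add/subtract traffic.  Karatsuba wins when
# (1/4)*n^2*(M+A) > c*n*A, i.e. n > 4*c*A/(M+A).

def karatsuba_crossover(mul_cycles, add_cycles, extra_adds_per_word=32):
    """Smallest n (in words) where one Karatsuba level wins,
    under the toy model above."""
    return int(4 * extra_adds_per_word * add_cycles
               / (mul_cycles + add_cycles)) + 1
```

With a 20-cycle MUL the estimated crossover is far lower than with a 3-cycle MUL, matching the ARM7TDMI vs Z80 observations in direction, though any real threshold must be measured.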
> My experience and your conclusion are in full agreement. (..)
>
> The squaring trick saves 1/4 of elementary multiplications in a
> mulmod that is actually a sqrmod (..) **But** the squaring trick
> does not save on additions, and much to the contrary makes half of
> them more complex to implement. (..) The worse IMHO is that the
> squaring trick at least collides with simultaneous/integrated
> modular reduction, which does save considerably on memory accesses
> and index computations.
Still need to think this over a bit more... But the reasoning sounds quite correct.
> > Anyway, do you have the experience of implementing the simultaneous
> > approach for optimised squaring?
>
> No. I won't bet that it can't be done, but fail to see how.
>
> Francois Grieu
PS. I've only recently realised that this discussion is not in sci.crypt, which probably explains the absence of several well-known people I would have expected to comment on the topic...