EmbeddedRelated.com
Forums

RSA on 8bit micro

Started by W Trembyle January 24, 2007
On Jan 30, 10:51 am, "tum_" <atoumantsev_s...@mail.ru> wrote:
> There's only partial code in the article, not the full implementation
> and I'm a bit puzzled with the sentence "Division takes only a little
> longer than multiplication..." - I'm a bit in doubt about that but
> maybe I'm wrong. IIRC, in my experience long division took at least
> twice as long as multiplication.
You're right. Division takes at least 2-3 times as long as multiplication using the fastest general-purpose long-division algorithm, described in Section 4.3.1 of Knuth's The Art of Computer Programming, Vol. 2 (Algorithm D). Many programmers are surprised to discover how complicated division is relative to multiplication when they implement a big-integer library.

-Le Chaud Lapin-
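The asymmetry is easy to see from the structure of the two algorithms. Here is an illustrative Python sketch (a real 8-bit implementation would of course be C or assembly with machine-word limbs) of the schoolbook multiplication inner loop; Knuth's Algorithm D for division wraps essentially this same carry-propagating loop in quotient-digit estimation plus an occasional correction step, which is where the extra cost lives:

```python
# Schoolbook O(n^2) multiplication on little-endian base-2^16 limb
# arrays.  Helper names are mine, for illustration only.

BASE = 1 << 16

def mul(a, b):
    """Product of two little-endian limb arrays."""
    r = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = r[i + j] + ai * bj + carry
            r[i + j] = t % BASE        # low word stays in place
            carry = t // BASE          # high word propagates
        r[i + len(b)] = carry
    return r

def to_limbs(x, n):
    return [(x >> (16 * i)) & 0xFFFF for i in range(n)]

def from_limbs(l):
    return sum(v << (16 * i) for i, v in enumerate(l))
```

Division needs this multiply-subtract work per quotient digit *plus* the estimation logic, so a 2-3x ratio is plausible.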
On Jan 30, 7:27 pm, "Le Chaud Lapin" <jaibudu...@gmail.com> wrote:
> You're right. Division takes at least 2-3 times as long as
> multiplication using the fastest general-purpose long-division
> algorithm, described in Section 4.3.1 of one of Knuth's books. Many
> programmers are surprised to discover how complicated division is
> relative to multiplication when they implement a big integer library.
>
> -Le Chaud Lapin-
I posted this message through Google Groups about 12 hours ago and it still hasn't appeared. Luckily, I always copy the text to the clipboard before hitting Send, and luckily it was still there. Second attempt:

That's my experience too, although I have to admit I've never implemented long division myself; I used an existing implementation for my RSA (and I remember it was based on Knuth). I wouldn't expect a person like Burt Kaliski to publish a statement that far wrong, though.

I'd be interested in Francois's opinion on this matter, as I remember one of his posts claiming that the classical ModExp algorithm (the one using long division) can be as good as the Montgomery approach (where no classical division is used). (Francois, I'm 90% sure it was your post, but if I mistook you for someone else - sorry.)

I also wonder whether a hardware DIV instruction affects long-division performance - has anyone had any experience with such platforms?
In article <1170230024.658846.177070@v45g2000cwv.googlegroups.com>,
 "tum_" <atoumantsev_spam@mail.ru> wrote:

> On Jan 30, 7:27 pm, "Le Chaud Lapin" <jaibudu...@gmail.com> wrote:
> > Division takes at least 2-3 times as long as multiplication using
> > the fastest general-purpose long-division algorithm, described
> > in Section 4.3.1 of one of Knuth's books. Many programmers are
> > surprised to discover how complicated division is relative to
> > multiplication when they implement a big integer library.
>
> That's my experience too, although I have to admit I've never
> implemented long division myself, I used an existing implementation
> for my RSA (and I remember it was based on Knuth).
> I wouldn't expect a person like Burt Kaliski to publish the statement
> that is that wrong, though.
> I'd be interested in Francois's opinion on this matter as I remember
> one of his posts claiming that classical ModExp algorithm (the one
> using long division) can be as good as Montgomery approach (where no
> classical division is used).
My two cents' worth:

In the context of RSA implementation in software, with a k-bit modulus n and a w*w->2w-bit hardware multiplication, use of asymptotically optimal algorithms (FFT, even Toom-Cook) is never advocated for realistic parameters (w>=8, k<=2048).

Some advocate the Karatsuba method. Indeed it does pay, slightly, for multiplication, above some threshold (k/w > ~50), but:
- efficient implementations of Karatsuba multiplication for variable-size arguments are complex and hard to check; simple ones are less efficient.
- use of Karatsuba multiplication largely increases both RAM and code footprint.
- use of Karatsuba multiplication in modular reduction does NOT pay at all for practical k/w ratios, thus modular reduction (whose cost, in any case, exceeds that of multiplication) limits any overall gain from using Karatsuba.
- Karatsuba breaks simultaneous multiplication and modular reduction, a technique with many advantages, including reduced RAM footprint and simpler argument indexing. [The sequential approach to computing x*y mod n is to compute the 2k-bit x*y, then reduce it modulo n to k bits; the simultaneous approach performs modular reduction on partial results, which remain about k bits at all times.]

Thus practical implementations (that I know of) all use methods whose asymptotic cost is O(k^3) for the private-key RSA modexp, and O(k^2) for public-key with a fixed exponent (e=3, 65537..). They can be classified according to at least two criteria:
- Montgomery (vs natural) arithmetic
- simultaneous (vs sequential) modular reduction.

Montgomery arithmetic is popular for high-performance private-key modexp: the difficult quotient estimation of natural-arithmetic modular reduction becomes simple; off-by-one (or two) quotient estimation (rare but inevitable in natural arithmetic) is plain gone; simultaneous modular reduction is quite easy. The main drawback is extra work at the beginning and end of the computation, but that cost is marginal for a private-key operation.
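The "quotient estimation becomes simple" point can be sketched at the integer level in Python (function names are mine, purely illustrative; a production library interleaves this limb by limb, e.g. the CIOS variant):

```python
# Montgomery multiplication sketch.  R = 2**k with R > n, n odd;
# n_neg_inv = -n^-1 mod R.  Requires Python 3.8+ for pow(n, -1, R).

def mont_setup(n, k):
    R = 1 << k
    n_neg_inv = (-pow(n, -1, R)) % R
    return R, n_neg_inv

def mont_mul(a, b, n, k, R, n_neg_inv):
    """Return a*b*R^-1 mod n, for a and b in Montgomery form.
    The 'quotient estimation' is just a multiplication mod R, and the
    division by R is an exact shift -- no trial quotients at all."""
    t = a * b
    m = (t * n_neg_inv) % R       # chosen so that t + m*n = 0 (mod R)
    u = (t + m * n) >> k          # exact division by R
    return u - n if u >= n else u # at most one conditional subtract

def to_mont(x, n, R):
    """The 'extra work at the beginning': convert into Montgomery form."""
    return (x * R) % n
```

Converting back out (the "extra work at the end") is just `mont_mul(x_bar, 1, ...)`, which is why the overhead is marginal for a long private-key exponentiation but noticeable for a single e=3 operation.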
Thus many high-performance crypto libraries use Montgomery with simultaneous multiplication and modular reduction (most hardware crypto engines do that, too).

I often recommend natural arithmetic with simultaneous multiplication and modular reduction. Yes, it is possible, and when done properly, the constant c in the asymptotic cost O(c.k^2) of modular multiplication and reduction ends up the same as in Montgomery with simultaneous multiplication and modular reduction; there are only linearly more multiplications/adds. It does beat Montgomery (by saving on pre/post computations) in at least two cases:
- low fixed exponent (public-key only, e.g. as in checking an RSA signature)
- tight memory
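The natural-arithmetic simultaneous (interleaved) reduction can be sketched like this (illustrative Python, names mine; the reduction is written as a plain `% n` where a real implementation would use an estimated-quotient multiply-and-subtract, occasionally off by one):

```python
# x*y mod n with the reduction interleaved per base-2^w digit of y,
# most-significant digit first.  The accumulator stays near k bits
# instead of growing to the full 2k-bit product.

def modmul_interleaved(x, y, n, w=8):
    digits = (y.bit_length() + w - 1) // w
    acc = 0
    for shift in range(digits * w - w, -1, -w):
        d = (y >> shift) & ((1 << w) - 1)   # next digit of y
        acc = ((acc << w) + x * d) % n      # shift, accumulate, reduce
    return acc
```

No Montgomery-form conversion in or out, which is the saving claimed above for low fixed exponents and tight memory.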
> (Francois, I'm 90% sure it was your post but if I mistook you
> for someone else - sorry).
That most likely was a post of mine.

Francois Grieu
On Jan 31, 11:02 am, Francois Grieu <fgr...@gmail.com> wrote:
> In article <1170230024.658846.177...@v45g2000cwv.googlegroups.com>,
> "tum_" <atoumantsev_s...@mail.ru> wrote: (..)
>
> My two cents worth: (..)
>
> Some advocate the Karatsuba method. Indeed it does pay, slightly,
> for multiplication, above some threshold (k/w > ~50), but:
I'd say that the threshold is very platform-dependent. By using Karatsuba we reduce the number of multiplications but increase the number of additions/subtractions (and probably reads/writes), so the threshold depends on the MUL vs ADD performance ratio of a given system. When I estimated this threshold for the ARM7TDMI, where MUL is pretty fast, I came to the conclusion that Karatsuba starts to pay off at around 1800-1900-bit lengths (k/w = almost 60, with w=32). Whereas when a friend of mine worked on the same task for the Z80 (no hardware MUL at all), he reported a very noticeable speed-up on 256-bit numbers, if not shorter.
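For reference, one recursion level of the method under discussion looks like this (Python sketch; the 50-word threshold is the thread's ballpark figure, not a measured constant):

```python
# Karatsuba: three half-size multiplies replace four, at the cost of
# extra additions/subtractions -- which is exactly why the crossover
# point tracks a platform's MUL/ADD cost ratio.

THRESHOLD_BITS = 50 * 32   # ~50 words of w = 32 (assumed, per thread)

def karatsuba(x, y):
    nbits = max(x.bit_length(), y.bit_length())
    if nbits <= THRESHOLD_BITS:
        return x * y       # below the crossover, schoolbook wins
    h = nbits // 2
    x1, x0 = x >> h, x & ((1 << h) - 1)
    y1, y0 = y >> h, y & ((1 << h) - 1)
    z2 = karatsuba(x1, y1)
    z0 = karatsuba(x0, y0)
    z1 = karatsuba(x0 + x1, y0 + y1) - z2 - z0   # the extra adds live here
    return (z2 << (2 * h)) + (z1 << h) + z0
```

On a machine with no hardware MUL, the "multiplies" being saved are themselves loops of shifts and adds, so the effective MUL/ADD ratio is enormous and the crossover drops, consistent with the Z80 observation.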
> - efficient implementations of Karatsuba multiplication
> for variable size arguments are complex and hard to check;
> simple ones are less efficient. (..)
>
> They can be classified according to at least two criteria:
> - Montgomery (vs natural) arithmetic
> - simultaneous (vs sequential) modular reduction. (..)
>
> I often recommend natural arithmetic with simultaneous
> multiplication and modular reduction. (..)
>
> Francois Grieu
Thanks for keeping up the (interesting) thread.

You reminded me of a problem that puzzled me a while ago and which I failed to solve for myself. Here's my post in sci.crypt from July 2005 that remained unanswered:
http://groups.google.com/group/sci.crypt/browse_thread/thread/2d8157967b3806a/e72cc68404f51033?lnk=gst&q=tum_&rnum=6#e72cc68404f51033

Basically, for the methods that you call 'simultaneous/sequential', Koc in his paper uses the terms 'integrated/separated' correspondingly. At the time I was studying the paper, I came to the conclusion that it is not possible to combine any 'integrated' (simultaneous) method with the optimised squaring algorithm. And if this is true, then we are really losing a lot of performance on exponents such as 65537, where squaring is the dominant operation. On exponent 3 this of course is not an issue.

Hmm, it's just occurred to me that I'm talking about Montgomery arithmetic, not the natural kind that you "advocate" ;) Anyway, do you have experience implementing the simultaneous approach with optimised squaring?

Alexei.
"tum_" <atoumantsev_spam@mail.ru> wrote:

> [..] I remember
> searching for it about 3 years ago and, iirc, there was no _free_
> access to this particular article on the Dobbs site and I didn't
> bother to subscribe and pay at that time.

I agree that a few years ago WWW.DDJ.com did not offer the CDs'
copies of the articles gratis.

P.S. I am sorry for screwing up the References field in news:epn9c7$88f$1@newsserver.cilea.it
In article <1170246700.858450.170720@h3g2000cwc.googlegroups.com>,
 "tum_" <atoumantsev_spam@mail.ru> wrote:

> On Jan 31, 11:02 am, Francois Grieu <fgr...@gmail.com> wrote:
> > Some advocate the Karatsuba method. Indeed it does pay, slightly,
> > for multiplication, above some threshold (k/w > ~50) (..)
>
> I'd say that the threshold is very platform dependent. (..) When I
> estimated this threshold for ARM7TDMI, where MUL is pretty fast, I
> came up to the conclusion that Karatsuba will start to pay off at
> around 1800-1900 bit lengths (k/w = almost 60, where w=32). While
> when a friend of mine worked at the same task for Z80 (no hardware
> mul at all) he reported very noticeable speed up on 256 bit numbers
> if not shorter.
Indeed, the threshold is very platform dependent, for the most part because w (word size at input of the hardware multiplier) is platform independent. And on the Z80 there is no hardware multiplier, so w maybe should be considered about 3 (a compromise between 1, for a 1-bit processor, and 8, for an 8-bit processor with MUL, e.g. the 8051). This brings the above numbers more into balance.
> You reminded me of a problem that puzzled me a while ago and which I
> failed to solve for myself. (..) I came to conclusion that it is not
> possible to combine any 'integrated' (simultaneous) method with the
> optimised squaring algorithm.
My experience and your conclusion are in full agreement.
> And if this is true then we are really losing a lot of performance
> on exponents such as 65537, where squaring is the dominant operation.
The squaring trick saves 1/4 of the elementary multiplications in a mulmod that is actually a sqrmod; about 2/3 to, say, 7/8 of the mulmods are sqrmods (depending on memory constraints / outer exponentiation technique), lowering that potential gain to say 20%. **But** the squaring trick does not save on additions, and on the contrary makes half of them more complex to implement. This considerably lowers the gain (on machines with hardware multiplication). The worst, IMHO, is that the squaring trick collides with simultaneous/integrated modular reduction, which does save considerably on memory accesses and index computations.
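The squaring trick itself, for readers who haven't implemented it, looks like this on limb arrays (illustrative Python, helper names mine): each cross product a[i]*a[j] with i < j is computed once and doubled instead of computed twice, and the doubling is exactly the added addition-side complexity mentioned above.

```python
# Squaring a little-endian base-2^16 limb array.

BASE = 1 << 16

def square(a):
    n = len(a)
    r = [0] * (2 * n)
    for i in range(n):
        t = r[2 * i] + a[i] * a[i]          # diagonal term, once
        r[2 * i] = t % BASE
        carry = t // BASE
        for j in range(i + 1, n):           # off-diagonal terms, doubled
            t = r[i + j] + 2 * a[i] * a[j] + carry
            r[i + j] = t % BASE
            carry = t // BASE
        k = i + n
        while carry:                        # propagate remaining carry
            t = r[k] + carry
            r[k] = t % BASE
            carry = t // BASE
            k += 1
    return r

def to_limbs(x, n):
    return [(x >> (16 * i)) & 0xFFFF for i in range(n)]

def from_limbs(l):
    return sum(v << (16 * i) for i, v in enumerate(l))
```

Note the doubled term 2*a[i]*a[j] no longer fits in a double-width word on real hardware, which is one concrete form of the "half of the additions become more complex" cost.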
> Hmm, it's just occurred to me that I'm talking about Mongomery
> arithmetics, not the natural one that you "advocate" ;)
My discussion applies to both.
> Anyway, do you have the experience of implementing the simultaneous
> approach for optimised squaring?
No. I won't bet that it can't be done, but I fail to see how.

Francois Grieu
In article <fgrieu-46BDB7.14353731012007@news-4.proxad.net>,
I mistakenly wrote:

> w (word size at input of hardware multiplier) is platform
> independent.
That should be platform-dependent, of course.

Francois Grieu
On Jan 31, 1:52 pm, Francois Grieu <fgr...@gmail.com> wrote:
> In article <fgrieu-46BDB7.14353731012...@news-4.proxad.net>,
> I mistakenly wrote:
>
> > w (word size at input of hardware multiplier) is platform
> > independent.
>
> That should be platform-dependent, of course.
>
> Francois Grieu
I hadn't noticed the typo since it was obvious what you meant. I'll probably add some comments a bit later; I need to read your post again more carefully (no time at the moment).
On Jan 31, 1:35 pm, Francois Grieu <fgr...@gmail.com> wrote:
> In article <1170246700.858450.170...@h3g2000cwc.googlegroups.com>,
> "tum_" <atoumantsev_s...@mail.ru> wrote: (..)
>
> Indeed, the threshold is very platform dependent, for the most part
> because w (word size at input of the hardware multiplier) is platform
> independent. And on the Z80 there is no hardware multiplier, so w
> maybe should be considered about 3 (a compromise between 1, for a
> 1-bit processor, and 8, for an 8-bit processor with MUL, e.g. the
> 8051). This brings the above numbers more into balance.
I'm ready to agree that your 'k/w = ~50' can be used as a rough estimate and is probably applicable to many (or most) real-life platforms. But I still see it as governed by the MUL vs ADD instruction performance ratio. Consider, for example, two otherwise identical platforms where the only difference is that on platform 1 the MUL instruction takes 3 cycles, while on platform 2 it takes 20 cycles. Your formula won't work :-)

(In general, it's hard to produce any simple & precise formula, as there are quite a lot of other factors involved - memory speed, for instance. Hence, if you want to be precise, the only way to go is to implement and measure.)
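The dependence on the MUL/ADD ratio can be made quantitative with a toy cost model (all constants here are invented for illustration, not measured): one Karatsuba level turns n^2 word-multiplies into (3/4)n^2 of them, paying roughly c*n extra word-adds, so the crossover n shrinks as MUL gets more expensive relative to ADD.

```python
# Toy crossover estimator.  Schoolbook cost ~ n^2*(M+A);
# one Karatsuba level ~ (3/4)*n^2*(M+A) + c*n*A, with c an assumed
# constant for the extra add/subtract traffic.  Karatsuba wins when
# (1/4)*n^2*(M+A) > c*n*A, i.e. n > 4*c*A/(M+A).

def karatsuba_crossover(mul_cycles, add_cycles, extra_adds_per_word=32):
    """Smallest n (in words) where one Karatsuba level wins,
    under the toy model above."""
    return int(4 * extra_adds_per_word * add_cycles
               / (mul_cycles + add_cycles)) + 1
```

With a 20-cycle MUL the estimated crossover is far lower than with a 3-cycle MUL, matching the ARM7TDMI vs Z80 observations in direction, though any real threshold must be measured.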
> My experience and your conclusion are in full agreement. (..)
>
> The squaring trick saves 1/4 of elementary multiplications in a
> mulmod that is actually a sqrmod (..) **But** the squaring trick
> does not save on additions, and much to the contrary makes half of
> them more complex to implement. (..) The worse IMHO is that the
> squaring trick at least collides with simultaneous/integrated
> modular reduction, which does save considerably on memory accesses
> and index computations.
Still need to think this over a bit more... But the reasoning sounds quite correct.
> > Anyway, do you have the experience of implementing the simultaneous
> > approach for optimised squaring?
>
> No. I won't bet that it can't be done, but fail to see how.
>
> Francois Grieu
PS. I've only recently realised that this discussion is not in sci.crypt, which probably explains the absence of several well-known people I would have expected to comment on the topic...