EmbeddedRelated.com
Forums

RSA on 8bit micro

Started by W Trembyle January 24, 2007
On 31 jan, 22:14, "tum_" <atoumantsev_s...@mail.ru> wrote:
> On Jan 31, 1:35 pm, Francois Grieu <fgr...@gmail.com> wrote: > > "tum_" <atoumantsev_s...@mail.ru> wrote: > > > On Jan 31, 11:02 am, Francois Grieu <fgr...@gmail.com> wrote: > > > > > > > In the context of RSA implementation in software, with > > > > k-bit modulus n, and a w*w->2w-bit hardware multiplication,(..) > > > > some advocate the Karatsuba method. Indeed it does pay, slightly, > > > > for multiplication, above some threshold (k/w > ~50) (..) > > > > > > I'd say that the threshold is very platform dependent. By using > > > Karatsuba we reduce the number of multiplications but increase the > > > number of additions/subtractions (and probably read/writes). So the > > > threshold depends on the MUL vs ADD performance ratio for a given > > > system. When I estimated this threshold for ARM7TDMI, where MUL is > > > pretty fast, I came up to the conclusion that Karatsuba will start to > > > pay off at around 1800-1900 bit lengths (k/w = almost 60, where w=32). > > > While when a friend of mine worked at the same task for Z80 (no > > > hardware mul at all) he reported very noticeable speed up on 256 bit > > > numbers if not shorter. > > > > Indeed, the thresold is very platform dependent, for the most part > > because w (word size at input of hardware multilier) is platform > > dependent. And on the Z80, there is not hardware multiplier, so w > > maybe should be considered about 3 (a compromize between 1, for a > > 1-bit processor, and 8, for an 8-bit processor with MUL, e.g. 8051). > > This brings above numbers more in balance. > > I'm ready to agree that your 'k/w = ~50' can be used as a rough > estimate and is probably applicable to many (or most) real life > platforms. But I still see it as a correllation of MUL vs ADD > instruction performance ratio. > Consider, for example, two absolutely identical platforms where the > only difference is that on platform 1 the MUL instr takes 3 cycles, > while on platform 2 it takes 20 cycles. You formula won't work :-) > (In general, it's hard to produce any simple & precise formula as > there's quite a lot of other factors involved - like memory speed, for > instance. Hence, if you want to be precise, the only way to go is > implement and measure.)
Yes, performance of hardware mult shifts the Karatsuba thresold markedly (fast => thresold up). The other significant parameter is IMHO not addition speed per se (addition is as fast as an instruction can get on most platforms), but time to read/write a word in memory (slow=> thresold up); on low-end embedded platform (the original topic) this is more dependent on the adressing technique used than on anything else. This is why efficient Karatsuba code tend to be hairy: it must go to great length to reduce adressing overhead. This issue of adressing overhead is also why simulataneous/integrated modular reduction works so well.
> PS. I've only recently realised that this discussion is not in the > sci.crypt, which probably explains the absence of several well known > people in this thread I would expect to give their comments on the > topic...
On the other hand, sci.crypt is supposed to be about science, thus I should not tease people about simulataneous/integrated modular reduction in natural arithmetic without explanation or reference... Francois Grieu
On Feb 1, 7:02 am, "Francois Grieu" <fgr...@gmail.com> wrote:
> On 31 jan, 22:14, "tum_" <atoumantsev_s...@mail.ru> wrote: > > > > > On Jan 31, 1:35 pm, Francois Grieu <fgr...@gmail.com> wrote: > > > "tum_" <atoumantsev_s...@mail.ru> wrote: > > > > On Jan 31, 11:02 am, Francois Grieu <fgr...@gmail.com> wrote: >
[snipped to cut the size]
> Yes, performance of hardware mult shifts the Karatsuba thresold > markedly (fast => thresold up). The other significant parameter is > IMHO not addition speed per se (addition is as fast as an instruction
Absolutely, I was talking about 'MUL speed'/'ADD speed', the ratio.
> can get on most platforms), but time to read/write a word in memory > (slow=> thresold up); on low-end embedded platform (the original > topic) this is more dependent on the adressing technique used > than on anything else. This is why efficient Karatsuba code tend to > be hairy: it must go to great length to reduce adressing overhead. > This issue of adressing overhead is also why simulataneous/integrated > modular reduction works so well.
Agreed. What's the least powerful cpu you have had implemented RSA for? Was it some PIC model or was there something even more inferiour?
> > PS. I've only recently realised that this discussion is not in the > > sci.crypt, which probably explains the absence of several well known > > people in this thread I would expect to give their comments on the > > topic... > > On the other hand, sci.crypt is supposed to be about science, thus > I should not tease people about simulataneous/integrated modular > reduction in natural arithmetic without explanation or reference... > > Francois Grieu
I've sent a comment on this to your email ;)
In article <1170331448.896946.93070@m58g2000cwm.googlegroups.com>,
 "tum_" <atoumantsev_spam@mail.ru> wrote:

> What's the least powerful cpu you have had implemented RSA > for? Was it some PIC model or was there something even more > inferiour?
6502, back in 1979. No hardware multiplier. Did multiplication using shift-and-add, with the core loop unrolled and constructed dynamically at runtime according to size of arguments (I now know better, and all my never code is romable). More seriously, I have done RSA on 8-bit micros with 8x8 multiplier, the lowest-end certainly is the 6805. Also some ST7 variant, the 6303 aka 6803 (forgot how it worked), and most importantly 8051 and PIC18 (I can't rank these two nice but dissimilar architectures), 68HC11 (has 16-bit addition, but instructions take so many clocks). Also 680x0, ARM.. Again, some numbers for the public key operation, as used in signature verification. From the spec of our modexp library: Computes a^e mod n, with 0<e<2^32, and modulus n odd of runtime-specified size, up to and including 2048 bit. 8051: execution time 0.47 s for e=3, 1024 bit, @40Mhz Xtall (3.33 MCycles/s) 820 bytes of code 768 bytes of contiguous, aligned XDATA RAM includes 256 bytes result but not inputs 6 other contiguous XDATA bytes 2 DATA bytes (only; was a customer requirement) 1 BDATA byte (bit-adressable) 7 bytes of stack, including return address 68HC11: execution time 0.49 s for e=3, 1024 bit, @20MHz Xtall (5 MHz EClk) 1260 bytes of code 782 bytes of RAM includes 256 bytes result but not inputs 9 bytes of zero-page RAM 9 bytes of stack, including return address Runtime is nearly halved using e=2 (Rabin signature, functionaly equivalent to RSA signature). Francois Grieu
On Feb 1, 1:19 pm, Francois Grieu <fgr...@gmail.com> wrote:
> In article <1170331448.896946.93...@m58g2000cwm.googlegroups.com>, > > "tum_" <atoumantsev_s...@mail.ru> wrote: > > What's the least powerful cpu you have had implemented RSA > > for? Was it some PIC model or was there something even more > > inferiour? > > 6502, back in 1979. No hardware multiplier. Did multiplication
Never dealt with this one but http://en.wikipedia.org/wiki/MOS_6502 gives the idea of what it was. Looks like Z80 is quite superior compared to this one.
> using shift-and-add, with the core loop unrolled and constructed > dynamically at runtime according to size of arguments (I now > know better, and all my never code is romable).
Duff's device? :) I've implemented the unrolled loop for my rsa on ARM using this technique (and was proud to learn afterwards that what I invented myself is called 'Duff's device ;))
> More seriously, I have done RSA on 8-bit micros with 8x8 multiplier, > the lowest-end certainly is the 6805. Also some ST7 variant, > the 6303 aka 6803 (forgot how it worked), and most importantly > 8051 and PIC18 (I can't rank these two nice but dissimilar > architectures), 68HC11 (has 16-bit addition, but instructions > take so many clocks). Also 680x0, ARM.. > > Again, some numbers for the public key operation, as used in > signature verification. From the spec of our modexp library:
These numbers are in your 1st post in this thread. Too bad you don't have Z80 numbers - that would interest me more.
> Computes a^e mod n, with 0<e<2^32, and modulus n odd of > runtime-specified size, up to and including 2048 bit. > > 8051: execution time 0.47 s > for e=3, 1024 bit, @40Mhz Xtall (3.33 MCycles/s) > 820 bytes of code > 768 bytes of contiguous, aligned XDATA RAM > includes 256 bytes result but not inputs > 6 other contiguous XDATA bytes > 2 DATA bytes (only; was a customer requirement) > 1 BDATA byte (bit-adressable) > 7 bytes of stack, including return address > > 68HC11: execution time 0.49 s > for e=3, 1024 bit, @20MHz Xtall (5 MHz EClk) > 1260 bytes of code > 782 bytes of RAM > includes 256 bytes result but not inputs > 9 bytes of zero-page RAM > 9 bytes of stack, including return address > > Runtime is nearly halved using e=2 (Rabin signature, > functionaly equivalent to RSA signature). > > Francois Grieu
> > 8051: execution time 0.47 s > > for e=3, 1024 bit, @40Mhz Xtall (3.33 MCycles/s) > > 820 bytes of code > > 768 bytes of contiguous, aligned XDATA RAM > > includes 256 bytes result but not inputs > > 6 other contiguous XDATA bytes > > 2 DATA bytes (only; was a customer requirement) > > 1 BDATA byte (bit-adressable) > > 7 bytes of stack, including return address > > > 68HC11: execution time 0.49 s > > for e=3, 1024 bit, @20MHz Xtall (5 MHz EClk) > > 1260 bytes of code > > 782 bytes of RAM > > includes 256 bytes result but not inputs > > 9 bytes of zero-page RAM > > 9 bytes of stack, including return address > > > Runtime is nearly halved using e=2 (Rabin signature, > > functionaly equivalent toRSAsignature). > > > Francois Grieu
Nice numbers... It took me a couple of days to port a library and implement RSA on a PIC18. Execution time: 120 second with a 1024 bit key! :) but is free and open source. Take a look here: http://ortegaalfredo.googlepages.com/pic18rsa Regards, Alfred
On Mar 4, 1:37 pm, ortegaalfr...@gmail.com wrote:
> Nice numbers... > It took me a couple of days to port a library and implement RSA on a > PIC18. > Execution time: 120 second with a 1024 bit key! :) but is free and > open source. Take a look here: > > http://ortegaalfredo.googlepages.com/pic18rsa > > Regards, > > Alfred
Nice reference implementation for the original poster of this thread. But the numbers you're giving are a bit misleading: your site quotes '120 sec. for 512 bit' and the readme.txt quotes '240 sec. for 512 bit'. I'm not quite sure which numbers to trust...