RSA on 8bit micro| page 4

Reply by Francois Grieu ●February 1, 20072007-02-01

On 31 jan, 22:14, "tum_" <atoumantsev_s...@mail.ru> wrote:
> On Jan 31, 1:35 pm, Francois Grieu <fgr...@gmail.com> wrote:
> >  "tum_" <atoumantsev_s...@mail.ru> wrote:
> > > On Jan 31, 11:02 am, Francois Grieu <fgr...@gmail.com> wrote:
> > >
> > > > In the context of RSA implementation in software, with
> > > > k-bit modulus n, and a w*w->2w-bit hardware multiplication,(..)
> > > > some advocate the Karatsuba method. Indeed it does pay, slightly,
> > > > for multiplication, above some threshold (k/w > ~50) (..)
> > >
> > > I'd say that the threshold is very platform dependent. By using
> > > Karatsuba we reduce the number of multiplications but increase the
> > > number of additions/subtractions (and probably read/writes). So the
> > > threshold depends on the MUL vs ADD performance ratio for a given
> > > system. When I estimated this threshold for ARM7TDMI, where MUL is
> > > pretty fast, I came up to the conclusion that Karatsuba will start to
> > > pay off at around 1800-1900 bit lengths (k/w = almost 60, where w=32).
> > > While when a friend of mine worked at the same task for Z80 (no
> > > hardware mul at all) he reported very noticeable speed up on 256 bit
> > > numbers if not shorter.
> >
> > Indeed, the thresold is very platform dependent, for the most part
> > because w (word size at input of hardware multilier) is platform
> > dependent. And on the Z80, there is not hardware multiplier, so w
> > maybe should be considered about 3 (a compromize between 1, for a
> > 1-bit processor, and 8, for an 8-bit processor with MUL, e.g. 8051).
> > This brings above numbers more in balance.
>
> I'm ready to agree that your 'k/w = ~50' can be used as a rough
> estimate and is probably applicable to many (or most) real life
> platforms. But I still see it as a correllation of MUL vs ADD
> instruction performance ratio.
> Consider, for example, two absolutely identical platforms where the
> only difference is that on platform 1 the MUL instr takes 3 cycles,
> while on platform 2 it takes 20 cycles. You formula won't work :-)
> (In general, it's hard to produce any simple & precise formula as
> there's quite a lot of other factors involved - like memory speed, for
> instance. Hence, if you want to be precise, the only way to go is
> implement and measure.)

Yes, performance of hardware mult shifts the Karatsuba thresold
markedly (fast => thresold up). The other significant parameter is
IMHO not addition speed per se (addition is as fast as an instruction
can get on most platforms), but time to read/write a word in memory
(slow=> thresold up); on low-end embedded platform (the original
topic) this is more dependent on the adressing technique used
than on anything else. This is why efficient Karatsuba code tend to
be hairy: it must go to great length to reduce adressing overhead.
This issue of adressing overhead is also why simulataneous/integrated
modular reduction works so well.


> PS. I've only recently realised that this discussion is not in the
> sci.crypt, which probably explains the absence of several well known
> people in this thread I would expect to give their comments on the
> topic...

On the other hand, sci.crypt is supposed to be about science, thus
I should not tease people about simulataneous/integrated modular
reduction in natural arithmetic without explanation or reference...


  Francois Grieu

Reply by tum_ ●February 1, 20072007-02-01

On Feb 1, 7:02 am, "Francois Grieu" <fgr...@gmail.com> wrote:
> On 31 jan, 22:14, "tum_" <atoumantsev_s...@mail.ru> wrote:
>
>
>
> > On Jan 31, 1:35 pm, Francois Grieu <fgr...@gmail.com> wrote:
> > >  "tum_" <atoumantsev_s...@mail.ru> wrote:
> > > > On Jan 31, 11:02 am, Francois Grieu <fgr...@gmail.com> wrote:
>
[snipped to cut the size]

> Yes, performance of hardware mult shifts the Karatsuba thresold
> markedly (fast => thresold up). The other significant parameter is
> IMHO not addition speed per se (addition is as fast as an instruction

Absolutely, I was talking about 'MUL speed'/'ADD speed', the ratio.

> can get on most platforms), but time to read/write a word in memory
> (slow=> thresold up); on low-end embedded platform (the original
> topic) this is more dependent on the adressing technique used
> than on anything else. This is why efficient Karatsuba code tend to
> be hairy: it must go to great length to reduce adressing overhead.
> This issue of adressing overhead is also why simulataneous/integrated
> modular reduction works so well.

Agreed. What's the least powerful cpu you have had implemented RSA
for? Was it some PIC model or was there something even more inferiour?

> > PS. I've only recently realised that this discussion is not in the
> > sci.crypt, which probably explains the absence of several well known
> > people in this thread I would expect to give their comments on the
> > topic...
>
> On the other hand, sci.crypt is supposed to be about science, thus
> I should not tease people about simulataneous/integrated modular
> reduction in natural arithmetic without explanation or reference...
>
>   Francois Grieu

I've sent a comment on this to your email ;)

Reply by Francois Grieu ●February 1, 20072007-02-01

In article <1170331448.896946.93070@m58g2000cwm.googlegroups.com>,
 "tum_" <atoumantsev_spam@mail.ru> wrote:

> What's the least powerful cpu you have had implemented RSA
> for? Was it some PIC model or was there something even more
> inferiour?

6502, back in 1979. No hardware multiplier. Did multiplication
using shift-and-add, with the core loop unrolled and constructed
dynamically at runtime according to size of arguments (I now
know better, and all my never code is romable).

More seriously, I have done RSA on 8-bit micros with 8x8 multiplier,
the lowest-end certainly is the 6805. Also some ST7 variant,
the 6303 aka 6803 (forgot how it worked), and most importantly
8051 and PIC18 (I can't rank these two nice but dissimilar
architectures), 68HC11 (has 16-bit addition, but instructions
take so many clocks). Also 680x0, ARM..

Again, some numbers for the public key operation, as used in
signature verification. From the spec of our modexp library:

Computes  a^e mod n, with  0<e<2^32, and modulus  n  odd of
runtime-specified size, up to and including 2048 bit.

8051:  execution time  0.47 s
       for e=3, 1024 bit, @40Mhz Xtall (3.33 MCycles/s)
  820 bytes of code
  768 bytes of contiguous, aligned XDATA RAM
      includes 256 bytes result but not inputs
    6 other contiguous XDATA bytes
    2 DATA bytes (only; was a customer requirement)
    1 BDATA byte (bit-adressable)
    7 bytes of stack, including return address

68HC11:  execution time  0.49 s
         for e=3, 1024 bit, @20MHz Xtall (5 MHz EClk)
 1260 bytes of code
  782 bytes of RAM
      includes 256 bytes result but not inputs
    9 bytes of zero-page RAM
    9 bytes of stack, including return address

Runtime is nearly halved using e=2 (Rabin signature,
functionaly equivalent to RSA signature).

   Francois Grieu

Reply by tum_ ●February 1, 20072007-02-01

On Feb 1, 1:19 pm, Francois Grieu <fgr...@gmail.com> wrote:
> In article <1170331448.896946.93...@m58g2000cwm.googlegroups.com>,
>
>  "tum_" <atoumantsev_s...@mail.ru> wrote:
> > What's the least powerful cpu you have had implemented RSA
> > for? Was it some PIC model or was there something even more
> > inferiour?
>
> 6502, back in 1979. No hardware multiplier. Did multiplication

Never dealt with this one but http://en.wikipedia.org/wiki/MOS_6502
gives the idea of what it was. Looks like Z80 is quite superior
compared to this one.

> using shift-and-add, with the core loop unrolled and constructed
> dynamically at runtime according to size of arguments (I now
> know better, and all my never code is romable).

Duff's device? :) I've implemented the unrolled loop for my rsa on ARM
using this technique (and was proud to learn afterwards that what I
invented myself is called 'Duff's device ;))

> More seriously, I have done RSA on 8-bit micros with 8x8 multiplier,
> the lowest-end certainly is the 6805. Also some ST7 variant,
> the 6303 aka 6803 (forgot how it worked), and most importantly
> 8051 and PIC18 (I can't rank these two nice but dissimilar
> architectures), 68HC11 (has 16-bit addition, but instructions
> take so many clocks). Also 680x0, ARM..
>
> Again, some numbers for the public key operation, as used in
> signature verification. From the spec of our modexp library:

These numbers are in your 1st post in this thread. Too bad you don't
have Z80 numbers - that would interest me more.

> Computes  a^e mod n, with  0<e<2^32, and modulus  n  odd of
> runtime-specified size, up to and including 2048 bit.
>
> 8051:  execution time  0.47 s
>        for e=3, 1024 bit, @40Mhz Xtall (3.33 MCycles/s)
>   820 bytes of code
>   768 bytes of contiguous, aligned XDATA RAM
>       includes 256 bytes result but not inputs
>     6 other contiguous XDATA bytes
>     2 DATA bytes (only; was a customer requirement)
>     1 BDATA byte (bit-adressable)
>     7 bytes of stack, including return address
>
> 68HC11:  execution time  0.49 s
>          for e=3, 1024 bit, @20MHz Xtall (5 MHz EClk)
>  1260 bytes of code
>   782 bytes of RAM
>       includes 256 bytes result but not inputs
>     9 bytes of zero-page RAM
>     9 bytes of stack, including return address
>
> Runtime is nearly halved using e=2 (Rabin signature,
> functionaly equivalent to RSA signature).
>
>    Francois Grieu

Reply by ●March 4, 20072007-03-04

> > 8051:  execution time  0.47 s
> >        for e=3, 1024 bit, @40Mhz Xtall (3.33 MCycles/s)
> >   820 bytes of code
> >   768 bytes of contiguous, aligned XDATA RAM
> >       includes 256 bytes result but not inputs
> >     6 other contiguous XDATA bytes
> >     2 DATA bytes (only; was a customer requirement)
> >     1 BDATA byte (bit-adressable)
> >     7 bytes of stack, including return address
>
> > 68HC11:  execution time  0.49 s
> >          for e=3, 1024 bit, @20MHz Xtall (5 MHz EClk)
> >  1260 bytes of code
> >   782 bytes of RAM
> >       includes 256 bytes result but not inputs
> >     9 bytes of zero-page RAM
> >     9 bytes of stack, including return address
>
> > Runtime is nearly halved using e=2 (Rabin signature,
> > functionaly equivalent toRSAsignature).
>
> >    Francois Grieu

Nice numbers...
It took me a couple of days to port a library and implement RSA on a
PIC18.
Execution time: 120 second with a 1024 bit key! :) but is free and
open source. Take a look here:

http://ortegaalfredo.googlepages.com/pic18rsa

Regards,

Alfred

Reply by tum_ ●March 5, 20072007-03-05

On Mar 4, 1:37 pm, ortegaalfr...@gmail.com wrote:
> Nice numbers...
> It took me a couple of days to port a library and implement RSA on a
> PIC18.
> Execution time: 120 second with a 1024 bit key! :) but is free and
> open source. Take a look here:
>
> http://ortegaalfredo.googlepages.com/pic18rsa
>
> Regards,
>
> Alfred

Nice reference implementation for the original poster of this thread.

But the numbers you're giving are a bit misleading: your site quotes
'120 sec. for 512 bit' and the readme.txt quotes '240 sec. for 512
bit'. I'm not quite sure which numbers to trust...

Previous 2 34Next

RSA on 8bit micro

Sign in

You might also like...

Search forums

Free PDF Downloads

Blogs - Hall of Fame

Discussion Groups

Quick Links

About EmbeddedRelated.com

Social Networks

The Related Media Group