This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).
|
Wow, that MicroBlaze looks impressive. But I wonder how realistic the architecture is and what got "scrapped". 800 LUTs is amazing. I wonder how fast the xr16 would execute in the same type of part.

I'm wondering about the virtue of adding hardware multiply / divide support to a processor. It seems to me that multiply / divide is used so infrequently in general use that having hardware support is not worthwhile. I can easily add a multiply or divide step instruction to increase the performance of software multiplies / divides significantly, but why bother? Increasing the performance of an instruction that accounts for only a fraction of a percent of the total instructions executed seems pointless. So why do other general purpose processors provide hardware multiply / divide support? Is this just marketing? I'm thinking the less hardware there is, the better for performance...
|
|
|
When you're making a chip with 30+ M transistors, you might as well put in multiply and divide. :)

-Tom

> -----Original Message-----
> Sent: Monday, April 09, 2001 10:57 PM
> Subject: [fpga-cpu] Multiplying, MicroBlaze
|
|
|
Veronica Merryfield wrote:
>
> This reasoning is almost exactly along the same lines on which CPUs were
> originally introduced. Jan's work and web site mention these and it is worth
> looking at the Intel history site. Briefly, a collection of logic was made to
> which operation codes were fed, to perform complex logic functions that would
> otherwise have required too much dedicated logic to be economic: a specialised
> CPU. The driver over the years has been towards generalised CPUs
> (RISC/CISC/DSP), but Jan's work is showing that this does not have to be the
> case anymore with advances in silicon technology.

I expect advances will be in custom I/O connected to a generic CPU. Networking and other custom logic on the outside of the CPU is where the action in development is. Still, one has to remember that the FPGA market is small compared to the rest of the computer industry.

Ben.
--
"We do not inherit our time on this planet from our parents... We borrow it from our children."
"Luna family of Octal Computers" http://www.jetnet.ab.ca/users/bfranchuk
|
A collection of bits from came together thus
>Of course, there are applications that use lots of multiplies. For a
>desktop processor, 3D graphics uses huge numbers of multiplies. In
>the microcontroller world, a friend of mine is building an audio mixer
>that digitally mixes several audio channels (with adjustable volumes);
>another very multiply-intensive application.

I would argue that these are not general purpose CPU uses but specific or specialised uses. These examples do use many multiplies and are probably best served by an instruction set that has single cycle multiply and accumulate, some interesting looping modes to suit those applications, and perhaps also number representations that work well. That is exactly the course that DSP manufacturers have taken, and also some FPGA vendors with DSP cores. The argument extends to other specialised fields: if you are targeting a specific function set, then it is wise to craft the instruction set and supporting tools to best match that function set.

As I read Jan's work, he set out to demonstrate that it is possible to implement a CPU in an FPGA and get useful work from it. If you are in the market of making or wanting general purpose CPUs within FPGAs for whatever reasons (Jan's work covers these), then it is more cost effective to buy in the soft core and tools. However, the real win here is that for a given specialised application, for those same reasons (quantity, upgradability etc), one knows that it is feasible to create the instruction set to suit the application, along with tools, and refine these to achieve the results in a very economic and flexible manner.

This reasoning is almost exactly along the same lines on which CPUs were originally introduced. Jan's work and web site mention these and it is worth looking at the Intel history site. Briefly, a collection of logic was made to which operation codes were fed, to perform complex logic functions that would otherwise have required too much dedicated logic to be economic: a specialised CPU. The driver over the years has been towards generalised CPUs (RISC/CISC/DSP), but Jan's work is showing that this does not have to be the case anymore with advances in silicon technology.

In summary, for general CPUs and their spread of uses, I doubt a multiplier is needed. For specialised applications where it would be of benefit, certainly.

Veronica
|
|
|
If you have a good compiler you are right. According to HP, 95% of all multiplies are constant multiplies that can be reduced to a few adds and shifts. (The Java API, for example, multiplies by 37 like crazy in its hashtables.) That is called strength reduction.

If you have a stupid compiler, you end up with 1% or so multiplications in your code. If a multiplication step is something like 5 cycles (shift, add, shift, compare, jump), and everything else needs a single cycle, your runtime on a 32-bit system goes up by 159% (99 + 32*5 cycles).
With a 1 cycle multiplication step it is only an extra 31% (99 + 32 cycles).
A serial multiplier that stops early if the remaining multiplicand is 0 reduces this to about 5% (99 + 6 cycles with an average of 6 bit operands).
A single cycle multiplier will improve this to 0% (99 + 1 cycle).

So: the best thing to have is a good compiler. It will achieve about 105% in most benchmarks without extra hardware. Otherwise a single cycle multiplication step is very worthwhile. A dedicated multiplier only makes sense if you do a lot of arithmetic in your code.

CU, Kolja
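Kolja's overhead figures (159%, 31%, 5%, 0%) can be reproduced with a very small cycle model. The sketch below assumes, as the post does, a run of 100 instructions in which every non-multiply instruction takes one cycle; the function names and parameters are made up for illustration.

```c
#include <assert.h>

/* Total cycles for a 100-instruction run in which `multiplies`
 * instructions are multiplies and every other instruction takes one
 * cycle. cycles_per_multiply is the full cost of one multiply, e.g.
 * 32 steps * 5 cycles for a pure software loop.
 * Hypothetical helper, not taken from the original post. */
int run_cycles(int multiplies, int cycles_per_multiply)
{
    assert(multiplies >= 0 && multiplies <= 100);
    return (100 - multiplies) + multiplies * cycles_per_multiply;
}

/* Extra runtime in percent relative to the ideal 100 cycles. */
int overhead_percent(int multiplies, int cycles_per_multiply)
{
    return run_cycles(multiplies, cycles_per_multiply) - 100;
}
```

Calling overhead_percent(1, 32 * 5) models the 5-cycle software step, and overhead_percent(1, 6) models the early-out serial multiplier with 6-bit operands.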
|
writes:
> I'm wondering about the virtue of adding hardware multiply / divide
> support to a processor. It seems to me that multiply / divide is used
> so infrequently in general use that having hardware support is not
> worthwhile. I can easily add a multiply or divide step instruction to
> increase the performance of software multiply / divides
> significantly, but why bother ? Increasing the performance of an
> instruction that is executed only a fraction of a percentage of the
> total instructions executed seems pointless. So why do other general
> purpose processors provide hardware multiply / divide support ? Is
> this just marketing ? I'm thinking the less hardware there is, the
> better for performance...

Of course, there are applications that use lots of multiplies. For a desktop processor, 3D graphics uses huge numbers of multiplies. In the microcontroller world, a friend of mine is building an audio mixer that digitally mixes several audio channels (with adjustable volumes); another very multiply-intensive application.

It seems like a good idea to spend some effort to make multiplies fast, if you're trying to make a general-purpose processor.

Carl Witty
|
thank you oh so much for this lovely waste of space, bytes, and bandwidth.

please refrain from this when posting to mass groups. some of us are reading via very slow connections, and this is extremely annoying.

> [large ASCII-art signature]
|
|
|
> In summary, for general CPUs and their spread of uses, I doubt a multiplier is
> needed. For specialised applications where it would be of benefit, certainly.
>
> Veronica

Well, that depends on what you call a general CPU. Perhaps a general purpose CPU should be able to perform well in almost every field. Not excellently, but at least well. So if I want to use it for 3D geometry or to process some audio signal, and later on for some word processing, perhaps I will be wasting some of the CPU's power in the latter, but I will be thankful for it in the first two.

By the way, I think a little CPU designed to fit in a low cost FPGA to control some embedded system may well lose its multiplier, but then it IS a specific purpose CPU. (Of course, I'm not criticizing Jan's CPU in this paragraph.)

Salutations,
Mike.

[large ASCII-art signature]
|
Oh, sorry, really. I forgot that stupid signature. I'll be more careful in the future.

Organization: CoC, GaTech
From: Josh Fryman <>
Date sent: Tue, 10 Apr 2001 19:40:50 -0400
Subject: Re: [fpga-cpu] Multiplying, MicroBlaze

> thank you oh so much for this lovely waste of space, bytes, and bandwidth.
>
> please refrain from this when posting to mass groups. some of us are reading
> via very slow connections, and this is extremely annoying.
|
Jan Gray wrote:
> Xilinx has said that the forthcoming "10 M system gate" version of Virtex-II
> will require 500 M transistors.
> Jan Gray, Gray Research LLC

Any guess at the cost of the first one? $10K..$20K comes to mind. A small Forth CPU is about 10,000 gates. That's 1K of them in that mammoth piece of silicon. A lot of power there.

Ben.
|
Kolja Sulimma wrote:
>
> The amount of multiplication is overestimated. The point is that some kernels
> really depend on multiplication, but most applications do not really spend much
> time in them

The one place multiplication is hidden is in indexing variables, like foo[i]. In most cases this is a simple shift, like 1, 2 or 4x, but if foo is an array of structures, like struct foobar foo[k];, you have to have a multiplication.

Ben.
|
> Wow, that MicroBlaze looks impressive. But I wonder how realistic
> the architecture is and what got "scrapped". 800 LUTs is amazing. I
> wonder how fast the xr16 would execute in the same type of part.

The execute stage would go at approximately the same frequency. Control might need to be retimed for the faster interconnect relative to logic. The xr16's 16-bit ISA, however, would probably not get as much work done per cycle as a 32-bit instruction word architecture. Remember, every imm instruction is in some sense a wasted issue slot.

> I'm wondering about the virtue of adding hardware multiply / divide
> support to a processor.

Multiply is the bottleneck in some codes, especially in signal processing. Think sums of weighted inputs; each weighting is a multiplication. One reason that FPGAs are good at DSP is that these weight coefficients are constants, and each multiply by a constant can be strength-reduced into a series of adds of certain taps of the input.

Even so, Xilinx apparently thought variable multiplies are so important that Virtex-II provides a fast 18x18=36-bit hard multiplier at each 18 Kb block RAM site: 4 in a 2V40; 40 in a 2V1000; 144 in a 2V6000.

Also note that it is possible to use a multiplier as a limited barrel shifter. (Barrel shifters are relatively expensive to implement in FPGAs.) It may therefore also be possible to use multipliers to perform operand denormalization and result normalization for floating point addition.

Jan Gray, Gray Research LLC
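Jan's remark about using a multiplier as a limited barrel shifter rests on the identity x << n == x * 2^n. A minimal C sketch of the idea for 16-bit operands follows; the one-hot decode table and the function names are assumptions for illustration, and the right-shift variant (taking the high half of the product) is one common way to extend the trick, not something the post spells out.

```c
#include <assert.h>
#include <stdint.h>

/* One-hot decode of the shift amount: 2^n for n in 0..15.
 * In an FPGA this would be a small decoder feeding one multiplier
 * input (an assumption about the wiring, not from the post). */
static uint32_t one_hot(unsigned n)
{
    static const uint32_t table[16] = {
        1u, 2u, 4u, 8u, 16u, 32u, 64u, 128u,
        256u, 512u, 1024u, 2048u, 4096u, 8192u, 16384u, 32768u
    };
    assert(n < 16);
    return table[n];
}

/* Logical left shift: the low 16 bits of x * 2^n. */
uint16_t shl_via_mul(uint16_t x, unsigned n)
{
    uint32_t product = (uint32_t)x * one_hot(n);
    return (uint16_t)(product & 0xFFFFu);
}

/* Logical right shift: the high 16 bits of x * 2^(16-n), for n in
 * 1..15; a shift by 0 passes x through unchanged. */
uint16_t shr_via_mul(uint16_t x, unsigned n)
{
    uint32_t product;
    assert(n < 16);
    if (n == 0)
        return x;
    product = (uint32_t)x * one_hot(16 - n);
    return (uint16_t)(product >> 16);
}
```

Both operands stay within 16 bits, so the whole product fits in the 36-bit result of an 18x18 multiplier of the kind described above.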
|
Tom Kerrigan wrote > When you're making a chip with 30+ M transistors, you might as well put in > multiply and divide. :) Xilinx has said that the forthcoming "10 M system gate" version of Virtex-II will require 500 M transistors. http://www.eetimes.com/story/OEG20000522S0025 Jan Gray, Gray Research LLC |
|
wrote:
> > In summary, for general CPUs and their spread of uses, I doubt a multiplier is
> > needed. For specialised applications where it would be of benefit, certainly.
> >
> > Veronica
>
> Well, that depends on what you call a general CPU. Perhaps a general
> purpose CPU should be able to perform well in almost every field. Not
> excellently, but at least well. So if I want to use it for 3D geometry or to
> process some audio signal, and later on for some word processing, perhaps I will
> be wasting some of the CPU's power in the latter, but I will be thankful for it
> in the first two.

For the audio processing, strength reduction would do fine, no dedicated multiplier needed there. (See also Jan's later posting)

> By the way, I think a little CPU designed to fit in a low cost FPGA to
> control some embedded system may well lose its multiplier, but then it IS a
> specific purpose CPU. (Of course, I'm not criticizing Jan's CPU in this
> paragraph.)

As I said before, there are some very nice publications by HP on why you usually do not need a multiplier in integer CPUs. If I recall correctly, HP-PA had no integer multiplier up to the PA8500, and I would not call a PA8200 a specific purpose CPU. Also remember that the i386, which is much larger than an xr16, needed something like 17 cycles for a multiplication. The 68020 needed more than 50 cycles, and so on.

The amount of multiplication is overestimated. The point is that some kernels really depend on multiplication, but most applications do not really spend much time in them.

I just did a trace on Jmpg123, an mp3 decoder, and it only has 6% multiplies. Most of these are by constants (windowing function, etc.) that can be removed by strength reduction.
I can redo my previous calculation for this case, and report the relative performance for various implementations:

1000 cycles without multiplication support (32*5 cycle multiplication)
 400 cycles with strength reduction
 290 cycles with repeat instruction and multiply step
 190 cycles with above and strength reduction
 280 cycles with 32 cycle multiplier
 200 cycles with i386 multiplier
 106 cycles with single cycle multiplier

All this assumes 0 wait state memory and no pipeline stalls, both of which would reduce the multiplier's merit.

This means that in mp3 decoding you get a factor of 2.5 by using strength reduction. You get another factor of 2 by adding a multiplication step instruction or a 32 cycle multiplier. The single cycle multiplier gains another factor of 1.8.

The first step is free, the second step is cheap. The third step is very expensive and will also hurt your cycle time. (In Virtex a single cycle multiply is more than 20ns.) Two processors with only steps 1 and 2 implemented are likely to be faster and smaller.

CU, Kolja
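The three speedup factors quoted for mp3 decoding (2.5, about 2, and 1.8) follow directly from the posted cycle counts. A small sketch that recomputes them; the constant and function names are made up for illustration, and the cycle counts are simply the figures given in the post.

```c
#include <assert.h>

/* Cycle counts exactly as posted for the mp3 decoder trace. */
enum {
    CYCLES_SOFTWARE         = 1000, /* no multiplication support     */
    CYCLES_STRENGTH_REDUCED = 400,  /* strength reduction only       */
    CYCLES_STEP_AND_REDUCED = 190,  /* multiply step + reduction     */
    CYCLES_SINGLE_CYCLE     = 106   /* single cycle multiplier       */
};

/* Speedup of `after` relative to `before`, in tenths and rounded to
 * the nearest tenth (25 means 2.5x). Hypothetical helper. */
int speedup_tenths(int before, int after)
{
    assert(after > 0);
    return (10 * before + after / 2) / after;
}
```

For example, speedup_tenths(CYCLES_SOFTWARE, CYCLES_STRENGTH_REDUCED) gives the first factor, and the remaining two steps are obtained the same way from adjacent entries.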
|
|
|
> The one place multiplication is hidden is in indexing variables, like foo[i].
> In most cases this is a simple shift, like 1, 2 or 4x, but if foo is an array
> of structures, like struct foobar foo[k];, you have to have a multiplication.
> Ben.

A structure would still imply only a constant coefficient multiply, which is only a couple of cycles anyway if you use lea and shift instructions. 99% of the 16-bit constant multiplies can be done with 4 additions. Some compilers optionally align structures to powers of 2.

Arrays of arrays are the interesting stuff.

Kolja
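To make the point about structure indexing concrete: the element size is a compile-time constant, so the index multiply reduces to shifts and adds. A minimal sketch, assuming a hypothetical 12-byte struct foobar (the thread never gives its layout) and made-up function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 12-byte element standing in for "struct foobar".
 * Three int32_t fields need no padding on common ABIs, which is an
 * assumption this sketch relies on. */
struct foobar {
    int32_t a, b, c;
};

/* Byte offset of foo[k] without a multiply instruction:
 * 12*k = 8*k + 4*k = (k << 3) + (k << 2). */
size_t foobar_offset(size_t k)
{
    return (k << 3) + (k << 2);
}

/* If the compiler pads the element to 16 bytes (a power of 2),
 * the offset is a single shift. */
size_t padded_offset(size_t k)
{
    return k << 4;
}
```

The first form costs two shifts and one add; padding trades four bytes of memory per element for a single shift.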
|
> This turns 'mulc rd,ra,6' into
>   mov r1,ra
>   mov rd,r0
>   slli r1,1
>   add rd,r1,rd
>   slli r1,1
>   add rd,r1,rd
>
> Much better than calling _mulu2 or whatever. A single shift+add instruction
> would have been nice but probably not worth the extra area (10% as usual).

This would do the same:

  add rd, ra, ra
  add rd, rd, ra
  slli rd, 1

Finding optimum addition chains is NP-complete, but with dynamic programming you can find all chains for 16-bit constant multiplies in a couple of minutes.

If a constant has length n with x bits set to one, your code creates n+x+2 instructions. Here is my small contribution to XSOC: untested code that creates only n + x - 1 instructions for the same task. However, one can write very simple code that uses subtractions and always needs fewer than 1.5*n instructions. (If there are more 1s than 0s, use subtractions instead of additions.)

  case MULC:
    /* mulc rd,ra,k => mov rd,ra || { slli rd,1 || [add rd,rd,ra] }*
     * Scans k from its most significant set bit downward (untested).
     * Assumes rd != ra, because ra must survive until the last add. */
    if (!parse(&p, REG, &rd, ',', REG, &ra, ',', 0) ||
        !constant(&p, &con) || !parse(&p, EOL, 0))
      continue;
    /* multiply by zero: mov rd,r0 */
    if (con == 0) {
      mov(rd.u.reg, 0);
      break;
    }
    /* find the most significant set bit of the 16-bit constant */
    bitvalue = 0x8000;
    while (!(con & bitvalue))
      bitvalue >>= 1;
    /* the leading 1 bit: rd = ra */
    mov(rd.u.reg, ra.u.reg);
    /* each remaining bit: rd = 2*rd, plus ra if the bit is set */
    while ((bitvalue >>= 1) > 0) {
      insn(SLLI, INSN_RD(rd.u.reg) | INSN_I4(1));
      if (con & bitvalue)
        insn(ADD, INSN_RD(rd.u.reg) | INSN_RA(rd.u.reg) | INSN_RB(ra.u.reg));
    }
    break;
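The n + x - 1 instruction count can be checked on a host machine by emulating the sequence rather than emitting it. The sketch below is a stand-alone model, not XSOC assembler code: it assumes the most-significant-bit-first scheme (one mov, then one slli per remaining bit and one add per remaining set bit), and the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Outcome of emulating the generated instruction sequence. */
struct mulc_result {
    uint16_t value;    /* what rd would hold, modulo 2^16 */
    int instructions;  /* number of instructions emitted  */
};

/* Emulate "mov rd,ra", then for each bit of k below its leading 1,
 * scanning downward: "slli rd,1" and, if the bit is set,
 * "add rd,rd,ra". k must be nonzero. */
struct mulc_result mulc_emulate(uint16_t ra, uint16_t k)
{
    struct mulc_result r;
    uint16_t bit = 0x8000u;

    assert(k != 0);
    while (!(k & bit))              /* find the leading 1 bit */
        bit >>= 1;

    r.value = ra;                   /* mov rd,ra */
    r.instructions = 1;
    while ((bit >>= 1) != 0) {
        r.value = (uint16_t)(r.value << 1);      /* slli rd,1 */
        r.instructions++;
        if (k & bit) {
            r.value = (uint16_t)(r.value + ra);  /* add rd,rd,ra */
            r.instructions++;
        }
    }
    return r;
}
```

For k = 6 (n = 3, x = 2) this yields 4 instructions, matching n + x - 1; the hand-written add/add/slli sequence in the post is shorter still at 3, which is the kind of gain a search over addition chains can find.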
|
Kolja Sulimma <> writes:
> For the audio processing, strength reduction would do fine, no dedicated
> multiplier needed there.
> (See also Jan's later posting)

I don't understand that. Surely for something like a mixer, you need multiplication? (Unless you do something exotic like dynamic recompilation, which seems a little heavyweight for an embedded system.)

Carl Witty
|
wrote:
> Kolja Sulimma <> writes:
>
> > For the audio processing, strength reduction would do fine, no dedicated
> > multiplier needed there.
> > (See also Jan's later posting)
>
> I don't understand that. Surely for something like a mixer, you need
> multiplication? (Unless you do something exotic like dynamic
> recompilation, which seems a little heavyweight for an embedded
> system.)

A mixer is only one multiplication per channel per sample. That's less than 100k multiplies per second for two stereo channels.

But the equalizer, reverb, etc. can do without. They usually only need one variable multiplier each for the gain.

Of course, if this stuff really starts to dominate your system, you should add some DSP features to your CPU.

CU, Kolja
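For reference, "one multiplication per channel per sample" looks like this in a fixed-point mixer. This is only a sketch: the Q15 gain format (32768 meaning 1.0), the saturation to 16 bits, and the function name are assumptions, since the thread does not describe the friend's actual design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mix one output sample from n_channels input samples.
 * gains_q15 holds per-channel volumes in Q15 fixed point
 * (32768 == 1.0), an assumed representation. The loop performs
 * exactly one multiply per channel; the sum is scaled back and
 * saturated to the 16-bit sample range. */
int16_t mix_sample(const int16_t *samples, const uint16_t *gains_q15,
                   size_t n_channels)
{
    int64_t acc = 0;
    size_t ch;

    for (ch = 0; ch < n_channels; ch++)
        acc += (int32_t)samples[ch] * (int32_t)gains_q15[ch];

    acc /= 32768;                   /* undo the Q15 scaling */
    if (acc > INT16_MAX)
        acc = INT16_MAX;
    if (acc < INT16_MIN)
        acc = INT16_MIN;
    return (int16_t)acc;
}
```

At a 44.1 kHz sample rate, two channels come to 88,200 multiplies per second, which is consistent with the "less than 100k" estimate.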