I know most processors have an instruction for division and certainly Forth includes division words. But how are they implemented? I am asking because I need to do a division in hardware in an FPGA. Are there processors that don't include a division instruction? Has anyone had to implement division using something other than a native CPU division instruction? I worked on an attached array processor many years ago (a floating point coprocessor in a rack cabinet) and it did division using Newton-Raphson iteration I believe. It converged quickly and took some 10 or 12 cycles to complete. Considering the multiply needed to carry it out took more than one clock cycle that's not bad. Are there better techniques for division? I guess there is long division, "shift and subtract" as compared to shift and add for multiplication. Fortunately multipliers are still common. -- Rick C. - Get 1,000 miles of free Supercharging - Tesla referral code - https://ts.la/richard11209

# Division

Started by ●August 25, 2020

Reply by ●August 25, 20202020-08-25

On 8/25/20 1:10 PM, Rick C wrote:> I know most processors have an instruction for division and certainly Forth includes division words. But how are they implemented? I am asking because I need to do a division in hardware in an FPGA. > > Are there processors that don't include a division instruction? Has anyone had to implement division using something other than a native CPU division instruction? > > I worked on an attached array processor many years ago (a floating point coprocessor in a rack cabinet) and it did division using Newton-Raphson iteration I believe. It converged quickly and took some 10 or 12 cycles to complete. Considering the multiply needed to carry it out took more than one clock cycle that's not bad. > > Are there better techniques for division? I guess there is long division, "shift and subtract" as compared to shift and add for multiplication. Fortunately multipliers are still common. >You might want to look at https://en.wikipedia.org/wiki/Division_algorithm Besides the simple long division you are describing, this also describes a couple other techniques, like SRT that can do multiple bits at once, and Newton-Raphson that works well for big numbers as each loop doubles the number of bits you have (but each step takes 2 multiplies)

Reply by ●August 25, 20202020-08-25

On 25/08/2020 19:10, Rick C wrote:> I know most processors have an instruction for division and certainly > Forth includes division words. But how are they implemented? I am > asking because I need to do a division in hardware in an FPGA. > > Are there processors that don't include a division instruction? Has > anyone had to implement division using something other than a native > CPU division instruction? >There are /lots/ of processors that don't have division instructions. It was pretty much standard practice for small microcontrollers to avoid hardware division instructions until the Cortex M3 became the standard. (Cortex M0, used for the cheapest and lowest power ARM microcontrollers, and Cortex M1 for soft cores in FGPAs, do not have hardware division.) Hardware division instructions are expensive in die space and relatively rarely used in software - they are simply not worth the cost for many microcontroller systems. Also, hardware division is almost invariably multi-cycle, and the division execution unit does not lend itself to pipelining, which does not fit well in a RISC processor when you are trying to get single cycle for a simple and predictable core. Some RISC designs skip division entirely, others have a "division step" instruction that is simple enough to be acceptable in small, low-power hardware, but speeds up software division routines. Even on larger devices, division is a pain - requiring multiple cycles and without pipelining, a division is often an order of magnitude slower in throughput and latency than a multiplication, so compilers do their best to avoid them (preferring to multiply by the reciprocal where possible). If my history is correct, the Motorola 68k family had hardware division instructions until someone realised (on the 68020, I think) that pure software division routines were /faster/ than the hardware division instructions. (Very small microcontrollers don't even have multiply instructions, but these are getting rare now.)> I worked on an attached array processor many years ago (a floating > point coprocessor in a rack cabinet) and it did division using > Newton-Raphson iteration I believe. It converged quickly and took > some 10 or 12 cycles to complete. Considering the multiply needed to > carry it out took more than one clock cycle that's not bad. > > Are there better techniques for division? I guess there is long > division, "shift and subtract" as compared to shift and add for > multiplication. Fortunately multipliers are still common. >"Better" depends on your requirements. Do you need to minimise hardware size, maximize throughput, minimise latency? Do you need exact results? (Not every use-case does.) Will the denominator be re-used for multiple divisions, or even constant? What are your sizes? As Richard says, the Wikipedia article can give you some suggestions (in addition to the ones you mentioned here). But if you can take advantage of additional information about the situation, you might be able to do better than standard general-purpose algorithms.

Reply by ●August 25, 20202020-08-25

In article <44d87377-6730-4e48-90ef-f3ac49d2b0e5o@googlegroups.com>, Rick C <gnuarm.deletethisbit@gmail.com> wrote:>I know most processors have an instruction for division and certainly Forth includes division words. But how are they implemented? I am asking because I need to do a division in hardware >in an FPGA. > >Are there processors that don't include a division instruction? Has anyone had to implement division using something other than a native CPU division instruction? > >I worked on an attached array processor many years ago (a floating point coprocessor in a rack cabinet) and it did division using Newton-Raphson iteration I believe. It converged quickly >and took some 10 or 12 cycles to complete. Considering the multiply needed to carry it out took more than one clock cycle that's not bad. > >Are there better techniques for division? I guess there is long division, "shift and subtract" as compared to shift and add for multiplication. Fortunately multipliers are still common. >Rick, entirely dependent on your requirements. For < ~100 MHz, up to about ~11 bit divisors, just brute force it with long division. Yes, in a single clock cycle. If you need faster, then pipeline it. For > 11 bit divisors I'd start getting more exotic. Thinking ROMs to hold look up tables of inverse values. Then on to even more exotic Newton-Raphson (perhaps combined with the Look up ROMs). I've only ever needed to just brute force it (with pipelining). Of course if you're doing this in a smaller FPGA, then you may need to consider bit-serial like sequential algorithms that take longer to compute each division, at the expense of more complicated hardware and performance. Regard, Mark

Reply by ●August 26, 20202020-08-26

On Tuesday, August 25, 2020 at 7:31:45 PM UTC-4, gtwrek wrote:> In article <44d87377-6730-4e48-90ef-f3ac49d2b0e5o@googlegroups.com>, > Rick C <gnuarm.deletethisbit@gmail.com> wrote: > >I know most processors have an instruction for division and certainly Forth includes division words. But how are they implemented? I am asking because I need to do a division in hardware > >in an FPGA. > > > >Are there processors that don't include a division instruction? Has anyone had to implement division using something other than a native CPU division instruction? > > > >I worked on an attached array processor many years ago (a floating point coprocessor in a rack cabinet) and it did division using Newton-Raphson iteration I believe. It converged quickly > >and took some 10 or 12 cycles to complete. Considering the multiply needed to carry it out took more than one clock cycle that's not bad. > > > >Are there better techniques for division? I guess there is long division, "shift and subtract" as compared to shift and add for multiplication. Fortunately multipliers are still common. > > > > Rick, entirely dependent on your requirements. > > For < ~100 MHz, up to about ~11 bit divisors, just brute force it with > long division. Yes, in a single clock cycle. If you need faster, then > pipeline it. > > For > 11 bit divisors I'd start getting more exotic. Thinking ROMs to > hold look up tables of inverse values. Then on to even more exotic > Newton-Raphson (perhaps combined with the Look up ROMs). > > I've only ever needed to just brute force it (with pipelining). > > Of course if you're doing this in a smaller FPGA, then you may need to > consider bit-serial like sequential algorithms that take longer to > compute each division, at the expense of more complicated hardware and > performance.I looked at a bit serial multiplier once and it didn't save much if anything over an 8 bit sequential (shift and add) multiplier. The control logic is much more complex for bit serial. I plan to do this in a small device, but don't have a feel yet for what needs to go in it. The part does have 16 bit hardware fast multipliers and the data rate is 100,000 times slower, so why not use the available resources. I guess I was really asking if there were better algorithms that work like Newton-Raphson but converge more quickly or something. From reading about Newton-Raphson if it is seeded with a small table lookup, it will converge quickly, but since time is in much greater supply than memory, it won't really matter much to start with a fixed value for every computation. In fact, it would be perfectly acceptable to calculate worse case numbers for iterations required and simply loop that many times regardless. Once the error is zero the result stops changing. This is also simplified by the fact that this is a way to match the ADC input range to the sensor output which are based on separate power supplies. It's not necessary to use an accurate reference on the ADC if that same ADC is used to measure the sensor power supply which provides the basis for the gain setting in the sensor output. So the divisor will be in a narrow range. Maybe a table lookup for that narrow range is the best way. With a 12 bit input even 200 entries will cover 10% (±5%). Darn, I keep simplifying away the fun stuff. -- Rick C. + Get 1,000 miles of free Supercharging + Tesla referral code - https://ts.la/richard11209

Reply by ●August 26, 20202020-08-26

In article <5c019f7b-4641-4e5b-99d8-9630936f564eo@googlegroups.com>, Rick C <gnuarm.deletethisbit@gmail.com> wrote:>So the divisor will be in a narrow range. Maybe a table lookup for that narrow range is the best way. With a 12 bit input even 200 entries will cover 10% (±5%). > >Darn, I keep simplifying away the fun stuff.I hear you. Sometimes these "do elementary school maths" with the technology at hand, under these sets of conditions, are very satisfying small side projects. Easy concepts, but sometimes tricky conditions to work under. This may frustrate some folks but I often find this sort of tasks quite satisfying... Regards, Mark

Reply by ●August 26, 20202020-08-26

On 8/25/2020 10:10 AM, Rick C wrote:> I know most processors have an instruction for division and certainly Forth includes division words. But how are they implemented? I am asking because I need to do a division in hardware in an FPGA. > > Are there processors that don't include a division instruction? Has anyone had to implement division using something other than a native CPU division instruction? > > I worked on an attached array processor many years ago (a floating point coprocessor in a rack cabinet) and it did division using Newton-Raphson iteration I believe. It converged quickly and took some 10 or 12 cycles to complete. Considering the multiply needed to carry it out took more than one clock cycle that's not bad. > > Are there better techniques for division? I guess there is long division, "shift and subtract" as compared to shift and add for multiplication. Fortunately multipliers are still common.Recalling an FPGA project I did about 10 years ago, I was able to implement FPGA processing blocks that could be integrated into the embedded core and present itself as a custom instruction. It worked very well for me. Fewer than 10 clocks would produce a single precision floating point result. The requirement was to compute std dev in less than 500 us. Emulated floating point divide and sqrt were time consuming operations at the time. Details: It was an Altera Cyclone III FPGA. I was using the Quartus tool set. I embedded a NIOS II core. Quartus was a huge tool featuring many different ways to get something done. The choice I made (custom instruction plus various Quartus supplied floating point math blocks) was not obvious but I eventually stumbled across my solution combo and used it successfully. JJS