Dear all, I believe I have invented a formula for quotient calculation that works for x/63 and x/127. The formula is as follows and it is divisionless and multiplierless: y = (((x>>n)+x+((1<<n)+1))>>n)-1; Use n=6 for 63, and n=7 for 127. 1<<n is the strength-reduced form for calculating 2^n. It does not require any double-word arithmetic (like with multiplying by multiplicative inverse and then performing adjustment steps). It does not seem to work for other dividers, but a systematic fix may be possible. Are there any known references on this? Best regards Nikolaos Kavvadias http://www.nkavvadias.com http://github.com/nkkav/
Algorithm for x/63 and x/127
Started by ●October 11, 2014
Reply by ●October 11, 20142014-10-11
Nikolaos Kavvadias <nikolaos.kavvadias@gmail.com> wrote:> I believe I have invented a formula for quotient calculation > that works for x/63 and x/127.> The formula is as follows and it is divisionless and multiplierless:> y = (((x>>n)+x+((1<<n)+1))>>n)-1;> Use n=6 for 63, and n=7 for 127. 1<<n is the strength-reduced > form for calculating 2^n.You can do integer division by multiplying by the appropriate multiple of the reciprocal and truncating. This is easiest to do with a multiply that generates a double sized product in two registers. The reciprocal needs, in some cases, one more significant bit that the word size. 1/63 is, in binary, 0.00000100000100001 ...> It does not require any double-word arithmetic (like with > multiplying by multiplicative inverse and then performing > adjustment steps). It does not seem to work for other dividers, > but a systematic fix may be possible.For big enough values, it won't be right. Those may be bigger than you need. For n=6, it seems to fail at 4095, and for n=7 and 16383. Traditionally, integer division truncates. I am not sure about the 1 and 1<<n in your expression. It looks to me like those will, in some cases, not truncate.> Are there any known references on this?-- glen
Reply by ●October 11, 20142014-10-11
Dear Glen,
your observations are correct! I am trying to avoid any multiplication including the one with the multiplicative inverse (the reciprocal of the constant).
I have updated my claims for the algorithm. It works for any constant divisor (but I've only tested n=2,3,4,5,6,7,8, and given the following assumptions:
Also, the range for x is [0,2^(2*n)-2], i.e. for n = 6, the range indeed is
0:4094.
And a test program:
[code]
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
/* main:
*/
int main(void)
{
int qapprox, qexact;
int i, j, k;
for (i = 2; i < 8; i++) {
for (j = 1; j < (1<<i)*(1<<i)-1; j++) {
k = (1<<i) - 1;
qapprox = (((j>>i)+j+((1<<i)+1))>>i)-1;
qexact = j / k;
if (qapprox != qexact) {
fprintf(stderr, "qapprox = (%d/%d) = %d\tqexact = (%d/%d) = %d\n",
j, k, qapprox, j, k, qexact);
}
}
}
return 0;
}
[/code]
Best regards
Nikolaos Kavvadias
http://www.nkavvadias.com
Reply by ●October 11, 20142014-10-11
On Sat, 11 Oct 2014 00:43:30 -0700, Nikolaos Kavvadias wrote:> Dear all, > > I believe I have invented a formula for quotient calculation that works > for x/63 and x/127. > > The formula is as follows and it is divisionless and multiplierless: > > y = (((x>>n)+x+((1<<n)+1))>>n)-1; > > Use n=6 for 63, and n=7 for 127. 1<<n is the strength-reduced form for > calculating 2^n. > > It does not require any double-word arithmetic (like with multiplying by > multiplicative inverse and then performing adjustment steps). It does > not seem to work for other dividers, but a systematic fix may be > possible. > > Are there any known references on this? > > Best regards Nikolaos Kavvadias http://www.nkavvadias.com > http://github.com/nkkav/Without working through the math, I think that you're using a special case of the more general approximation for the reciprocal of x: 1/x ~ 1/x0 - (x0 - x)/x0^2 Do the math, and I think you'll find that embedded in the expression above. It's just the first two terms of the Taylor's expansion of 1/x, and works without multiplication because x/64 = x >> 6 (ignoring rounding) and because (64 - 63) = 1. -- www.wescottdesign.com
Reply by ●October 11, 20142014-10-11
Hi Tim,> > The formula is as follows and it is divisionless and multiplierless: > > y = (((x>>n)+x+(1<<n)+1)>>n)-1;> Without working through the math, I think that you're using a special > case of the more general approximation for the reciprocal of x: > 1/x ~ 1/x0 - (x0 - x)/x0^2Great observation! Indeed, I get the asymptotic behavior that would be typical for a Taylor expansion; when away from x0, the errors start to escalate since the approximation diverges. Still it would be interesting to see if there is any known use of this hack, beforehand my claim. It's dated May 16, 2011 in my notes, but no other details survive, apart from remembering that I used sketching. Best regards Nikolaos Kavvadias http://www.nkavvadias.com> Do the math, and I think you'll find that embedded in the expression > above. It's just the first two terms of the Taylor's expansion of 1/x, > and works without multiplication because x/64 = x >> 6 (ignoring > rounding) and because (64 - 63) = 1. > -- > www.wescottdesign.com
Reply by ●October 11, 20142014-10-11
> >> Do the math, and I think you'll find that embedded in the expression >> above. It's just the first two terms of the Taylor's expansion of 1/x, >> and works without multiplication because x/64 = x >> 6 (ignoring >> rounding) and because (64 - 63) = 1. >> -- >> www.wescottdesign.com On 11.10.14 20:48, Nikolaos Kavvadias wrote:> Hi Tim, > >>> The formula is as follows and it is divisionless and multiplierless: >>> y = (((x>>n)+x+(1<<n)+1)>>n)-1; > >> Without working through the math, I think that you're using a special >> case of the more general approximation for the reciprocal of x: >> 1/x ~ 1/x0 - (x0 - x)/x0^2 > > Great observation! > > Indeed, I get the asymptotic behavior that would be typical for a Taylor expansion; when away from x0, the errors start to escalate since the approximation diverges. > > Still it would be interesting to see if there is any known use of this hack, beforehand my claim. It's dated May 16, 2011 in my notes, but no other details survive, apart from remembering that I used sketching. > > Best regards > Nikolaos KavvadiasI used a similar method in a floating-point math package I wrote in the mid-70's. The difference is that, to get acceptable accuracy, I had to use the second-order term too. -- Tauno Voipio
Reply by ●October 11, 20142014-10-11
Am 11.10.2014 um 19:48 schrieb Nikolaos Kavvadias:> Still it would be interesting to see if there is any known use of > this hack, beforehand my claim. It's dated May 16, 2011 in my notes, > but no other details survive, apart from remembering that I used > sketching.Sorry to burst your bubble, but that optimization technique has been around sind effectively forever; it may well have been in the world since before you were born. A quick search reveals papers at least as far back as 1976. GCC has had an automatic peep-hole optimizer for stuff like this since about 1994.
Reply by ●October 11, 20142014-10-11
> Sorry to burst your bubble, but that optimization technique has been > around sind effectively forever;Bubble == totally burst. However I still don't have a concrete example that the specific expression has been in use. Any specific reference is much appreciated!> it may well have been in the world > since before you were born. A quick search reveals papers at least as > far back as 1976.Since I was born in 1977, this is quite plausible :)> GCC has had an automatic peep-hole optimizer for stuff like this since > about 1994.Most probably this is based on Torbjorn's work. I understand that the GCC peephole optimizer works at the RTL; I don't know if there is such GIMPLE pass. These optimizers are low-level essentially; one might want to apply the technique even as a source-level optimization: http://github.com/nkkav/kdiv/ Sections 9.7 and 9.8: http://www.nkavvadias.com/doc/hlo/hlo-README.html Eventually it is always possible to derive both a divisionless and multiplierless hack; use HD's magic number (multiplicative inverse) algorithm and then optimize the constant multiplication with the fixed-point magic number. It is great in its genericity, will work for any integer constant divisor, but this formula will perform better for divisors for the form 2^n-1, if you are only interested in small ranges of x. Best regards Nikolaos Kavvadias http://www.nkavvadias.com
Reply by ●October 11, 20142014-10-11
Am 11.10.2014 um 21:51 schrieb Nikolaos Kavvadias:> Bubble == totally burst. However I still don't have a concrete > example that the specific expression has been in use. Any specific > reference is much appreciated!>> GCC has had an automatic peep-hole optimizer for stuff like this >> since about 1994. > > Most probably this is based on Torbjorn's work.Not just based on. The code in GCC _is_ Torbjoern Granlund's work. He wrote a paper about what he did, and got the code included into GCC. It doesn't generally result in exactly the instruction sequence your example in the OP layed out. But then, that particular sequence wouldn't actually be faster than a straight multiply by 65 on platforms with an at least somewhat decent HW multiplier. And it works only for a rather limited range of inputs, which makes it lose a lot of its appeal.
Reply by ●October 11, 20142014-10-11
Hi,> >> GCC has had an automatic peep-hole optimizer for stuff like this > >> since about 1994. > > Most probably this is based on Torbjorn's work. > > Not just based on. The code in GCC _is_ Torbjoern Granlund's work. He > wrote a paper about what he did, and got the code included into GCC.Indeed, it is his work incl. the GCC implementation and it is true that both design and implementation have been described in his 1994 paper. Good catch on the umlaut; sometimes I don't transliterate \"{o} as oe but as o.> It doesn't generally result in exactly the instruction sequence your > example in the OP layed out. But then, that particular sequence > wouldn't actually be faster than a straight multiply by 65 on platforms > with an at least somewhat decent HW multiplier. And it works only for a > rather limited range of inputs, which makes it lose a lot of its appeal.I think the proposed expression can build very fast hardware given that you have the freedom to design it, e.g. on an ASIC or FPGA. The critical path is 1 ADD (scaled addition of x with itself) followed by 1 constant ADD by some wiring and 1 decrement. This all looks like a customized addition structure that if optimized will have the critical path of a single adder plus two stages. Essentially it is four-operand addition with two variable and two constant inputs. The range of inputs is indeed limited, outside these limits it is just an approximation (and not a good one). Best regards Nikolaos Kavvadias http://www.nkavvadias.com







