For another ongoing project, I needed a 64-bit x 64-bit to 128-bit multiply, and I needed it in Python. When doing numerical work in Python, I always use NumPy, which is a really awesome library. Unfortunately, NumPy is heavily biased towards hardware-native types, which makes sense for efficiency but did not help me here.
I decided to write my own NumPy-based 64x64->128 routines.
Some simple maths
The problem with a 64x64 multiply using 64-bit types is that it will obviously overflow. To avoid overflowing, we need to decompose the multiplication into several operations on 64-bit types (the biggest native types we can use) that cannot overflow. The trick is to multiply and add 32-bit integers using 64-bit integers: that way we know there will be no overflow.
Here is how to decompose a 64x64 multiplication into 32-bit operations:
a * b
= ahi|alo * bhi|blo
= (ahi*2^n + alo) * (bhi*2^n + blo)
= (ahi*bhi)*2^(2n) + (ahi*blo)*2^n + (alo*bhi)*2^n + alo*blo
= (ahi*bhi)*2^(2n) + (ahi*blo + alo*bhi)*2^n + alo*blo
With:
- a: the 1st 64-bit integer
- ahi: its 32 higher-order bits
- alo: its 32 lower-order bits
- b: the 2nd 64-bit integer
- bhi: its 32 higher-order bits
- blo: its 32 lower-order bits
- n: the bit size of the smaller type; in my case, n = 32
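The decomposition above can be sanity-checked with plain Python integers, which are arbitrary precision (the operand values below are arbitrary examples):

```python
# Check the 64x64 decomposition with plain Python integers.
n = 32
mask = (1 << n) - 1  # selects the low 32 bits

# Arbitrary example operands (any 64-bit values work).
a = 0x123456789ABCDEF0
b = 0x0FEDCBA987654321

ahi, alo = a >> n, a & mask  # split a into 32-bit halves
bhi, blo = b >> n, b & mask  # split b into 32-bit halves

# (ahi*bhi)*2^(2n) + (ahi*blo + alo*bhi)*2^n + alo*blo
recomposed = (ahi*bhi << 2*n) + ((ahi*blo + alo*bhi) << n) + alo*blo
assert recomposed == a * b
```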
That's it? Unfortunately no: the terms ahi*bhi, ahi*blo, alo*bhi and alo*blo won't overflow (32x32->64), but the term ahi*blo + alo*bhi can (and will) overflow. We have to decompose it again:
ahi*blo + alo*bhi
= mid1 + mid2
= mid1hi|mid1lo + mid2hi|mid2lo
= (mid1hi*2^n + mid1lo) + (mid2hi*2^n + mid2lo)
= (mid1hi+mid2hi)*2^n + mid1lo + mid2lo
Looks good.
The last issue to deal with is that this middle term ahi*blo + alo*bhi is a 64-bit value shifted by 32 bits: its 32 low-order bits must be added to the last term alo*blo, whereas its 32 high-order bits must be added to the 1st term ahi*bhi.
NumPy implementation
The NumPy implementation just implements what we saw here. As a side note, I think the fact that you can use NumPy arrays or plain integers interchangeably is a testament to the great interface design of NumPy.
The actual implementation looks a little bit obfuscated, because of optimizations to avoid too many array copies. You can compare it to the more straightforward (and slower) version.
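Since the links to the original code are not reproduced here, below is a minimal sketch of what the straightforward version could look like, following the derivation above (not the author's actual code; mul_64_128 is a hypothetical name). It works on np.uint64 scalars and arrays alike; the constants are kept as np.uint64 to avoid NumPy's mixed-type promotion:

```python
import numpy as np

def mul_64_128(a, b):
    """Full 64x64 -> 128-bit multiply; returns the (high, low) uint64 halves."""
    mask = np.uint64(0xFFFFFFFF)
    n = np.uint64(32)

    # Split each operand into 32-bit halves, stored in uint64 so that
    # 32x32 products and their sums cannot overflow.
    ahi, alo = a >> n, a & mask
    bhi, blo = b >> n, b & mask

    # The four partial 32x32 -> 64 products.
    hi = ahi * bhi
    mid1 = ahi * blo
    mid2 = alo * bhi
    lo = alo * blo

    # Sum the low halves of the middle terms with the high half of the
    # low product; the bits above 32 are the carry into the high word.
    t = (mid1 & mask) + (mid2 & mask) + (lo >> n)
    low = (t << n) | (lo & mask)  # the shift wraps, keeping t's low 32 bits

    # High word: high product, high halves of the middle terms, plus carry.
    high = hi + (mid1 >> n) + (mid2 >> n) + (t >> n)
    return high, low

# Example: (2^64-1)^2 -> high = 0xFFFFFFFFFFFFFFFE, low = 0x1
high, low = mul_64_128(np.uint64(0xFFFFFFFFFFFFFFFF),
                       np.uint64(0xFFFFFFFFFFFFFFFF))
```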
But why?
OK, I could just have used NumPy with array(dtype='object') and the nice built-in, arbitrary-sized Python integer type.
This is true.
On the other hand, this is interesting, and this is faster. And in the case of my other project, I can optimize further because I do not need the whole 128-bit result.
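For instance, when only the high 64 bits of the product are needed (a hypothetical sketch; the other project's actual requirements are not shown here), the assembly of the low word can be dropped entirely, since only the carry out of the middle-term sum is kept:

```python
import numpy as np

def mul_64_hi(a, b):
    """High 64 bits of a 64x64-bit product; same decomposition as the
    full version, minus the low-word assembly."""
    mask = np.uint64(0xFFFFFFFF)
    n = np.uint64(32)
    ahi, alo = a >> n, a & mask
    bhi, blo = b >> n, b & mask
    mid1 = ahi * blo
    mid2 = alo * bhi
    # Only the carry out of the middle-term sum matters for the high word.
    carry = ((mid1 & mask) + (mid2 & mask) + ((alo * blo) >> n)) >> n
    return ahi * bhi + (mid1 >> n) + (mid2 >> n) + carry
```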