For another on-going project, I needed to do 64-bits x 64-bits to 128-bits multiply, and I needed it in Python. When doing numerical work with Python, I always use NumPy which is a really awesome library. Unfortunately NumPy is heavily biased towards hardware native types, which makes sense for efficiency but did not helped me.
I decided to write my own NumPy-based 64x64->128 routines.
Some simple maths
The problem with 64x64 using 64-bits types is that it will obviously overflow. To avoid overflowing, we need to decompose the multiply operation into several operations applying to 64-bits types (which are the bigger native types we can use) without overflow. The trick is to multiply and add 32-bits integers using 64-bits integers: that way we know there will be no overflow.
Here is how to decompose a 64x64 multiplication into 32-bits operations:
a * b = ahi|alo * bhi|blo = (ahi*2^n + alo) * (bhi*2^n + blo) = (ahi*bhi)*2^(2n) + (ahi*blo)*2^n + (alo*bhi)*2^n + alo*blo = (ahi*bhi)*2^(2n) + (ahi*blo + alo*bhi)*2^n + alo*blo
athe 1st 64-bits integer,
ahiits 32 higher order bits,
aloits 32 lower order bits
bthe 2nd 64-bits integer,
bhiits 32 higher order bits,
bloits 32 lower order bits
nthe bit size of the smaller type. In my case n = 32
That's it? Unfortunately no: the terms
alo*blo won't overflow (32x32->64), but the term
ahi*blo + alo*bhi can (and will) overflow. We have to decompose it again:
ahi*blo + alo*bhi = mid1 + mid2 = mid1hi|mid1lo + mid2hi|mid2lo = (mid1hi*2^n + mid1lo) + (mid2hi*2^n + mid2lo) = (mid1hi+mid2hi)*2^n + mid1lo + mid2lo
The last issue to deal with is that this middle term
ahi*blo + alo*bhi is a 64-bits value shifted by 32-bits: its 32 low-order bits must be added to the last term
alo*blo whereas its 32 high-order bits must be added to the 1st term
The NumPy implementation just implement what we saw here. As a side-note, I think the fact you can indifferently use NumPy arrays or integers is a testament to the great interface design of NumPy.
The actual implementation looks a little bit obfuscated, because of optimizations to avoid too many array copies. You can compare it to the more straightforward (and slower) version.
OK, I could just have used NumPy with
array(dtype='object') and use the nice built-in, arbitrary sized, integer Python type.
This is true.
On the other hand, this is interesting and this is faster. And in the case of my other project, I can optimize further because I do not need the whole 128-bits result.