128-bit multiply with NumPy

For another ongoing project, I needed to do 64-bit x 64-bit to 128-bit multiplies, and I needed it in Python. When doing numerical work in Python, I always use NumPy, which is a really awesome library. Unfortunately, NumPy is heavily biased towards hardware-native types, which makes sense for efficiency but did not help me here.

I decided to write my own NumPy-based 64x64->128 routines.

Some simple maths

The problem with a 64x64 multiply using 64-bit types is that it will obviously overflow. To avoid that, we need to decompose the multiplication into several operations on 64-bit types (the biggest native types we can use), none of which overflows. The trick is to multiply and add 32-bit integers using 64-bit integers: that way we know there will be no overflow.

Here is how to decompose a 64x64 multiplication into 32-bit operations:

  a * b
= ahi|alo * bhi|blo
= (ahi*2^n + alo) * (bhi*2^n + blo)
= (ahi*bhi)*2^(2n) + (ahi*blo)*2^n + (alo*bhi)*2^n + alo*blo
= (ahi*bhi)*2^(2n) + (ahi*blo + alo*bhi)*2^n + alo*blo

With:

  • a the first 64-bit integer, ahi its 32 high-order bits, alo its 32 low-order bits
  • b the second 64-bit integer, bhi its 32 high-order bits, blo its 32 low-order bits
  • n the bit size of the smaller type (in my case n = 32)
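
In NumPy terms, splitting a value into its two 32-bit halves is just a shift and a mask. Here is a small illustrative snippet (split_hi_lo is a name I use only for this example, not part of the real code):

  import numpy as np

  def split_hi_lo(x):
      # Split a uint64 scalar or array into its 32 high-order and its
      # 32 low-order bits, both kept in 64-bit integers so that the
      # later 32x32 products cannot overflow.
      x = np.asarray(x, dtype=np.uint64)
      hi = x >> np.uint64(32)
      lo = x & np.uint64(0xFFFFFFFF)
      return hi, lo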

That's it? Unfortunately no: the terms ahi*bhi, ahi*blo, alo*bhi and alo*blo won't overflow (each is a 32x32->64 multiply), but the sum ahi*blo + alo*bhi can (and will) overflow. We have to decompose it again:

  ahi*blo + alo*bhi
= mid1 + mid2
= mid1hi|mid1lo + mid2hi|mid2lo
= (mid1hi*2^n + mid1lo) + (mid2hi*2^n + mid2lo)
= (mid1hi+mid2hi)*2^n + mid1lo + mid2lo

Looks good. The last issue to deal with is that this middle term ahi*blo + alo*bhi is a value shifted left by 32 bits: its 32 low-order bits must be added to the last term alo*blo, while its high-order bits (plus any carry coming out of that low-word addition) must be added to the first term ahi*bhi.
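
To make the recipe concrete, here is a straightforward, unoptimized sketch of such a routine in NumPy. The function name mul_64_128 and the intermediate variable names are mine, chosen for illustration only; it returns the result as two 64-bit words:

  import numpy as np

  MASK32 = np.uint64(0xFFFFFFFF)
  SHIFT = np.uint64(32)

  def mul_64_128(a, b):
      # Unsigned 64x64 -> 128 multiply: returns the (high, low) 64-bit
      # words of the product. Works on np.uint64 scalars or arrays.
      a = np.asarray(a, dtype=np.uint64)
      b = np.asarray(b, dtype=np.uint64)

      ahi, alo = a >> SHIFT, a & MASK32
      bhi, blo = b >> SHIFT, b & MASK32

      # 32x32 -> 64 partial products: none of these can overflow.
      hi = ahi * bhi
      mid1 = ahi * blo
      mid2 = alo * bhi
      lo = alo * blo

      # Sum the two middle terms half by half, so each sum stays < 2^33.
      mid_hi = (mid1 >> SHIFT) + (mid2 >> SHIFT)
      mid_lo = (mid1 & MASK32) + (mid2 & MASK32)

      # Fold the middle term (shifted left by 32 bits) into the result:
      # its low half goes into the low word, its high half plus the
      # carry out of the low word go into the high word.
      low_sum = (lo >> SHIFT) + mid_lo            # < 2^34, no overflow
      result_lo = (low_sum << SHIFT) | (lo & MASK32)
      result_hi = hi + mid_hi + (low_sum >> SHIFT)
      return result_hi, result_lo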

NumPy implementation

The NumPy implementation just implements what we saw above. As a side note, I think the fact that you can use NumPy arrays or plain integers interchangeably is a testament to NumPy's great interface design.

The actual implementation looks a little obfuscated because of optimizations that avoid too many array copies. You can compare it to the more straightforward (and slower) version.
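
As an illustration of that point, here is what using such a routine could look like, reusing the hypothetical mul_64_128 sketch from above on plain integers as well as on whole arrays, and cross-checking the results against Python's arbitrary-precision integers:

  import numpy as np

  # Works on plain Python integers (or NumPy scalars)...
  hi, lo = mul_64_128(2**64 - 1, 2**63 + 12345)

  # ...and element-wise on whole arrays.
  rng = np.random.default_rng(0)
  a = rng.integers(0, 2**64 - 1, size=1000, dtype=np.uint64, endpoint=True)
  b = rng.integers(0, 2**64 - 1, size=1000, dtype=np.uint64, endpoint=True)
  hi, lo = mul_64_128(a, b)

  # Check against Python's built-in arbitrary-precision integers.
  for x, y, h, l in zip(a.tolist(), b.tolist(), hi.tolist(), lo.tolist()):
      assert (h << 64) | l == x * y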

But why?

OK, I could just have used NumPy with array(dtype='object') and relied on Python's nice built-in arbitrary-size integer type. This is true.
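
For the record, that alternative is essentially a one-liner (a quick sketch):

  import numpy as np

  # dtype=object stores Python ints, so the multiply never overflows,
  # but every element goes through the (slow) Python integer machinery.
  a = np.array([2**64 - 1, 123456789], dtype=object)
  b = np.array([2**63 + 1, 987654321], dtype=object)
  full_products = a * b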

On the other hand, this approach is interesting, and it is faster. And in the case of my other project, I can optimize further because I do not need the whole 128-bit result.