## 128-bits multiply with NumPy

For another on-going project, I needed to do 64-bits x 64-bits to 128-bits multiply, and I needed it in Python. When doing numerical work with Python, I always use NumPy which is a really awesome library. Unfortunately NumPy is heavily biased towards hardware native types, which makes sense for efficiency but did not helped me.

I decided to write my own NumPy-based 64x64->128 routines.

## Some simple maths

The problem with 64x64 using 64-bits types is that it will obviously overflow. To avoid overflowing, we need to decompose the multiply operation into several operations applying to 64-bits types (which are the bigger native types we can use) without overflow. The trick is to multiply and add 32-bits integers using 64-bits integers: that way we know there will be no overflow.

Here is how to decompose a 64x64 multiplication into 32-bits operations:

```  a * b
= ahi|alo * bhi|blo
= (ahi*2^n + alo) * (bhi*2^n + blo)
= (ahi*bhi)*2^(2n) + (ahi*blo)*2^n + (alo*bhi)*2^n + alo*blo
= (ahi*bhi)*2^(2n) + (ahi*blo + alo*bhi)*2^n + alo*blo
```

With:

• `a` the 1st 64-bits integer, `ahi` its 32 higher order bits, `alo` its 32 lower order bits
• `b` the 2nd 64-bits integer, `bhi` its 32 higher order bits, `blo` its 32 lower order bits
• `n` the bit size of the smaller type. In my case n = 32

That's it? Unfortunately no: the terms `ahi*bi`, `ahi*blo`, `alo*bhi` and `alo*blo` won't overflow (32x32->64), but the term `ahi*blo + alo*bhi` can (and will) overflow. We have to decompose it again:

```  ahi*blo + alo*bhi
= mid1 + mid2
= mid1hi|mid1lo + mid2hi|mid2lo
= (mid1hi*2^n + mid1lo) + (mid2hi*2^n + mid2lo)
= (mid1hi+mid2hi)*2^n + mid1lo + mid2lo
```

Looks good. The last issue to deal with is that this middle term `ahi*blo + alo*bhi` is a 64-bits value shifted by 32-bits: its 32 low-order bits must be added to the last term `alo*blo` whereas its 32 high-order bits must be added to the 1st term `ahi*bhi`.

## NumPy implementation

The NumPy implementation just implement what we saw here. As a side-note, I think the fact you can indifferently use NumPy arrays or integers is a testament to the great interface design of NumPy.

The actual implementation looks a little bit obfuscated, because of optimizations to avoid too many array copies. You can compare it to the more straightforward (and slower) version.

## But why?

OK, I could just have used NumPy with `array(dtype='object')` and use the nice built-in, arbitrary sized, integer Python type. This is true.

On the other hand, this is interesting and this is faster. And in the case of my other project, I can optimize further because I do not need the whole 128-bits result.