# 128-bits multiply with NumPy

## 128-bits multiply

For another on-going project, I needed to do 64-bits x 64-bits to 128-bits multiply, and I needed it in Python. When doing numerical work with Python, I always use NumPy which is a really awesome library. Unfortunately NumPy is heavily biased towards hardware native types, which makes sense for efficiency but did not helped me.

I decided to write my own NumPy-based 64x64->128 routines.

## Some simple maths

The problem with 64x64 using 64-bits types is that it will obviously overflow. To avoid overflowing, we need to decompose the multiply operation into several operations applying to 64-bits types (which are the bigger native types we can use) without overflow. The trick is to multiply and add 32-bits integers using 64-bits integers: that way we know there will be no overflow.

Here is how to decompose a 64x64 multiplication into 32-bits operations:

```
a * b
= ahi|alo * bhi|blo
= (ahi*2^n + alo) * (bhi*2^n + blo)
= (ahi*bhi)*2^(2n) + (ahi*blo)*2^n + (alo*bhi)*2^n + alo*blo
= (ahi*bhi)*2^(2n) + (ahi*blo + alo*bhi)*2^n + alo*blo
```

With:

`a`

the 1st 64-bits integer,`ahi`

its 32 higher order bits,`alo`

its 32 lower order bits`b`

the 2nd 64-bits integer,`bhi`

its 32 higher order bits,`blo`

its 32 lower order bits`n`

the bit size of the smaller type. In my case n = 32

That's it? Unfortunately no: the terms `ahi*bi`

, `ahi*blo`

, `alo*bhi`

and `alo*blo`

won't overflow (32x32->64), but the term `ahi*blo + alo*bhi`

can (and will) overflow. We have to decompose it again:

```
ahi*blo + alo*bhi
= mid1 + mid2
= mid1hi|mid1lo + mid2hi|mid2lo
= (mid1hi*2^n + mid1lo) + (mid2hi*2^n + mid2lo)
= (mid1hi+mid2hi)*2^n + mid1lo + mid2lo
```

Looks good.
The last issue to deal with is that this middle term `ahi*blo + alo*bhi`

is a 64-bits value shifted by 32-bits: its 32 low-order bits must be added to the last term `alo*blo`

whereas its 32 high-order bits must be added to the 1st term `ahi*bhi`

.

## NumPy implementation

The NumPy implementation just implement what we saw here. As a side-note, I think the fact you can indifferently use NumPy arrays or integers is a testament to the great interface design of NumPy.

The actual implementation looks a little bit obfuscated, because of optimizations to avoid too many array copies. You can compare it to the more straightforward (and slower) version.

## But why?

OK, I could just have used NumPy with `array(dtype='object')`

and use the nice built-in, arbitrary sized, integer Python type.
This is true.

On the other hand, this is interesting and this is faster. And in the case of my other project, I can optimize further because I do not need the whole 128-bits result.