I recently came across an interesting problem at work: how do you efficiently transpose a non-square matrix in place?
One of my colleagues is working on optimizing a six-step FFT on our manycore processor, the MPPA-256. The six-step FFT algorithm has a lot of nice properties, especially on highly parallel systems, but contrary to the classic four-step FFT algorithm, it needs 4 matrix transpositions instead of 1. In our case, it also has to be conservative about memory, because each core processes its own set of matrices and, for efficiency, all data must fit in local (fast) memory.
So an efficient in-place transpose operation was needed, and it turns out it is not trivial.
Non-square matrix in-place transpose is HARD
Transposing a square matrix in place is easy: you just swap each element above the diagonal with its mirror below. If the matrix is not square, there is no such easy solution.
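For reference, here is a minimal sketch of the easy square case, using flat row-major storage as in the rest of this article (the function name is mine):

def transpose_square_inplace(A, n):
    # swap each element above the diagonal with its mirror below;
    # no extra storage and no cycle bookkeeping needed
    for i in range(n):
        for j in range(i + 1, n):
            A[i * n + j], A[j * n + i] = A[j * n + i], A[i * n + j]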
In our case, the matrix dimensions are known at compile time, which lets us generate the permutation cycles ahead of time, so that at runtime we can just follow those cycles to do the in-place transpose.
As it is an interesting problem, I decided to dig into it further.
Permutation cycle generation
Let's look at how the cycles are generated. In the rest of the article, I will use the following conventions:
- the matrix is called A
- it is an NxM (N rows, M columns) matrix
- matrices are stored in row-major order (as in C or Python)
- we want to transpose A[N][M] to A[M][N]
If we consider A[N][M] as a flat array A[N*M], we can compute the new index b of an element A[a]:
A[a] == A[n][m] <=> n = a / M and m = a % M (integer division)
A[n][m] should go to A[m][n] in the transpose
A[m][n] == A[b] <=> b = m * N + n = (a % M) * N + a / M
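A quick way to convince yourself of the formula is to check it on a small example (the helper name transposed_index is mine):

N, M = 7, 2  # a 7x2 matrix, as in the example below

def transposed_index(a):
    # split the flat index a into (row n, column m) of the NxM matrix,
    # then return the flat index of (m, n) in the MxN transpose
    n, m = a // M, a % M
    return m * N + n

# the element at row 3, column 1 (flat index 3*M + 1 == 7) must land at
# row 1, column 3 of the 2x7 transpose, i.e. flat index 1*N + 3 == 10
assert transposed_index(7) == 10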
When we move A[a] to A[b], we then need to move the element that was previously in A[b] to its own new place, and so on. This is a kind of recursive operation, corresponding to the following example in Python:
def move(A, start, a, val):
    # place val (the former A[a]) at its transposed position b, then
    # recurse with the element that was evicted from A[b]
    b = (a % M) * N + a // M
    tmp = A[b]
    A[b] = val
    if b != start:  # stop once the cycle loops back to its start
        move(A, start, b, tmp)
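For instance, with N = 7 and M = 2 defined as globals (as above), moving the whole cycle that runs through index 1 would look like this:

A = list(range(N * M))
move(A, 1, 1, A[1])  # rotates every element of that cycle by one position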
A single call only walks one cycle though: unfortunately it won't go through the whole matrix at once, so we will have to process several independent cycles. Once a cycle has been completed, we need to start another cycle from an element which has not been moved yet, and repeat this operation until no unmoved elements are left.
There are several ways to keep track of whether an element was moved or not; in this simple Python example I am using a bytearray to mark each visited element, and I print the cycles as I go through them:
N = 7  # example for a 7x2 matrix
M = 2
A = bytearray(N * M)
try:
    while True:
        a = A.index(0)  # look for the first non-visited element left;
                        # A.index(0) raises a ValueError when no 0
                        # element is found, exiting the infinite loop
        while A[a] != 1:  # loop as long as the cycle continues
            print(a, end=" ")  # print current position
            A[a] = 1  # mark visited
            a = (a % M) * N + a // M  # move to next position
        print()
except ValueError:  # done: A.index(0) found no more unvisited elements
    pass
This gives us:
0
1 7 10 5 9 11 12 6 3 8 4 2
13
For this example we have three cycles: two are one element long and the third is 12 elements long.
If we save the information that we have two 1-element cycles starting at indices 0 and 13 and one 12-element cycle starting at index 1, we have enough information to transpose any 7x2 matrix (the 1-element cycles are fixed points, so in practice they can even be skipped).
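To make the runtime side concrete, here is a minimal sketch of a transpose that follows pre-generated cycles, assuming all we saved is the start index of each cycle (the function and parameter names are mine):

def transpose_with_cycles(A, starts, N, M):
    # follow each cycle from its start, recomputing the destination
    # with the same formula used during cycle generation
    for start in starts:
        a, val = start, A[start]
        while True:
            b = (a % M) * N + a // M  # where A[a] must go
            A[b], val = val, A[b]     # drop val there, pick up the evicted element
            if b == start:            # the cycle is complete
                break
            a = b

A = list(range(14))  # flat 7x2 matrix, A[n][m] == A[M*n + m]
transpose_with_cycles(A, [0, 1, 13], 7, 2)
print(A)  # [0, 2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11, 13]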
Putting it all together
Our initial problem is transposing multiple 256x2 matrices in an algorithm implemented in C.
Based on the previous cycle generation algorithm, I wrote a more complex Python script to directly generate the corresponding C code, and you can find it on GitHub.
It also includes other goodies, such as comparisons against other standard optimization techniques that we can apply to the out-of-place case.
You might argue that storing cycles instead of using an intermediate buffer does not necessarily bring memory savings, but the truth is that this information is much smaller than the matrix itself, and it can also be shared between multiple cores, further amortizing the memory cost.
If we look at our case, a 256x2 float matrix per core for a group of 16 cores (we have 16 groups of 16 cores in the MPPA-256, hence 256 cores, and we can efficiently share memory inside each group), doing an out-of-place transpose would consume 2 * 16 * 256 * 2 * 4 = 64kB. If we do it in place, we only need 58 bytes of shared cycle information plus the matrices themselves: 58 + 16 * 256 * 2 * 4 ≈ 32kB. This is a 50% size reduction!
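For the record, here is the arithmetic behind these figures (the constant names are mine):

FLOAT_BYTES = 4
CORES = 16           # cores sharing local memory in one group
ELEMENTS = 256 * 2   # elements in one 256x2 matrix
CYCLE_INFO = 58      # bytes of shared cycle information

out_of_place = 2 * CORES * ELEMENTS * FLOAT_BYTES       # matrix + scratch buffer
in_place = CYCLE_INFO + CORES * ELEMENTS * FLOAT_BYTES  # matrix + shared cycles
print(out_of_place, in_place)  # 65536 (64kB) vs 32826 (~32kB)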
For this kind of operation, memory speed is the bottleneck, and local (fast) memory is at a premium: keeping as much data in it as we can really pays off.