I recently came across an interesting problem at work: how do you efficiently transpose a non-square matrix in place?
One of my colleagues is working on optimizing a six-step FFT on our manycore processor, the MPPA-256. The six-step FFT algorithm has a lot of nice properties, especially for highly parallel systems, but …