07-19-2017 05:32 PM
I have a matrix that is about 10k by 10k in size, and it is transferred from the computer to the PS-side DDR. What would be the most efficient way to transpose it?
I believe DMA would not be useful here, because we cannot utilize its burst feature. Are there any other tools that allow fast data transfer between different DDR memory locations?
07-19-2017 05:53 PM - edited 07-19-2017 05:54 PM
Do you need to transpose it at all? Because transposition is costly yet frequently needed, many matrix libraries just set a flag to indicate that the matrix is transposed relative to its layout in memory.
When the matrix is used, say in a matrix-vector multiplication, there are two algorithms to choose from: one for the normal matrix and one for the transposed one. The same goes for element access and so on.
This way you avoid a costly operation entirely, at the price of a few more bytes of code.
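To make the flag idea concrete, here is a minimal sketch in C. The `matrix_view` type and the `mv_*` function names are hypothetical (they are not taken from any particular library), and `float` elements in row-major storage are an assumption, since the element type was not stated. The storage is never rearranged; only the indexing changes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical matrix view: names are illustrative, not from a real library. */
typedef struct {
    const float *data;  /* row-major storage, never rearranged */
    size_t rows;        /* number of stored rows */
    size_t cols;        /* number of stored columns */
    bool transposed;    /* true: interpret the storage as its transpose */
} matrix_view;

/* Logical dimensions, taking the flag into account. */
static size_t mv_rows(const matrix_view *m) { return m->transposed ? m->cols : m->rows; }
static size_t mv_cols(const matrix_view *m) { return m->transposed ? m->rows : m->cols; }

/* Logical element (i, j): the flag only swaps the roles of the two indices. */
static float mv_get(const matrix_view *m, size_t i, size_t j)
{
    return m->transposed ? m->data[j * m->cols + i]
                         : m->data[i * m->cols + j];
}

/* y = A*x when the flag is clear, y = A^T*x when it is set.
 * Both branches walk the storage row by row, so memory access stays sequential. */
static void mv_matvec(const matrix_view *m, const float *x, float *y)
{
    if (!m->transposed) {
        for (size_t r = 0; r < m->rows; r++) {
            float acc = 0.0f;
            for (size_t c = 0; c < m->cols; c++)
                acc += m->data[r * m->cols + c] * x[c];
            y[r] = acc;
        }
    } else {
        for (size_t c = 0; c < m->cols; c++)
            y[c] = 0.0f;
        for (size_t r = 0; r < m->rows; r++) {
            float xr = x[r];
            for (size_t c = 0; c < m->cols; c++)
                y[c] += m->data[r * m->cols + c] * xr;
        }
    }
}
```

The point of the second branch is that the transposed product can be accumulated while still reading the stored rows in order, so no strided access to DDR is needed.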
That said, you could create a component in HLS to perform this operation, give it a full AXI master interface, and connect it to a slave HP port on the Zynq.
07-20-2017 09:20 AM
Right now, I am using the data mover IP on the PL side to read and write PS DDR memory through the High Performance AXI connection. I can get a decent data transfer rate by using a 128-bit bus width with a large burst size. However, if I have to read the transposed matrix out, I can only read one element at a time, then increment the DDR address and read the next one. This would slow down the data transfer rate dramatically.
I am wondering if there are any better tools that I am not aware of for this kind of operation?
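One common way around single-element reads, not something proposed in this thread, is to transpose tile by tile: read a square tile as a set of short contiguous runs, transpose it in a small local buffer, and write it back as contiguous runs. The sketch below is plain C under several assumptions: `float` elements, row-major storage, separate source and destination buffers, and a tile edge `TILE` of 32 chosen only for illustration. The function name `transpose_tiled` is hypothetical, and the two local arrays stand in for whatever on-chip buffer is available.

```c
#include <stddef.h>
#include <string.h>

#define TILE 32  /* assumed tile edge, in elements */

/* Out-of-place transpose of a rows x cols row-major matrix into a
 * cols x rows row-major matrix. src and dst must not overlap. */
void transpose_tiled(const float *src, float *dst, size_t rows, size_t cols)
{
    float in[TILE][TILE];   /* stands in for a local (on-chip) tile buffer */
    float out[TILE][TILE];

    for (size_t r0 = 0; r0 < rows; r0 += TILE) {
        size_t th = rows - r0 < TILE ? rows - r0 : TILE;      /* tile height */
        for (size_t c0 = 0; c0 < cols; c0 += TILE) {
            size_t tw = cols - c0 < TILE ? cols - c0 : TILE;  /* tile width */

            /* Read: th runs of tw contiguous elements each. */
            for (size_t r = 0; r < th; r++)
                memcpy(in[r], &src[(r0 + r) * cols + c0], tw * sizeof(float));

            /* Transpose inside the local buffer, where strided access is cheap. */
            for (size_t r = 0; r < th; r++)
                for (size_t c = 0; c < tw; c++)
                    out[c][r] = in[r][c];

            /* Write: tw runs of th contiguous elements each. */
            for (size_t c = 0; c < tw; c++)
                memcpy(&dst[(c0 + c) * rows + r0], out[c], th * sizeof(float));
        }
    }
}
```

With 4-byte elements and a tile edge of 32, every run is 128 bytes, which is eight beats on a 128-bit bus instead of one element per access. The same loop structure also works as cache blocking when run on the ARM cores; the best tile size in either case would have to be measured.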
07-20-2017 09:33 AM
07-20-2017 02:33 PM
Thanks for your input.
I am using the ZCU102 evaluation board, and the PL DDR is not large enough. The MIG also takes quite a bit of space in the PL.
I did test using the ARM for the transpose, and it was faster than I expected, but I am not sure if it is fast enough for my application.
There are 4 HP ports, would using all four of them give me a higher data rate?
07-20-2017 03:00 PM
Yes, I think so.
Have a look at this SDK performance manual.
Chapter 7, "Evaluating High-Performance Ports", shows many statistics about using all ports simultaneously.
This video is also a good overview.