I am designing a tiled matrix multiplication accelerator on a Zynq XC7Z010 using SDSoC. Here is my hardware function prototype with all of its pragmas:
#pragma SDS data sys_port(in0:ACP, in1:ACP)
#pragma SDS data sys_port(res:ACP)
#pragma SDS data zero_copy(in0[0:size*size], in1[0:size*size])
#pragma SDS data zero_copy(res[0:size*size])
void matmul_block(int x, int y, int z, int size, const int* in0, const int* in1, int* res);
Everything works as expected with the pragmas above: the tiled multiplication runs entirely on the hardware side and I get a result in less than 20 seconds (as measured by the XTime utility).
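For context, a minimal software sketch of one way such a tiled call pattern could look. Everything in it is an assumption made for illustration and is not taken from the actual design: the tile edge `TILE`, the reading of `x`, `y`, `z` as tile indices and of `size` as the full matrix dimension, the row-major layout, and the names `matmul_block_sw` and `matmul_tiled_sw`.

```c
#include <assert.h>
#include <string.h>

#define TILE 4  /* hypothetical tile edge; the real value is not given */

/* Software stand-in for the hardware function (assumed semantics):
 * (x, y, z) are tile indices, size is the full matrix dimension, and the
 * call accumulates res tile (x, y) += in0 tile (x, z) * in1 tile (z, y). */
void matmul_block_sw(int x, int y, int z, int size,
                     const int *in0, const int *in1, int *res)
{
    for (int i = 0; i < TILE; i++) {
        for (int j = 0; j < TILE; j++) {
            int acc = 0;
            for (int k = 0; k < TILE; k++) {
                acc += in0[(x * TILE + i) * size + (z * TILE + k)]
                     * in1[(z * TILE + k) * size + (y * TILE + j)];
            }
            res[(x * TILE + i) * size + (y * TILE + j)] += acc;
        }
    }
}

/* Host-side driver: clears res, then issues one call per tile triple. */
void matmul_tiled_sw(int size, const int *in0, const int *in1, int *res)
{
    assert(size % TILE == 0);  /* sketch only handles whole tiles */
    int nt = size / TILE;
    memset(res, 0, (size_t)size * size * sizeof(int));
    for (int x = 0; x < nt; x++)
        for (int y = 0; y < nt; y++)
            for (int z = 0; z < nt; z++)
                matmul_block_sw(x, y, z, size, in0, in1, res);
}
```

Under these assumptions the accelerator would be invoked (size / TILE)^3 times, so any fixed per-call overhead is paid that many times.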
However, as soon as I switch any one of the three ports to HP (AFI in the pragma), the total execution time reported by XTime jumps to about an hour (around 3900 seconds in my last experiment). Here is the only line I changed in the entire program for that experiment, which moves in0 and in1 to HP:
- #pragma SDS data sys_port(in0:ACP, in1:ACP)
+ #pragma SDS data sys_port(in0:AFI, in1:AFI)
The block design generated by SDSoC and processed by Vivado (see the attached PDF file) seems to contain the right IPs and interconnections: two AXI connections to HP0 and HP1 for in0 and in1, as well as a connection to ACP for res.
Does anyone have a clue why the execution time of a simple matrix multiplication would increase by a factor of nearly 200 (from under 20 seconds to about 3900 seconds) just from using different AXI interfaces?