Hecmaytr
Observer
226 Views
Registered: ‎08-05-2020

FPGA bitstream execution time much slower than HLS estimate

Hi,

I am using a U280 with Vitis 2020.2 to synthesize a very simple program into hardware. Here is the program I used. It is just a loop nest that reads from off-chip memory.

==============================

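int temp[10] = {0};  // accumulators, zero-initialized

for (int x = 0; x < 1800; x++) {
  // inner loop starts at 0 so that all 10 rows are accumulated
  for (int y = 0; y < 10; y++) {
    temp[y] += input[y][x];
  }
}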

==============================

 

Both temp and input are partitioned into 10 pieces, so the outer loop can be pipelined with II=1. The HLS latency estimate looks good to me: the total cycle estimate is around 2000, and the total latency estimate is around 6.569 us.
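For reference, the setup described above could be written roughly as follows. This is only a sketch: the function name `column_sum`, the array-style interface, and the exact pragma placement are assumptions, not the code that was actually synthesized. The pragmas are standard Vitis HLS directives, and an ordinary C++ compiler ignores them, so the function can also be checked in software.

```cpp
constexpr int ROWS = 10;    // partition factor stated in the post
constexpr int COLS = 1800;  // outer-loop trip count stated in the post

// Hypothetical top-level function; the name and interface are assumptions.
void column_sum(const int input[ROWS][COLS], int temp[ROWS]) {
// Split both arrays into 10 independent pieces so that all rows
// can be accessed in the same cycle.
#pragma HLS ARRAY_PARTITION variable=input complete dim=1
#pragma HLS ARRAY_PARTITION variable=temp complete dim=1

    // Clear the accumulators before summing.
    for (int y = 0; y < ROWS; y++) {
#pragma HLS UNROLL
        temp[y] = 0;
    }

    for (int x = 0; x < COLS; x++) {
// Pipelining the outer loop asks the tool to unroll the inner loop
// and to start one outer iteration per cycle.
#pragma HLS PIPELINE II=1
        for (int y = 0; y < ROWS; y++) {
            temp[y] += input[y][x];
        }
    }
}
```

With 1800 outer iterations at II=1 plus the pipeline depth, a cycle count of around 2000 is what one would expect from this structure.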

 

However, when I ran the bitstream on real hardware, the measured runtime was around 0.3 ms, which is about 50x slower than the HLS estimate. The runtime is taken from the Vitis profiler. (The profiling hardware does introduce some overhead, but I doubt that is the reason.)

I know that real off-chip DDR/HBM accesses may introduce some overhead, but this is still far beyond my expectation. Any thoughts on how to debug this performance issue?

0 Kudos
2 Replies
maxdz8
Adventurer
153 Views
Registered: ‎01-08-2018

Let's see: we have a synthetic benchmark on a trivial amount of data. AFAIK the HLS estimate counts the clocks from function start to "function able to start again" and to "result produced"; those are best-case figures. The Vitis profiler will include all the communication and dispatch overhead, as far as I understand. Let me tell you, being just 50x slower is pretty much a dream to me; a consumer 3D card from 2012 would be 1000x slower for such work.

By the way, I don't see how you can dismiss "DDR/HBM access may introduce some overhead". For such a simple kernel, memory bandwidth or even just latency will be the limiting factor; good thing all this does not need to get to external memory at all.

Are you sure you're measuring the same things?

As far as I can tell there is certainly nothing to "debug" here, but I would encourage you to profile real work. Writing meaningful benchmarks (especially synthetic ones) needs very special care.

0 Kudos
Hecmaytr
Observer
127 Views
Registered: ‎08-05-2020

I guess I need to provide more details on my concerns, in case you think this is a super dumb question.

 

1. I am measuring the "Kernel Execution Time" from the Vitis profiler, which only covers execution and does not include any communication or dispatch overhead. The other overheads you mentioned are listed separately in the Vitis profiler. In other words, the HLS estimate should be close to the "Kernel Execution Time" if there are no runtime stalls (e.g., memory or FIFO stalls).

2. I agree that DDR/HBM access is indeed much slower than on-chip access. But the access latency for each memory transaction is usually no more than 150 cycles. At a frequency of 300 MHz (about 3.3 ns per cycle), in the very worst case where all accesses are processed sequentially, the total memory access time should be no more than 150 * 3.3 ns * (1800/16) ~= 56 us with burst_length=16. That still leaves a gap.

3. Another key piece of information, which should be where the problem lies: I removed the function body, inserted only a single memory write statement like `output[0] = 1;`, and measured the kernel execution time again. The measured time is 0.19 ms, while the runtime for our synthetic application is 0.26 ms. Again, the measured time does not include any other runtime overhead. So an almost empty application already takes 0.19 ms, which is not what I expected.
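To make the numbers in points 2 and 3 easier to compare, here is a small sketch of the arithmetic. All constants are the figures quoted above; the helper names are made up for illustration, and the 150-cycle latency is the assumed worst case from point 2, not a measured value.

```cpp
#include <cstdio>

// Figures quoted in this thread.
constexpr double kClockNs       = 3.3;    // clock period at ~300 MHz
constexpr double kLatencyCycles = 150.0;  // assumed worst-case latency per transaction
constexpr double kElements      = 1800.0; // outer-loop trip count
constexpr double kBurstLength   = 16.0;   // burst_length from point 2
constexpr double kEmptyKernelUs = 190.0;  // 0.19 ms, kernel with a single write
constexpr double kFullKernelUs  = 260.0;  // 0.26 ms, kernel with the loop nest

// Worst case from point 2: every burst pays the full latency, one after another.
constexpr double worst_case_memory_us() {
    return kLatencyCycles * kClockNs * (kElements / kBurstLength) / 1000.0;
}

// Time the loop nest adds on top of the almost empty kernel (point 3).
constexpr double loop_attributable_us() {
    return kFullKernelUs - kEmptyKernelUs;
}

// Print both figures side by side.
void print_breakdown() {
    std::printf("worst-case memory time: %.2f us\n", worst_case_memory_us());
    std::printf("time added by the loop: %.2f us\n", loop_attributable_us());
    std::printf("fixed part (empty kernel): %.2f us\n", kEmptyKernelUs);
}
```

Read this way, the loop nest itself adds about 0.07 ms, which is in the same range as the sequential worst-case estimate, while the remaining 0.19 ms is present even when the kernel does almost nothing.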

 

0 Kudos