Observer kritika117
Registered: 03-07-2018

How to get complete latency for mentioned code?

Hello,

Below is my HLS code for a 32x32 matrix multiplication. I am trying to send the result 'C' to DDR memory through an m_axi interface.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void mmult (volatile float * ddr)
{
#pragma HLS INTERFACE m_axi depth=1024 port=ddr
#pragma HLS INTERFACE ap_none port=return

    float Abuf[32][32], Bbuf[32][32];
#pragma HLS array_partition variable=Abuf block factor=16 dim=2
#pragma HLS array_partition variable=Bbuf block factor=16 dim=1
    float C[1024];

    for(int i=0; i<32; i++) {
        for(int j=0; j<32; j++) {
#pragma HLS PIPELINE
            Abuf[i][j] = i+j;
            Bbuf[i][j] = i*j;
        }
    }

    for (int i = 0; i < 32; i++) {
        for (int j = 0; j < 32; j++) {
#pragma HLS PIPELINE
            float result = 0;
            for (int k = 0; k < 32; k++) {
                float term = Abuf[i][k] * Bbuf[k][j];
                result += term;
            }
            C[i * 32+ j] = result;
        }
    }
    memcpy((float*)(ddr), C, 1024*sizeof(float));
}

 

When I synthesize this code in HLS, it gives the following report:

================================================================
== Vivado HLS Report for 'mmult'
================================================================
* Date: Fri Apr 5 17:54:00 2019

* Version: 2017.4 (Build 2086221 on Fri Dec 15 21:13:33 MST 2017)
* Project: test_s1
* Solution: solution1
* Product family: kintex7
* Target device: xc7k325tffg900-2


================================================================
== Performance Estimates
================================================================
+ Timing (ns):
* Summary:
+--------+-------+----------+------------+
|  Clock | Target| Estimated| Uncertainty|
+--------+-------+----------+------------+
|ap_clk  |  10.00|      8.75|        1.25|
+--------+-------+----------+------------+

+ Latency (clock cycles):
* Summary:
+------+------+------+------+---------+
|   Latency   |   Interval  | Pipeline|
|  min |  max |  min |  max |   Type  |
+------+------+------+------+---------+
|  1031|  1031|  1031|  1031|   none  |
+------+------+------+------+---------+

+ Detail:
* Instance:
N/A

* Loop:
+--------------------+------+------+----------+-----------+-----------+------+----------+
|                    |   Latency   | Iteration|  Initiation Interval  | Trip |          |
|      Loop Name     |  min |  max |  Latency |  achieved |   target  | Count| Pipelined|
+--------------------+------+------+----------+-----------+-----------+------+----------+
|- memcpy.ddr.C.gep  |  1025|  1025|         3|          1|          1|  1024|    yes   |
+--------------------+------+------+----------+-----------+-----------+------+----------+

 

================================================================
== Utilization Estimates
================================================================
* Summary:
+-----------------+---------+-------+--------+--------+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  |
+-----------------+---------+-------+--------+--------+
|DSP              |        -|      -|       -|       -|
|Expression       |        -|      -|       0|      63|
|FIFO             |        -|      -|       -|       -|
|Instance         |        2|      -|     512|     580|
|Memory           |        2|      -|       0|       0|
|Multiplexer      |        -|      -|       -|     105|
|Register         |        -|      -|      57|       -|
+-----------------+---------+-------+--------+--------+
|Total            |        4|      0|     569|     748|
+-----------------+---------+-------+--------+--------+
|Available        |      890|    840|  407600|  203800|
+-----------------+---------+-------+--------+--------+
|Utilization (%)  |    ~0   |      0|   ~0   |   ~0   |
+-----------------+---------+-------+--------+--------+

I have checked in the analysis view and in co-simulation: this latency of 1031 cycles is only for the memcpy. In other words, the synthesis report shows only the clock cycles for transferring C[1024] to the ddr port, not for its calculation. How can I check the complete latency (calculation time plus transfer time)? What is missing in my code?

Accepted Solutions
Scholar u4223374
Registered: 04-26-2015

Re: How to get complete latency for mentioned code?

The fact that it's using zero DSP slices should give you a clue. Your "calculation" never reads any input, so HLS can compute it during synthesis and hard-code the result. That's exactly what HLS is doing, so all that's left in your function is the memcpy that writes the pre-computed data to RAM.

If you modify your code to do something useful (e.g. read two matrices from RAM, multiply them, write the result back to RAM) then you'll get a sensible result.
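
A minimal sketch of that kind of change, not a tuned design. The memory layout is my own assumption (A in the first 1024 floats of the buffer, B in the next 1024, C written to the following 1024), and the name mmult_ddr and the depth value are hypothetical. The point is only that the operands now come from outside the block, so HLS cannot fold the multiplication away at synthesis time.

```c
#include <string.h>

#define N 32

/* Assumed layout (hypothetical): A at ddr[0..1023], B at ddr[1024..2047],
 * C written to ddr[2048..3071]. */
void mmult_ddr(volatile float *ddr)
{
#pragma HLS INTERFACE m_axi depth=3072 port=ddr

    float Abuf[N][N], Bbuf[N][N], Cbuf[N * N];

    /* Read both operands from external memory into local buffers. */
    memcpy(Abuf, (const float *)ddr, N * N * sizeof(float));
    memcpy(Bbuf, (const float *)ddr + N * N, N * N * sizeof(float));

    /* Same multiply loop as in the question, now on run-time data. */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE
            float result = 0;
            for (int k = 0; k < N; k++) {
                result += Abuf[i][k] * Bbuf[k][j];
            }
            Cbuf[i * N + j] = result;
        }
    }

    /* Write the product back to external memory. */
    memcpy((float *)ddr + 2 * N * N, Cbuf, N * N * sizeof(float));
}
```

With something like this, the report should list the read loops and the multiply loop next to the write-back memcpy, and the DSP count should no longer be zero.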

Observer kritika117
Registered: 03-07-2018

Re: How to get complete latency for mentioned code?

Thanks @u4223374

You are correct. But isn't it possible to supply the input values from the HLS code itself rather than from RAM?

Scholar u4223374
Registered: 04-26-2015

Re: How to get complete latency for mentioned code?

@kritika117 How are you planning to get the values into the HLS block in the first place? There are lots of options - AXI Streams coming from DMAs (or other streaming blocks), external BRAM connections, raw data ports (not advisable for this as it'll be 64,000+ wires coming from the block), AXI Lite slaves talking to internal BRAM, or AXI Masters to read from external RAM.

 

The only thing you can't do is hard-code the values within the HLS block, because that's pointless (if values are hard-coded, HLS will just compute the result during synthesis).
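
As a rough sketch of one of those options (external BRAM connections), the matrices can be passed in as array arguments, so their values are not known at synthesis time. The name mmult_bram is hypothetical, and I'm assuming the bram interface mode here; ap_memory would be the plainer alternative. For scale, two 32x32 matrices of 32-bit floats brought in as raw data ports would be 2 * 1024 * 32 = 65,536 wires, which is presumably the figure meant above.

```c
#define N 32

/* A and B are filled by whatever drives the BRAM ports; C is read back
 * from its BRAM by the rest of the system. */
void mmult_bram(float A[N][N], float B[N][N], float C[N][N])
{
#pragma HLS INTERFACE bram port=A
#pragma HLS INTERFACE bram port=B
#pragma HLS INTERFACE bram port=C

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE
            float result = 0;
            for (int k = 0; k < N; k++) {
                result += A[i][k] * B[k][j];
            }
            C[i][j] = result;
        }
    }
}
```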

Observer kritika117
Registered: 03-07-2018

Re: How to get complete latency for mentioned code?

@u4223374 

Yes, I want to initialize input values in HLS code only. Is it possible?

Scholar u4223374
Registered: 04-26-2015

Re: How to get complete latency for mentioned code?

If you want to initialize the results in the HLS code, then HLS is already doing exactly the right thing - pre-computing all the answers (so it doesn't have to build a floating-point IP core) and just writing them out to RAM.
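
To make that concrete, here is a small host-side check (plain C, not meant for synthesis; the function names are made up for illustration). With Abuf[i][j] = i+j and Bbuf[i][j] = i*j as in the original code, every output is fixed before the hardware ever runs: C[i][j] = sum over k of (i+k)*(k*j) = j * (496*i + 10416), because the sum of k for k = 0..31 is 496 and the sum of k*k is 10416.

```c
#define N 32

/* Same arithmetic as the loops in the question, run on the host. */
void mmult_constant_inputs(float C[N * N])
{
    float Abuf[N][N], Bbuf[N][N];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            Abuf[i][j] = i + j;
            Bbuf[i][j] = i * j;
        }
    }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float result = 0;
            for (int k = 0; k < N; k++) {
                result += Abuf[i][k] * Bbuf[k][j];
            }
            C[i * N + j] = result;
        }
    }
}

/* Closed form of the same entry; it depends only on the indices. */
float expected_entry(int i, int j)
{
    return (float)(j * (496 * i + 10416));
}
```

Since nothing here depends on run-time data, a table of 1024 constants plus the memcpy is all the hardware needs, which matches the report above.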
