Explorer
379 Views
Registered: ‎03-22-2017

Zero latency for sub-function invocation

Is it possible to have zero latency for sub-function invocations? In other words, when a sub-function is synthesized, a handshake interface is generated that takes 1 clock cycle. How can I remove that without inlining the function?

4 Replies
Scholar u4223374
323 Views
Registered: ‎04-26-2015

Re: Zero latency for sub-function invocation

I don't think you can. Why not just inline the function? That serves much the same purpose in HLS as it does in regular C - it removes the function call overhead.
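
As a minimal sketch of what inlining looks like in Vivado HLS source (the function names `add_sat` and `top` are made up for illustration and are not from this thread): the INLINE pragma asks the tool to dissolve the sub-function into its caller, so it no longer appears as a separate level of RTL hierarchy with its own interface.

```cpp
// Hypothetical helper: saturating add, clamped to the range 0..255.
static int add_sat(int a, int b) {
#pragma HLS INLINE
    long long s = (long long)a + b;   // widen so the sum cannot overflow
    if (s > 255) s = 255;
    if (s < 0) s = 0;
    return (int)s;
}

// Hypothetical top-level function that calls the helper twice.
int top(int x, int y, int z) {
    return add_sat(add_sat(x, y), z);
}
```

A regular C++ compiler ignores the pragma, so the same file can still be used for C simulation.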

Explorer
317 Views
Registered: ‎03-22-2017

Re: Zero latency for sub-function invocation

@u4223374,

 

I would like to have:

- HLS optimizing each function/module locally (without moving resources around)

- well-defined interfaces between modules for debugging purposes

What do you think?

Contributor
205 Views
Registered: ‎03-13-2017

Re: Zero latency for sub-function invocation

@gdg wrote:

Is it possible to have zero latency for sub-function invocations?

HLS translates a function into an entity (in VHDL terms) and each function invocation into an entity instantiation. So a zero-latency function invocation corresponds to the instantiation of a combinational entity, with a handshaking interface but no clock, according to the pragmas specified.

@gdg wrote:

.... In other words, when a sub-function is synthesized, a handshake interface is generated that takes 1 clock cycle. How can I remove that without inlining the function?

I don't see any formal reason that would force a 1-clock interface but, as sometimes happens, HLS could get confused by the user's C coding style and synthesize something different from what was expected.
My golden rule is to build from the bottom up: begin with a simple example, synthesize it and, once it is correct, add new pieces.

As proof, I just tried the HLS example "hier_func": from the "Vivado HLS Welcome Page", click on "Open Example Project" and follow "Select an example/Coding Style Examples/hier_func". The synthesis and co-simulation results obtained with version 2018.2 follow.

#include "hier_func.h"
void sumsub_func(din_t *in1, din_t *in2, dint_t *outSum, dint_t *outSub)
{
#pragma HLS INLINE off
    *outSum = *in1 + *in2;
    *outSub = *in1 - *in2;
}

void shift_func(dint_t *in1, dint_t *in2, dout_t *outA, dout_t *outB)
{
#pragma HLS INLINE off
    *outA = *in1 >> 1;
    *outB = *in2 >> 2;
}
void hier_func(din_t A, din_t B, dout_t *C, dout_t *D)
{
    dint_t apb, amb;
    sumsub_func(&A,&B,&apb,&amb);
    shift_func(&apb,&amb,C,D);
} 
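
Co-simulation needs a C testbench, which is not shown in this post. The following is only a sketch of how the check could be written, under these assumptions: `hier_func.h` is not reproduced here, so `din_t` and `dout_t` are taken to be plain `int`, and the names `hier_fn`, `hier_func_ref` and `count_mismatches` are hypothetical helpers, not part of the Xilinx example.

```cpp
// Assumption: hier_func.h is not shown in the thread; plain int typedefs are used.
typedef int din_t;
typedef int dout_t;

// Pointer type matching the signature of the top-level function in the post.
typedef void (*hier_fn)(din_t, din_t, dout_t *, dout_t *);

// Hypothetical golden model: the same arithmetic written flat, without sub-functions.
void hier_func_ref(din_t A, din_t B, dout_t *C, dout_t *D) {
    *C = (A + B) >> 1;
    *D = (A - B) >> 2;
}

// Runs a design under test against the golden model and returns the mismatch count.
int count_mismatches(hier_fn dut) {
    int errors = 0;
    for (din_t a = 0; a < 16; ++a) {
        for (din_t b = 0; b <= a; ++b) {   // keep A - B non-negative
            dout_t c, d, c_ref, d_ref;
            dut(a, b, &c, &d);
            hier_func_ref(a, b, &c_ref, &d_ref);
            if (c != c_ref || d != d_ref) ++errors;
        }
    }
    return errors;
}
```

In a real testbench, `main` would call `count_mismatches(hier_func)` and return the result, since the co-simulation flow treats a non-zero return value as a failure.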
================================================================
== Vivado HLS Report for 'hier_func'
================================================================
* Date:           Wed Jan 16 17:25:37 2019
* Version:        2018.2 (Build 2258646 on Thu Jun 14 20:25:20 MDT 2018)
* Project:        proj_hier_func
* Solution:       solution1_NO_INLINE
* Product family: kintex7
* Target device:  xc7k160tfbg484-1
================================================================
== Performance Estimates
================================================================
+ Timing (ns): 
    * Summary: 
    +--------+-------+----------+------------+
    |  Clock | Target| Estimated| Uncertainty|
    +--------+-------+----------+------------+
    |ap_clk  |   4.00|     1.785|        0.50|
    +--------+-------+----------+------------+

+ Latency (clock cycles): 
    * Summary: 
    +-----+-----+-----+-----+---------+
    |  Latency  |  Interval | Pipeline|
    | min | max | min | max |   Type  |
    +-----+-----+-----+-----+---------+
    |    0|    0|    0|    0|   none  |
    +-----+-----+-----+-----+---------+

    + Detail: 
        * Instance: 
        +----------------------------+-------------+-----+-----+-----+-----+---------+
        |                            |             |  Latency  |  Interval | Pipeline|
        |          Instance          |    Module   | min | max | min | max |   Type  |
        +----------------------------+-------------+-----+-----+-----+-----+---------+
        |call_ret_sumsub_func_fu_48  |sumsub_func  |    0|    0|    0|    0|   none  |
        |call_ret8_shift_func_fu_56  |shift_func   |    0|    0|    0|    0|   none  |
        +----------------------------+-------------+-----+-----+-----+-----+---------+
        * Loop: 
        N/A
================================================================
== Utilization Estimates
================================================================
* Summary: 
+-----------------+---------+-------+--------+--------+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  |
+-----------------+---------+-------+--------+--------+
|DSP              |        -|      -|       -|       -|
|Expression       |        -|      -|       -|       -|
|FIFO             |        -|      -|       -|       -|
|Instance         |        -|      -|       0|      78|
|Memory           |        -|      -|       -|       -|
|Multiplexer      |        -|      -|       -|       -|
|Register         |        -|      -|       -|       -|
+-----------------+---------+-------+--------+--------+
|Total            |        0|      0|       0|      78|
+-----------------+---------+-------+--------+--------+
|Available        |      650|    600|  202800|  101400|
+-----------------+---------+-------+--------+--------+
|Utilization (%)  |        0|      0|       0|   ~0   |
+-----------------+---------+-------+--------+--------+

+ Detail: 
    * Instance: 
    +----------------------------+-------------+---------+-------+---+----+
    |          Instance          |    Module   | BRAM_18K| DSP48E| FF| LUT|
    +----------------------------+-------------+---------+-------+---+----+
    |call_ret8_shift_func_fu_56  |shift_func   |        0|      0|  0|   0|
    |call_ret_sumsub_func_fu_48  |sumsub_func  |        0|      0|  0|  78|
    +----------------------------+-------------+---------+-------+---+----+
    |Total                       |             |        0|      0|  0|  78|
    +----------------------------+-------------+---------+-------+---+----+
...
================================================================
== Interface
================================================================
* Summary: 
+----------+-----+-----+------------+--------------+--------------+
| RTL Ports| Dir | Bits|  Protocol  | Source Object|    C Type    |
+----------+-----+-----+------------+--------------+--------------+
|ap_start  |  in |    1| ap_ctrl_hs |   hier_func  | return value |
|ap_done   | out |    1| ap_ctrl_hs |   hier_func  | return value |
|ap_idle   | out |    1| ap_ctrl_hs |   hier_func  | return value |
|ap_ready  | out |    1| ap_ctrl_hs |   hier_func  | return value |
|A         |  in |   32|   ap_none  |       A      |    scalar    |
|B         |  in |   32|   ap_none  |       B      |    scalar    |
|C         | out |   32|   ap_vld   |       C      |    pointer   |
|C_ap_vld  | out |    1|   ap_vld   |       C      |    pointer   |
|D         | out |   32|   ap_vld   |       D      |    pointer   |
|D_ap_vld  | out |    1|   ap_vld   |       D      |    pointer   |
+----------+-----+-----+------------+--------------+--------------+

[Figure hier_func_CO-SIM.png: co-simulation. The 4 boxes are the A, B input and C, D output text files.]

[Figure: synthesis of RTL (with hierarchy)]

[Figure: synthesis of RTL (without hierarchy)]


Furthermore, the pragma "function_instantiate" could help. From the manual page: "It creates a unique RTL implementation for each instance of a function, allowing each instance to be optimized"
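
A rough sketch of how that pragma might be applied (the functions `scale` and `top_scale` are invented for illustration; whether it changes the latency in a given design is not something verified here): the pragma names the argument that differs between call sites, so each call can get its own specialized copy of the sub-function.

```cpp
// Hypothetical sub-function: shifts a value right by a per-call amount.
static int scale(int value, int shift) {
#pragma HLS INLINE off
#pragma HLS FUNCTION_INSTANTIATE variable=shift
    return value >> shift;
}

// Hypothetical top-level function: each call passes a different constant shift,
// so each instance of scale() can be optimized for its own constant.
void top_scale(int a, int b, int *outA, int *outB) {
    *outA = scale(a, 1);
    *outB = scale(b, 2);
}
```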

Hope this helps.

 

Scholar u4223374
179 Views
Registered: ‎04-26-2015

Re: Zero latency for sub-function invocation

@gdg I agree, but that's the tradeoff you have to make - if you want proper interfaces between functions, that adds resources and time compared to inlining everything. I tend to design functions to either be very simple (so it makes sense to inline them) or very complex (so the function call overhead is insignificant compared to the function runtime).