UPGRADE YOUR BROWSER

We have detected your current browser version is not the latest one. Xilinx.com uses the latest web technologies to bring you the best online experience possible. Please upgrade to a Xilinx.com supported browser:Chrome, Firefox, Internet Explorer 11, Safari. Thank you!

cancel
Showing results for 
Search instead for 
Did you mean: 
Visitor brugger
Visitor
14,226 Views
Registered: ‎08-21-2013

HLS Compiler Bug? Streaming Based Design

Jump to solution

Hi,

 

I wrote a streaming based design in Vivado HLS 2013.2. It passes the testbench in C-Simulation, however it fails the testbench when doing RTL Co-simulation.

 

The design has an input and an output stream. When called it consumes 10 values from the input streams and then replicates these 10 inputs fife times for the output stream. The code can be downloaded at the bottowm.

 

 stream_issue_hls.cpp - defines the streaming function

#include "stream_issue_hls.hpp"

void stream_issue(hls::stream<uint32_t> &input, 
        hls::stream<uint32_t> &output)
{
    hls::stream<uint32_t> fifo_10;
    #pragma HLS stream depth=10 variable=fifo_10

    uint32_t cnt = 0;
    for (int i = 0; i < 50; ++i) {
        #pragma HLS PIPELINE II=1

        uint32_t index;
        if (cnt < 10)
            index = input.read();
        else
            index = fifo_10.read();

        output.write(index);
        fifo_10.write(index);

        if (cnt < 10)
            ++cnt;
    }
}

 

 stream_issue_hls.hpp - header file

#ifndef __STREAM_ISSUE_HLS_HPP__
#define __STREAM_ISSUE_HLS_HPP__

#include <ap_int.h>
#include <hls_stream.h>

void stream_issue(hls::stream<uint32_t> &input, 
        hls::stream<uint32_t> &output);

#endif

 

 stream_issue_tb.cpp - testbench

#include "stream_issue_hls.hpp"

#include <iostream>

int main() {
    // setup input data
    hls::stream<uint32_t> input;
    for (int i = 0; i < 10; ++i) {
        input.write(i);
    }
    hls::stream<uint32_t> output;

    // run kernel
    stream_issue(input, output);

    // verify output
    bool error = false;
    for (int i = 0; i < 50; ++i) {
        uint32_t hw = output.read();
        uint32_t sw = i % 10;
        if (hw != sw) {
            std::cout << "ERROR: " << i << " - " << hw 
                    << " != "<< sw << std::endl;
            error = true;
        }
    }
    if (output.size() != 0) {
        std::cout << "ERROR: out.size() = " << 
                output.size() << " !=  0"<< std::endl;
        error = true;
    }
    std::cout << "done." << std::endl;
    return error;
}

 

C-Simulation output:

WARNING: Hls::stream 'hls::stream<unsigned int>.3' contains leftover data, which may result in RTL simulation hanging.
done.

 

SystemC Co-simulation output:

@I [SIM-11] Starting SystemC simulation ...

             SystemC 2.2.0 --- Jun 13 2013 15:35:04
        Copyright (c) 1996-2006 by all Contributors
                    ALL RIGHTS RESERVED
Note: VCD trace timescale unit is set by user to 1.000000e-12 sec.
SystemC: simulation stopped by user.
@I [SIM-316] Starting C post checking ...
ERROR: 10 - 9 != 0
ERROR: 11 - 9 != 1
ERROR: 12 - 9 != 2
...
ERROR: 47 - 9 != 7
ERROR: 48 - 9 != 8
done.
@E [SIM-303] C TB post check failed, nonzero return value '1'
. @E [SIM-4] *** C/RTL co-simulation finished: FAIL ***

 

Why does the RTL compiler not understand what I want to do? Can somebody reproduce this issue? Is this a compiler bug? Or am I simply using the language wrongly? I am completely lost here.

 

Best regards,

Christian

Tags (3)
0 Kudos
1 Solution

Accepted Solutions
Highlighted
Moderator
Moderator
21,137 Views
Registered: ‎04-17-2011

Re: HLS Compiler Bug? Streaming Based Design

Jump to solution

Please change your coding style in stream_issue_hls.cpp as :

 

for (int i = 0; i < 50; ++i) {
#pragma HLS PIPELINE II=1

uint32_t buffer[10];

if (i < 10) {
// captures the first 10
buffer [i] = input.read();
}

output.write(buffer[i%10]);

}

 

using an Array. 

Attached is the code for your refenence.

Regards,
Debraj
----------------------------------------------------------------------------------------------
Kindly note- Please mark the Answer as "Accept as solution" if information provided is helpful.

Give Kudos to a post which you think is helpful and reply oriented.
----------------------------------------------------------------------------------------------
9 Replies
Moderator
Moderator
14,179 Views
Registered: ‎04-17-2011

Re: HLS Compiler Bug? Streaming Based Design

Jump to solution
Thanks for the archive. Have copied your file and checking.
Regards,
Debraj
----------------------------------------------------------------------------------------------
Kindly note- Please mark the Answer as "Accept as solution" if information provided is helpful.

Give Kudos to a post which you think is helpful and reply oriented.
----------------------------------------------------------------------------------------------
0 Kudos
Highlighted
Moderator
Moderator
21,138 Views
Registered: ‎04-17-2011

Re: HLS Compiler Bug? Streaming Based Design

Jump to solution

Please change your coding style in stream_issue_hls.cpp as :

 

for (int i = 0; i < 50; ++i) {
#pragma HLS PIPELINE II=1

uint32_t buffer[10];

if (i < 10) {
// captures the first 10
buffer [i] = input.read();
}

output.write(buffer[i%10]);

}

 

using an Array. 

Attached is the code for your refenence.

Regards,
Debraj
----------------------------------------------------------------------------------------------
Kindly note- Please mark the Answer as "Accept as solution" if information provided is helpful.

Give Kudos to a post which you think is helpful and reply oriented.
----------------------------------------------------------------------------------------------
Visitor brugger
Visitor
14,157 Views
Registered: ‎08-21-2013

Re: HLS Compiler Bug? Streaming Based Design

Jump to solution

Thanks for looking into this, Debraj.

 

Do I understand you correctly, using internal hls:streams is not supported for now and a workaround is to use arrays? In that case we would have come to the same conclusion.

 

I would really favour streams here, because:

  • they have a very intuitive interface
  • you don't have to maintain read and write indeces

Using arrays has some challenges:

  • It is hard to achieve good pipelining
    • e.g. your design has an InitiationInterval of 2 instead of the requested 1
    • Here it can be easily fixed with: #pragma HLS dependence variable=buffer false
  • You have to be vary carfull with index handling
    • a simple buffer[i % 10] as proposed by you produces a modulo operator "urem" with 92 registers and 75 LUTs
    • Changing the design to an outer and inner loop reduces the total usage from 119 to 32 registers and from 110 to 54 LUTs
    • When reading and writing conditionally you cannot use for loops, you have to maintain the index on your own. This clutters the code and is prone to errors.
  • Using only one index with read after write logic for the BRAM is very bad for timing.
    • In our example here it is not an issue, as the read data is not written back into the buffer. 
    • In the case of a loop it was necessary for me to use a second counter and introducing a register for buffering. Much more cluttering.

 

All these details are nicely handled by hls::stream. Can you please make sure that this will be fixed for the next Vivado HLS version?

 

Best regards,

Christian

 

PS: I am using arrays for now, however it makes the code harder to understand. I am currently considering writing my own class that wrapps all the index handling for arrays with similar interfaces as hls::stream.

0 Kudos
Xilinx Employee
Xilinx Employee
14,147 Views
Registered: ‎08-17-2011

Re: HLS Compiler Bug? Streaming Based Design

Jump to solution

Hello Christian,

 

a few notes:

 

1-

in this code, VHLS may not always optimize array reads / writes as working out that when I is smaller than 10, then there are no issues isn't trivial:

if (i < 10) {
   // captures the first 10
   buffer [i] = input.read();
}

output.write(buffer[i%10]);

 

I'd suggest further changes to the code like something along those lines:

if (i < 10) {
   // captures the first 10

   int t = input.read();
   buffer [i] = t;

   output.write(t);

} else { // EDITED 6th September: added else { because there is an output.write() call above - without else then 2 writes would be performed in sequence - this is incorrect.

output.write(buffer[i%10]);

} // EDITED 6th September: added closing }

 

With the above, you should be able to *not* use the dependence directive.

 

(for anyone reading this : using the directive dependence = false is to be used with caution! ALWAYS)

 

2-

the reminder operand can also be recoded with something like this, as the original index is sequential - that's actually what you would do when writing RTL.

ap_uint<4> buf_index = 0;

...

buf_index = (9==buf_index) ? 0 : buf_index + 1;

 

3-

I'm not sure on the root cause but there's an issue somewhere (your code / VHLS?) .. the example shipped with VHLS 2013.2 works... C:\Xilinx\Vivado_HLS\2013.2\examples\design\hls_stream

 

If you look into the RTL generated (eg syn\vhdl\stream_issue.vhd), the instantiated fifo “fifo_10_V_fifo_10_V_fifo_U” has the input port if_din connected to a signal stuck to 0.

 

 

We will have someone look into this further.

 

The previous workaround (or any other and their variations) can be used.

 

 

I hope this helps!

- Hervé

SIGNATURE:
* New Dedicated Vivado HLS forums* http://forums.xilinx.com/t5/High-Level-Synthesis-HLS/bd-p/hls
* Readme/Guidance* http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

* Please mark the Answer as "Accept as solution" if information provided is helpful.
* Give Kudos to a post which you think is helpful and reply oriented.
Xilinx Employee
Xilinx Employee
14,145 Views
Registered: ‎08-17-2011

Re: HLS Compiler Bug? Streaming Based Design

Jump to solution
Actually this issue 3-"input port if_din connected to a signal stuck to 0" is gone with the internal builds for the next release.
No further action required from me :)
- Hervé

SIGNATURE:
* New Dedicated Vivado HLS forums* http://forums.xilinx.com/t5/High-Level-Synthesis-HLS/bd-p/hls
* Readme/Guidance* http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

* Please mark the Answer as "Accept as solution" if information provided is helpful.
* Give Kudos to a post which you think is helpful and reply oriented.
Moderator
Moderator
14,123 Views
Registered: ‎04-17-2011

Re: HLS Compiler Bug? Streaming Based Design

Jump to solution
Do you have any further questions? Hope the details message from Herve was helpful. Feel free to post your comments.
Regards,
Debraj
----------------------------------------------------------------------------------------------
Kindly note- Please mark the Answer as "Accept as solution" if information provided is helpful.

Give Kudos to a post which you think is helpful and reply oriented.
----------------------------------------------------------------------------------------------
0 Kudos
Visitor brugger
Visitor
14,115 Views
Registered: ‎08-21-2013

Re: HLS Compiler Bug? Streaming Based Design

Jump to solution

Dear herver, debrajr,

 

I am very happy to hear that this will be fixed in the upcomming release.

 

Thank you for your note 2 and 3. Note 2 helped me to simplify my code and 3 will help me to verify that a second bug is caused by the same issue.

 

1-

Regarding note 1, I am still not fully convinced, that it is always possible to avoid the dependency directive. I am aware that using this pragma is very dangerous for pipelined designs. How would you remove it in the following design? Without the pragma, it has an initiation interval of 2 instead of the requested 1. When added, it works fine, even in SystemC/Verilog/VHDL Co-Simulation. (source files attached below)

 

void stream_issue(hls::stream<uint32_t> &input, hls::stream<uint32_t> &output)
{
	uint32_t buffer[10];
	// Uncomment the following line to meet initiation interval of 1
//	#pragma HLS dependence variable=buffer false
	ap_uint<4> buf_index = 0;
	for (int i = 0; i < 50; ++i) {
		#pragma HLS PIPELINE II=1
		int t;
		if (i < 10) {
			// captures the first 10
			t = input.read();
			buffer[buf_index] = t;
		} else {
			// increment by one
			//WARNING: if the following transformation takes more than 10 cycles
			//  this design won't work. To test this, replace it with
			//  "t = exp(buffer[buf_index]);" which takes 23 cycles.
			t = buffer[buf_index] + 1;
			buffer[buf_index] = t;
		}
		output.write(t);
		buf_index = (buf_index == 9) ? 0 : buf_index + 1;
	}
}

 

2-

I now see that arrays are also pipelined if necessary. I am impressed, I didn't expected this. Very nice!

 

That means when my original design failed timing (during place & route) it was probably an error in the timing estimation of Vivado HLS. I remeber HLS didn't reported a timing violation. I assume that is why it put to much logic into one timing slot. Is there some simple way to introduce a pipeline stage into a path? I looked at the pragmas, but couln't find anything. In my case I did it manually, by pipelining the read with a new variable that is assigned in the last loop iteration. (read[buf_index] = complex_xor(t_delay); t_delay = read[buf_index_next]; ++buf_index; ++buf_index_next; ...wrap_index...)

 

Regards,

Christian

 

 

0 Kudos
Xilinx Employee
Xilinx Employee
14,081 Views
Registered: ‎08-17-2011

Re: HLS Compiler Bug? Streaming Based Design

Jump to solution

Hello Christian,

 

I haven't looked into your new zip, so my comments are to take with a pinch of salt :)

 

for your 1-,

I'm not sure about the behavior you describe since those 2 lines

t = buffer[buf_index] + 1;

buffer[buf_index] = t;

imply a read-modify-write on buffer[buf_index] location (probably implemented with BRAM): this is a latency of 2 so II == 2 seems logical -> so the dependence set to false may trigger the array access pipelining that you're talking about ? if cosim still works then i believe the results are correct - until proven otherwise.

 

also be aware that the directive will change the scheduling / resources used.

something you may also look into is the partionning of the buffer into registers (directive partion -dim = 0) so that there is no dependency.

 

for your 2-

there's a reg class template described in UG902 (should be in the includes header files, somewhere too) that will do exactly this: add a register on the signal that you want, and this will have the effect of changing the generated RTL as the register adds latency- i'm sure you'll appreciate that this is constraining more the design and the tool's freedom : you may end-up with suboptimal results (so again, use with caution otherwise you may not like the results).

 

if you can confirm more around the topic on 1- that would be nice and good learning :)

- Hervé

SIGNATURE:
* New Dedicated Vivado HLS forums* http://forums.xilinx.com/t5/High-Level-Synthesis-HLS/bd-p/hls
* Readme/Guidance* http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

* Please mark the Answer as "Accept as solution" if information provided is helpful.
* Give Kudos to a post which you think is helpful and reply oriented.
0 Kudos
Visitor brugger
Visitor
14,077 Views
Registered: ‎08-21-2013

Re: HLS Compiler Bug? Streaming Based Design

Jump to solution

Hello herver,

 

1-

In my example the array is mapped to a BRAM. BRAMs have up to 2 independent (read/write) ports (ug473), that means HLS can schedule a read and write for the same cycle.

Let us look at the lines:

t = buffer[buf_index] + 1;

buffer[buf_index] = t;


I think you are right, the two lines will probably have a latency of 1 or 2 cycles. However they can be fully pipelined with an initiation intervall (II) of 1. The timing could for a 2 cycle  look latency like:

--- 1 cycle ---

r_1 = buffer.read(0)

 

--- 2 cycle ---

r_2 = buffer.read(1)
t_2 = r_1 + 1

 

--- 3 cycle ---
r_3 = buffer.read(2)
t_3 = r_2 + 1

buffer.write(0, t_2)

 

--- 4 cycle ---

r_4 = buffer.read(3)
t_4 = r_3 + 1

buffer.write(1, t_3)

...

At the end it will wrap. So reading is always ahead of writing in the same clock cycle by 2 addresses. And as mentioned before the read and write to different addresses can be schedules, as we have two memory ports. When our buffer has the size of 10, there is no carry data dependency as long as the read is not ahead of the write by more than 9 clock cycles. However HLS is not detecting this, so we have to force him to just don't care and make sure that this is the case our self. (see the exp example abover for what happens when this is violated)

 

Maybe the HLS tool could be improved to analyze address dependent array depenencies. A similar analysis as it might be done for (-dim = 0), I haven't checked if the problem exists there. But I explicitely don't want to map my array to registers design, as the buffer has about 512 elements with 32 bits.

 

2-

I found the register in the section "User Defined Registers in C++" on page 253. Thanks for the pointer, that's much less destruvtive than my current solution. I will have a look at it in my next timing issue.

 

Regards,

Christian