Explorer
Registered: 06-17-2012

pipe with undetermined length

I have a stream processing problem that I want to implement as a dataflow pipeline.

The problem has two stages. The first stage filters the input by a condition; for example, it passes only the data larger than 10 on to the next stage. The second stage can start processing as soon as the first stage produces data. However, the length of the first stage's output stream is not known in advance.

Without a known stream length, I don't know how to write the second stage logic in a dataflow pipeline style.

I think the problem is well suited to pipelined stream processing. Are there any suggestions for optimizing such a problem in SDAccel?

 

Xilinx Employee
Registered: 01-12-2017

Re: pipe with undetermined length

Hi @liucheng

 

If you opt to write your kernel in C/C++, you can do something like this:

 

 

hls::stream<uint32_dt> inStream;
hls::stream<uint32_dt> outStream;
 
// Useful in your case
hls::stream<uint1_dt> done_stream;
 
// Dataflow pragma comes here
read(in_buf,inStream);
compute(inStream, outStream, done_stream);
write(outStream, done_stream, out_buf);

 

Read Module:

In the read module, you just copy data from DDR to inStream.

Compute Module:

In the compute module, consume the data from inStream and do your computation. The output may grow or shrink here. At the end of this unit, write to the done stream to signal completion:

 

void compute(hls::stream<uint32_dt> &inStream,
                       hls::stream<uint32_dt> &outStream,
                       hls::stream<uint1_dt>  &done_compute
                     ) {

       // Do processing using inStream and write output into outStream
       // Refer to the GitHub examples on dataflow streams

       // At the end of your compute module, update the done_compute stream
       done_compute << (uint1_dt) 1;

}

 

Write Module:

In the write module, you can handle it this way:

int value = 0;

// Keep looping while outStream still holds data or the compute module has not signalled completion.
// value : set to 1 once the done token from the compute module has been read.
// The loop exits only when outStream is empty and value == 1.
while (!(outStream.empty()) || value != 1) {
    // Read only when data is available; a blocking read on an empty stream could stall the loop
    if (!(outStream.empty())) {
        int read_stream = outStream.read();
    }
    if (!(doneStream.empty()))
        value = doneStream.read();
}
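
The done-token handshake above can be tried out on a desktop compiler before moving to HLS. The following is only a minimal sketch in plain C++: `std::queue` stands in for `hls::stream` (it is unbounded and never blocks, unlike a hardware FIFO), and the names `Fifos`, `compute_model`, and `write_model` are made up for this illustration.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Plain C++ stand-ins for the two FIFOs; std::queue is unbounded,
// unlike a real hls::stream.
struct Fifos {
    std::queue<int> data;  // plays the role of outStream
    std::queue<int> done;  // plays the role of the done stream
};

// Compute stage: forward values larger than `threshold`, then post one done token.
void compute_model(const std::vector<int> &in, int threshold, Fifos &f) {
    for (int v : in) {
        if (v > threshold) f.data.push(v);
    }
    f.done.push(1);  // sent only after the last data word
}

// Write stage: drain until the done token has been seen AND no data is left.
std::vector<int> write_model(Fifos &f) {
    std::vector<int> out;
    int done = 0;
    while (!f.data.empty() || done != 1) {
        if (!f.data.empty()) {  // never read an empty FIFO
            out.push_back(f.data.front());
            f.data.pop();
        }
        if (!f.done.empty()) {  // pick up the done token when it shows up
            done = f.done.front();
            f.done.pop();
        }
    }
    return out;
}
```

Because the compute stage posts the token only after its last data word, every word is already queued by the time the writer sees `done == 1`, so the loop cannot exit early. In this sequential model `compute_model` has to run before `write_model`; in hardware the two stages run concurrently.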

 

Refer to the following example for a basic dataflow/stream-based implementation:

 

https://github.com/Xilinx/SDAccel_Examples/blob/master/getting_started/dataflow/dataflow_stream_c/src/adder.cpp

 

Hope this is useful.

 

Thanks 

Kali

 

Observer kaliuday
Registered: 04-02-2015

Re: pipe with undetermined length

Hi @liucheng

 

If you plan to write your kernel in C/C++ using dataflow and streams, you can do it as below.

Stream Declaration:

 

 

hls::stream<uint32_dt> inStream;
hls::stream<uint32_dt> outStream;
hls::stream<uint1_dt>   doneStream;

read(in_buf, inStream);
compute(inStream, outStream, doneStream, size);
write(outStream, doneStream, out_buf);

 

Read Module:

Copy data from DDR to the input stream.

Compute Module:

 

void compute(inStream, outStream, doneStream, size) {
      // Do computation over the input data read from inStream
      // Write output to outStream

      // Set the done stream to notify other modules that the compute module has finished execution
      doneStream << 1;
}

Write Module:

 

void write(outStream, doneStream, outBuf) {
      int value = 0;
      // Run this loop until outStream is empty and the compute module has finished execution
      while(!(outStream.empty()) || value != 1) {
           // Read data from outStream
           // Copy back to DDR

           // Read doneStream to check whether the compute module has finished execution
           if(!(doneStream.empty()))
                value = doneStream.read();
      }
}

Hope this helps

 

Thanks

Kali

 

Explorer
Registered: 06-17-2012

Re: pipe with undetermined length

Thanks for the suggestion.

I will try it for further optimization.

 

Explorer
Registered: 06-17-2012

Re: pipe with undetermined length

I finally came back to the HLS version.

I got the baseline design working with a minor change to the while condition in the write process.

However, the performance is extremely bad. On the Alpha Data ADM-PCIE-7v3, the measured bandwidth is around 40MB/s, computed as (read + write)/execution time. By comparison, similar code without the conditional processing (read in a stream, add a constant, and write it back to the out stream) measures 1.4GB/s. So the write process is essentially writing the data one by one, with no burst at all. Otherwise, the bandwidth should be at least 700MB/s.
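
For reference, the bandwidth metric quoted above, (read + write)/execution time, can be written as a small helper. This is only a sketch: the function name is made up, 1 MB is taken as 10^6 bytes, and the byte counts and time used to exercise it would be placeholders, not measurements from this design.

```cpp
#include <cassert>
#include <cstdint>

// Effective bandwidth in MB/s (1 MB = 10^6 bytes):
// total bytes moved to and from global memory, divided by the elapsed time.
double bandwidth_mb_per_s(std::uint64_t bytes_read,
                          std::uint64_t bytes_written,
                          double seconds) {
    return static_cast<double>(bytes_read + bytes_written) / seconds / 1.0e6;
}
```

With a filter in the pipeline, the number of bytes written depends on how many elements pass the condition, so the write count has to come from the kernel's actual output rather than from the input size.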

 

To fix this problem, I plan to do batch writing. The basic idea is to buffer the output data and do a burst write when the buffer holds a specified number of elements or the writing process reaches its end. However, I tried a number of ways to implement the logic but never succeeded.

//==========================================

// Here is the code that works with bad performance

//==========================================

static void compute(
    hls::stream<int> &in_stream,
    hls::stream<char> &done_stream,
    hls::stream<int> &im_stream,
    int size)
{
    filtering: for (int i = 0; i < size; i++){
    #pragma HLS pipeline
    #pragma HLS LOOP_TRIPCOUNT min=16*1024 max=16*1024*1024
        int data = in_stream.read();
        if(data > 10){
            im_stream << data;
        }

        if(i == size - 1){
            done_stream << 1;
        }
    }
}

static void write(
    hls::stream<int> &im_stream,
    hls::stream<char> &done_stream,
    int *out
)
{
    int idx = 0;
    while(!(im_stream.empty() && done_stream.read() == 1)){
    #pragma HLS pipeline
        if(!(im_stream.empty())){
            int data = im_stream.read();
            out[idx] = data + 10;
            idx++;
        }
    }
}

 

//===================================

// Here is the code that I tried to batch writing

//====================================

static void write(
    hls::stream<int> &im_stream,
    hls::stream<char> &done_stream,
    hls::stream<char> &duplicated_done_stream,
    int *out
)
{
    int idx = 0;
    int counter = 0;
    int buffer[256];

    while(!(done_stream.read() == 1 && im_stream.empty())){
        if(!(im_stream.empty())){
            int im = im_stream.read();
            buffer[counter] = im + 10;
            counter++;
            if(counter == 256){
#pragma HLS pipeline
                for(int i = 0; i < 256; i++){
                    out[idx] = buffer[i];
                    idx++;
                }
                counter = 0;
            }
        }
    }

    char done;
    duplicated_done_stream.read_nb(done);
    if(done){
        // Process the last batch

#pragma HLS pipeline
        for(int i = counter; i < 256; i++){
            buffer[i] = -1;
        }

#pragma HLS pipeline
        for(int i = 0; i < 256; i++){
            out[idx] = buffer[i];
            idx++;
        }
    }
}

 

However, I get the following error in sw_emu mode, which I don't quite understand.

I am not sure how to resolve it and hope to get some suggestions on this as well.

 

ERROR: Kernel execution failed. Kernel exited with the code 11
ERROR: cu monitor died unexpectedly: cu monitor stopping while there are active cu events
terminate called without an active exception
ERROR: Kernel execution failed. Kernel exited with the code 6
../../..//utility/check.mk:59: recipe for target 'check_sw_emu_xilinx_adm-pcie-7v3_1ddr_3_0_check' failed
make: *** [check_sw_emu_xilinx_adm-pcie-7v3_1ddr_3_0_check] Error 1

Xilinx Employee
Registered: 01-12-2017

Re: pipe with undetermined length

Hi @liucheng

 

Please rewrite your write_module this way:

// Declare BUFFER_SIZE as your batch size; a batch size of 16 should infer burst writes

void write_module(unsigned int *output_buffer,
                  hls::stream<uint32_dt> &inStream,
                  hls::stream<uint1_dt> &compute_doneStream
){
    unsigned int lcl_out_buffer[BUFFER_SIZE];
    unsigned int global_idx = 0;
    unsigned int count = 0;
    uint1_dt done_compute = 0;

    while(!(inStream.empty()) || (done_compute != 1) ){
        lcl_out_buffer[count++] = inStream.read();

        if((count == BUFFER_SIZE) || ((count != 0) && (count < BUFFER_SIZE))) {
            write_module: for(int j = 0; j < count; j++){
            #pragma HLS PIPELINE
                output_buffer[global_idx + j] = lcl_out_buffer[j];
            }
            global_idx += count;
            count = 0;
        }

        if(!(compute_doneStream.empty()))
            done_compute = compute_doneStream.read();
    } // End of while
}

Hope this helps

 

Thanks

Kali

 

Explorer
Registered: 06-17-2012

Re: pipe with undetermined length

Hi, @kalib

 

Thanks for your suggestions.

It functions correctly, but the performance has not improved.

Currently, I get only 22MB/s of bandwidth.

I think this may be caused by the if condition:

if((count == BUFFER_SIZE) || ((count != 0) && (count < BUFFER_SIZE))) 

The write process happens before it collects BUFFER_SIZE elements. The result makes sense, but it is not what we are looking for.
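
The effect can be checked with a few lines of plain C++. The sketch below only counts how often the quoted condition fires; `count_flushes` is a made-up helper, and `buffer_size` stands in for BUFFER_SIZE.

```cpp
#include <cassert>

// Counts how many separate flushes (write bursts) are issued for `n` words
// when the flush test is the condition quoted above.
int count_flushes(int n, int buffer_size) {
    int count = 0;
    int flushes = 0;
    for (int i = 0; i < n; i++) {
        count++;  // one word stored into the local buffer
        if ((count == buffer_size) ||
            ((count != 0) && (count < buffer_size))) {
            flushes++;  // a burst of `count` words would be written here
            count = 0;
        }
    }
    return flushes;
}
```

For any batch size larger than 1, the second half of the condition is already true when count is 1, so every word is flushed on its own and the buffer never fills.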

 

To fix this, I changed the condition to the following:

if((count == BUFFER_SIZE) || ((count != 0) && (count < BUFFER_SIZE) && (!(compute_doneStream.empty()))))

So the second part of the if condition should be true only for the last write burst.

Unfortunately, the design does not work as expected: the last few data are not written back to memory.

 

 

Xilinx Employee
Registered: 01-12-2017

Re: pipe with undetermined length

Hi @liucheng ,

 

Good to know that your write module generates correct output!

You didn't mention whether burst transfers are also inferred for the write module. Have you checked the logs?

Kindly check the following items before tweaking the working version further:

  • The read, compute, and write modules must each achieve II = 1.
  • Burst transfers must be inferred between the kernel and global memory (read/write modules).
  • Try to find the right combination of stream depth and local buffer sizes for the read/write modules.
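
As a rough illustration of the second and third items, a read module can fetch global memory in fixed-size chunks through a local buffer. The sketch below is a plain C++ model, not HLS code: `std::queue` stands in for `hls::stream`, and `CHUNK` and `read_model` are hypothetical names that do not come from the examples repository. Whether a burst is actually inferred has to be confirmed in the tool's log.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <vector>

// CHUNK is a hypothetical local-buffer size, not a value from this thread.
constexpr int CHUNK = 16;

// Read stage: copy `size` words from `in` into the stream, up to CHUNK at a time.
void read_model(const int *in, int size, std::queue<int> &in_stream) {
    int local[CHUNK];
    for (int base = 0; base < size; base += CHUNK) {
        int len = std::min(CHUNK, size - base);
        // Sequential, unconditional accesses to `in`: the kind of loop an
        // HLS build would pipeline and that tools can usually turn into a burst.
        for (int j = 0; j < len; j++) local[j] = in[base + j];
        // Forward the chunk to the next stage in the original order.
        for (int j = 0; j < len; j++) in_stream.push(local[j]);
    }
}
```

The point of the local buffer is that the memory-facing loop contains nothing but consecutive reads; any data-dependent logic stays on the stream side.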

 

Please look into the following section of our GitHub on-boarding example repository:

https://github.com/Xilinx/SDAccel_Examples/tree/master/getting_started/dataflow

 

Hope this helps,

 

Thanks

Kali

Explorer
Registered: 06-17-2012

Re: pipe with undetermined length

Hi, @kalib

 

Thanks for the suggestions.

Yes, I checked the log information.

The bad performance is caused by the fact that there is no burst data transmission in either read_input() or write_back().

My design consists of read_input()-->compute()-->write_back(). When I change the write_back implementation while leaving the rest untouched, the performance gets worse because of the write process. In any case, no burst is inferred with either if condition.

I checked the GitHub dataflow example. Yes, for a simple vector addition, burst is inferred automatically. I don't quite understand why the same read_input function fails to be synthesized as a burst read here.

I see there is an int16-style optimization in the examples, but it is used in OpenCL, not in HLS with the stream data type.

 

Xilinx Employee
Registered: 01-12-2017

Re: pipe with undetermined length

Hi @liucheng ,

 

Liucheng: "The bad performance is caused by the fact that there is no burst data transmission in either read_input() or write_back()."

Ideally, your read module should infer burst transfers if you follow any of our GitHub examples.

Please share your source code and log files if they are not confidential.

 

Thanks

Kali

Explorer
Registered: 06-17-2012

Re: pipe with undetermined length

Hi, @kalib

 

Sorry, I made a mistake.

Yes, both read_input() and write_back() have burst inferred according to vivado_hls.log, and the II of every loop is 1. So I am completely lost and have no idea why the measured bandwidth is low.

Here is the code, along with vivado_hls.log and the report file.

I keep getting the following error when uploading the source code:

"The attachment's host.cpp content type (text/x-c++src) does not match its file extension and has been removed."

So I have packed them in a tar file.

 

Thank you.

Moderator (accepted solution)
Registered: 03-27-2012

Re: pipe with undetermined length

Hi Liucheng,

 

Once the done value has been read out of done_stream, done_stream is empty again, so !(done_stream.empty()) falls back to 0.

That is why the last few data are not written back to memory.

Try this code, which tests the stored done flag instead:

if((count == BUFFER_SIZE) || ((count != 0) && (count < BUFFER_SIZE) && (done==1))){
    for(int i = 0; i < count; i++){
    #pragma HLS pipeline
        out[idx + i] = buffer[i];
    }
    idx += count;
    count = 0;
}
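
The difference between testing the stream and testing the stored flag can be reproduced in a small plain C++ model. This is a sketch under stated assumptions: `std::queue` replaces `hls::stream`, data reads are guarded so the model never reads an empty queue, the done token is assumed to become visible while `lead` words are still unread (compute running ahead of the writer), and `words_written` is a made-up name.

```cpp
#include <cassert>
#include <queue>

// Returns how many of `n` words reach memory with a batch size of `buffer_size`.
// Model assumption: the done token becomes visible while `lead` (>= 0) words
// are still unread, because compute runs ahead of the writer.
int words_written(int n, int buffer_size, int lead, bool latch_done) {
    std::queue<int> data, done_q;
    for (int i = 0; i < n; i++) data.push(i);

    int count = 0, done = 0, written = 0;
    bool token_sent = false;
    while (!data.empty() || done != 1) {
        if (!token_sent && static_cast<int>(data.size()) <= lead) {
            done_q.push(1);  // compute posts its single done token
            token_sent = true;
        }
        if (!data.empty()) {  // buffer one word
            data.pop();
            count++;
        }
        bool token_waiting = !done_q.empty();  // what empty() reports right now
        if (token_waiting) {                   // consume the token, keep its value
            done = done_q.front();
            done_q.pop();
        }
        // Stale test: true only in the one iteration that sees the token.
        // Latched test: stays true for every later iteration.
        bool final_flush = latch_done ? (done == 1) : token_waiting;
        if (count == buffer_size ||
            (count > 0 && count < buffer_size && final_flush)) {
            written += count;  // a burst of `count` words would be issued here
            count = 0;
        }
    }
    return written;
}
```

With 10 words, a batch size of 4, and the token arriving 3 words early, the stream test leaves the last two words in the buffer, while the stored flag flushes all 10. In this model the flag variant flushes small batches once the flag is set, so it only speaks to correctness, not to burst length.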

 

Explorer
Registered: 06-17-2012

Re: pipe with undetermined length

Hi, @seanz

 

Thanks for the suggestion. I have tested it and it works great.

The measured bandwidth on the ADM-PCIE-7v3 is 260MB/s.

With all the help from the forum, we found that changing the order of the code slightly (putting the code that reads done ahead of the flush condition inside the while loop) lets the design achieve around 350MB/s of memory access bandwidth. Here is the code of the write_back function.

    int idx = 0;
    int count = 0;
    uint1_dt done = 0;
    int buffer[BUFFER_SIZE];

    while(!(out_stream.empty()) || (done != 1)){
        buffer[count++] = out_stream.read();

       // Putting the done_stream.read() here produces better performance 
        if(!(done_stream.empty())){
            done = done_stream.read();
        }

        if((count == BUFFER_SIZE) || ((count > 0) && (count < BUFFER_SIZE) && done == 1)){
            for(int i = 0; i < count; i++){
            #pragma HLS pipeline
                out[idx + i] = buffer[i];
            }
            idx += count;
            count = 0;
        }

    }

In addition, we can improve bandwidth utilization by duplicating the whole kernel. With four kernels, we get over 900MB/s of memory bandwidth. We are also trying to optimize the design by using a 512-bit data width, but there are still bugs. I will share the design when it works.
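
One way to feed several kernel instances is to give each a contiguous slice of the input. The sketch below shows only that index arithmetic in plain C++; `Range` and `split_ranges` are hypothetical names, and the host code that launches the kernels is not shown.

```cpp
#include <cassert>
#include <vector>

// One kernel instance's share of the input.
struct Range {
    int offset;  // index of the first element
    int length;  // number of elements
};

// Split `n` elements into `k` contiguous ranges whose lengths differ by at most one.
std::vector<Range> split_ranges(int n, int k) {
    std::vector<Range> parts;
    int offset = 0;
    for (int i = 0; i < k; i++) {
        // The first n % k ranges take one extra element.
        int length = n / k + (i < n % k ? 1 : 0);
        parts.push_back({offset, length});
        offset += length;
    }
    return parts;
}
```

Because the filter output length is not known in advance, each instance would also need its own output region sized for the worst case and some way to report how many elements it wrote. That bookkeeping is an assumption of this sketch and is not described in the thread.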

 

Regards,

Cheng Liu
