07-10-2017 09:40 PM
I have a stream processing problem and I want to implement it as a dataflow pipeline.
The problem basically includes two stages. The first stage filters the input with a condition. For example,
it selects only the data that is larger than 10 for processing in the next stage. The second stage can start processing whenever the first stage produces data. However, the length of the output stream of the first stage is unknown.
Without a determined stream length, I don't know how to express the second-stage logic in a dataflow manner.
I still think the problem is well suited to pipelined stream processing, so any suggestions for implementing and optimizing it in SDAccel would be appreciated.
07-10-2017 11:05 PM
Hi @liucheng
If you opt to write your kernel in C/C++ form, you can do something like this:
hls::stream<uint32_dt> inStream;
hls::stream<uint32_dt> outStream;
// Useful in your case
hls::stream<uint1_dt> done_stream;

// Dataflow pragma comes here
read(in_buf, inStream);
compute(inStream, outStream, done_stream);
write(outStream, done_stream, out_buf);
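For reference, here is a rough sketch of how the top-level kernel could tie these together with the dataflow pragma. The kernel name (krnl_filter), the interface pragmas, and the extra size argument to read() are placeholder assumptions; adapt them to your design:

#include "ap_int.h"
#include "hls_stream.h"

typedef ap_uint<32> uint32_dt;
typedef ap_uint<1>  uint1_dt;

// The three dataflow stages, defined as sketched in this thread
void read(uint32_dt *in_buf, hls::stream<uint32_dt> &inStream, int size);
void compute(hls::stream<uint32_dt> &inStream, hls::stream<uint32_dt> &outStream,
             hls::stream<uint1_dt> &done_stream);
void write(hls::stream<uint32_dt> &outStream, hls::stream<uint1_dt> &done_stream,
           uint32_dt *out_buf);

extern "C" {
void krnl_filter(uint32_dt *in_buf, uint32_dt *out_buf, int size) {
#pragma HLS INTERFACE m_axi port=in_buf  offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=out_buf offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=in_buf  bundle=control
#pragma HLS INTERFACE s_axilite port=out_buf bundle=control
#pragma HLS INTERFACE s_axilite port=size    bundle=control
#pragma HLS INTERFACE s_axilite port=return  bundle=control

    hls::stream<uint32_dt> inStream("inStream");
    hls::stream<uint32_dt> outStream("outStream");
    hls::stream<uint1_dt>  done_stream("done_stream");

    // All three stages run concurrently as a dataflow pipeline
#pragma HLS DATAFLOW
    read(in_buf, inStream, size);
    compute(inStream, outStream, done_stream);
    write(outStream, done_stream, out_buf);
}
}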
Read Module :
In read module you just copy data from DDR to inStream
Compute Module :
In the compute module, consume the data from inStream and do your computation -- here the output may grow or shrink.
At the end of this unit, update the done stream:
void compute(hls::stream<uint32_dt> &inStream,
             hls::stream<uint32_dt> &outStream,
             hls::stream<uint1_dt> &done_compute)
{
    // Do processing using inStream and write the output into outStream
    // Refer to the Github examples on dataflow streams

    // At the end of your compute module, update the done_compute stream
    done_compute << (uint1_dt) 1;
}
Write Module :
In the write module you can handle it this way:
int value = 0;
// outStream.empty() : keep reading until the entire output stream is exhausted
// value             : checks for completion of the compute module
// The loop exits once the stream is empty and the compute-done flag has been seen
while (!(outStream.empty()) || value != 1) {
    int read_stream = outStream.read();
    // ... copy read_stream back to the output buffer (DDR) here ...
    if (!(done_stream.empty()))
        value = done_stream.read();
}
Refer to the following example for a basic dataflow/stream-based implementation:
Hope this is useful.
Thanks
Kali
07-10-2017 11:29 PM
Hi @liucheng
If you plan to write your kernel in C/C++ form using dataflow & streams, you can do it as below
Stream Declaration :
hls::stream<uint32_dt> inStream;
hls::stream<uint32_dt> outStream;
hls::stream<uint1_dt> doneStream;

read(in_buf, inStream);
compute(inStream, outStream, doneStream, size);
write(outStream, doneStream, out_buf);
Read Module :
Copy data from DDR to input stream
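A minimal sketch of what that read module could look like, assuming uint32_dt is an ap_uint<32> typedef and size is the number of input words; a simple pipelined loop over a sequential address range is what lets the tool infer an AXI burst read:

void read(uint32_dt *in_buf, hls::stream<uint32_dt> &inStream, int size) {
    // Sequential, pipelined accesses to the global memory pointer allow
    // the tool to infer a burst transfer from DDR.
    read_loop: for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE
        inStream << in_buf[i];
    }
}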
Compute Module :
void compute(hls::stream<uint32_dt> &inStream,
             hls::stream<uint32_dt> &outStream,
             hls::stream<uint1_dt> &doneStream,
             int size)
{
    // Do computation over the input data read from inStream
    // Write the output to outStream

    // Set the done stream to notify other modules that the compute module has finished execution
    doneStream << 1;
}
Write Module :
void write(hls::stream<uint32_dt> &outStream,
           hls::stream<uint1_dt> &doneStream,
           uint32_dt *outBuf)
{
    int value = 0;
    int idx = 0;
    // Run this loop until outStream is empty and the compute module has finished execution
    while (!(outStream.empty()) || value != 1) {
        // Read data from outStream and copy it back to DDR
        if (!(outStream.empty()))
            outBuf[idx++] = outStream.read();
        // Read doneStream to check if the compute module has finished execution
        if (!(doneStream.empty()))
            value = doneStream.read();
    }
}
Hope this helps
Thanks
Kali
07-14-2017 05:07 AM
Thanks for the suggestion.
I will try it for further optimization.
07-21-2017 08:53 PM - edited 07-21-2017 09:41 PM
I finally came back to the HLS version.
I got the baseline design working with a minor change to the while condition in the write process.
However, the performance is extremely bad. On the Alpha Data ADM-PCIE-7V3, the measured bandwidth is around
40 MB/s ((read + write) / execution time). By contrast, with similar code that has no conditional processing, say reading a stream, adding a constant, and writing it back to the output stream, the measured bandwidth is 1.4 GB/s. So the write process is essentially writing the data one element at a time; it does not burst at all. Otherwise, the bandwidth should be at least 700 MB/s.
To fix this problem, I plan to do batched writing. The basic idea is to buffer the output data and do a burst write when a specified number of elements has accumulated or when the writing process reaches its end. However, I have tried a number of ways to implement this logic and never succeeded.
//==========================================
// Here is the code that works but with bad performance
//==========================================
static void compute(
    hls::stream<int> &in_stream,
    hls::stream<char> &done_stream,
    hls::stream<int> &im_stream,
    int size)
{
    filtering: for (int i = 0; i < size; i++){
#pragma HLS pipeline
#pragma HLS LOOP_TRIPCOUNT min=16*1024 max=16*1024*1024
        int data = in_stream.read();
        if (data > 10){
            im_stream << data;
        }
        if (i == size - 1){
            done_stream << 1;
        }
    }
}
static void write(
    hls::stream<int> &im_stream,
    hls::stream<char> &done_stream,
    int *out
    )
{
    int idx = 0;
    while (!(im_stream.empty() && done_stream.read() == 1)){
#pragma HLS pipeline
        if (!(im_stream.empty())){
            int data = im_stream.read();
            out[idx] = data + 10;
            idx++;
        }
    }
}
//==========================================
// Here is the code where I tried batched writing
//==========================================
static void write(
    hls::stream<int> &im_stream,
    hls::stream<char> &done_stream,
    hls::stream<char> &duplicated_done_stream,
    int *out
    )
{
    int idx = 0;
    int counter = 0;
    int buffer[256];
    while (!(done_stream.read() == 1 && im_stream.empty())){
        if (!(im_stream.empty())){
            int im = im_stream.read();
            buffer[counter] = im + 10;
            counter++;
            if (counter == 256){
#pragma HLS pipeline
                for (int i = 0; i < 256; i++){
                    out[idx] = buffer[i];
                    idx++;
                }
                counter = 0;
            }
        }
    }
    char done;
    duplicated_done_stream.read_nb(done);
    if (done){
        // Process the last batch
#pragma HLS pipeline
        for (int i = counter; i < 256; i++){
            buffer[i] = -1;
        }
#pragma HLS pipeline
        for (int i = 0; i < 256; i++){
            out[idx] = buffer[i];
            idx++;
        }
    }
}
However, I get the following error in sw_emu mode, which I don't quite understand.
Not sure how to resolve it. Hope to get some suggestions on this as well.
ERROR: Kernel execution failed. Kernel exited with the code 11
ERROR: cu monitor died unexpectedly: cu monitor stopping while there are active cu events
terminate called without an active exception
ERROR: Kernel execution failed. Kernel exited with the code 6
../../..//utility/check.mk:59: recipe for target 'check_sw_emu_xilinx_adm-pcie-7v3_1ddr_3_0_check' failed
make: *** [check_sw_emu_xilinx_adm-pcie-7v3_1ddr_3_0_check] Error 1
07-23-2017 11:15 PM
Hi @liucheng
Please rewrite your write_module this way
// Declare BUFFER_SIZE as your batch size, batch size of 16 should infer burst writes
void write_module(unsigned int *output_buffer,
                  hls::stream<uint32_dt> &inStream,
                  hls::stream<uint1_dt> &compute_doneStream
                  ){
    unsigned int lcl_out_buffer[BUFFER_SIZE];
    unsigned int global_idx = 0;
    unsigned int count = 0;
    uint1_dt done_compute = 0;
    while (!(inStream.empty()) || (done_compute != 1)){
        lcl_out_buffer[count++] = inStream.read();
        if ((count == BUFFER_SIZE) || ((count != 0) && (count < BUFFER_SIZE))){
            write_module: for (int j = 0; j < count; j++){
#pragma HLS PIPELINE
                output_buffer[global_idx + j] = lcl_out_buffer[j];
            }
            global_idx += count;
            count = 0;
        }
        if (!(compute_doneStream.empty()))
            done_compute = compute_doneStream.read();
    } // End of while
}
Hope this helps
Thanks
Kali
07-25-2017 08:24 AM - edited 07-25-2017 09:13 PM
Hi, @kalib,
Thanks for your suggestions.
It functions correctly, but the performance is not improved.
Currently, I get only 22 MB/s bandwidth.
I think it may be caused by the if condition:
if((count == BUFFER_SIZE) || ((count != 0) && (count < BUFFER_SIZE)))
With this condition, the write happens before BUFFER_SIZE elements have been collected. The result is functionally correct, but it is not what we are looking for.
To fix this, I changed the code to the following.
if((count == BUFFER_SIZE) || ((count != 0) && (count < BUFFER_SIZE) && (!(compute_doneStream.empty()))))
So basically the second part of the if condition is only true for the last write burst.
Unfortunately, the design does not work as expected.
The last few data are not written back to memory.
07-26-2017 04:11 AM
Hi @liucheng ,
Good to know that your write module generates correct output!
You didn't mention whether burst transfers were also inferred for the write module. Have you checked the logs?
Kindly check the following before tweaking the working version further:
Please look into the dataflow section in our Github on-boarding example repository:
https://github.com/Xilinx/SDAccel_Examples/tree/master/getting_started/dataflow
Hope this helps,
Thanks
Kali
07-26-2017 10:46 PM
Hi, @kalib
Thanks for the suggestions.
Yes, I checked the log information.
The bad performance is caused by the fact that there is no burst data transmission in
either read_input() or write_back().
Basically, my design includes read_input()-->compute()-->write_back().
When I change the write_back() implementation while leaving the rest untouched, the performance gets
worse because of the write process. Either way, no burst is inferred with either if condition.
I checked the Github dataflow example. Yes, for a simple vector addition, bursts are inferred automatically.
I don't quite understand why the same read_input function here fails to be synthesized as a burst read.
I see there is an int16-style (wide data) optimization in the examples, but it is used in the OpenCL kernels, not in HLS with the stream data type.
07-27-2017 12:09 AM
Hi @liucheng ,
Liucheng : "The bad performance is caused by the fact that there is no burst data transmission in
either read_input() or write_back()."
Ideally, your read module should infer burst transfers if you follow any of our Github examples.
Please share your source code and log files if they are not confidential.
Thanks
Kali
07-27-2017 03:33 AM - edited 07-27-2017 04:15 AM
Hi, @kalib
Sorry that I made a mistake.
Yes, according to vivado_hls.log, bursts are inferred for both read_input() and write_back(),
and the II of the loops is 1. So I am completely lost and have no idea why the measured bandwidth is so low.
Here is the code as well as the vivado_hls.log and report file.
I keep getting the following errors when uploading the source code.
"The attachment's host.cpp content type (text/x-c++src) does not match its file extension and has been removed."
So I have them packed in a tar file.
Thank you.
07-28-2017 10:36 AM - edited 07-28-2017 11:08 AM
Hi Liucheng,
Once the value of done has been read out of done_stream, the stream becomes empty again.
So !(compute_doneStream.empty()) evaluates to 0 for the remaining iterations, and the last-batch branch is never taken.
That's the reason why the last few data are not written back to memory.
Try this code:
// 'done' holds the value previously read from done_stream, so the check
// keeps working even after the stream has been drained.
if ((count == BUFFER_SIZE) || ((count != 0) && (count < BUFFER_SIZE) && (done == 1))){
    for (int i = 0; i < count; i++){
#pragma HLS pipeline
        out[idx + i] = buffer[i];
    }
    idx += count;
    count = 0;
}
07-29-2017 08:18 AM
Hi, @seanz
Thanks for the suggestion. I have tested it and it works great.
The measured bandwidth on ADM-PCIE-7v3 is 260MB/s.
With all the help from the forum, we found that by changing the order of the code slightly (reading done_stream at the top of the while-loop body, before the burst-write condition), the design can achieve around 350 MB/s memory access bandwidth. Here is the code of the write_back function.
int idx = 0;
int count = 0;
uint1_dt done = 0;
int buffer[BUFFER_SIZE];

while (!(out_stream.empty()) || (done != 1)){
    buffer[count++] = out_stream.read();

    // Putting the done_stream.read() here produces better performance
    if (!(done_stream.empty())){
        done = done_stream.read();
    }

    if ((count == BUFFER_SIZE) || ((count > 0) && (count < BUFFER_SIZE) && done == 1)){
        for (int i = 0; i < count; i++){
#pragma HLS pipeline
            out[idx + i] = buffer[i];
        }
        idx += count;
        count = 0;
    }
}
In addition, we can improve the bandwidth utilization by duplicating the whole kernel. With four kernel instances, we get over 900 MB/s memory bandwidth. We are also trying to optimize the design by using a 512-bit data width, but there are still bugs. I will share the design when it works; the rough idea is sketched below.
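For reference, the rough direction for the 512-bit version is to read wide words from DDR and unpack them into the 32-bit compute stream in a separate dataflow stage. This is only a sketch of the idea under assumptions (uint512_dt as an ap_uint<512> typedef, size a multiple of 16), not the working design:

typedef ap_uint<512> uint512_dt;

// Read 512-bit words from DDR; a pipelined sequential loop should infer bursts.
static void read_input_wide(uint512_dt *in, hls::stream<uint512_dt> &wide_stream, int size) {
    wide_read: for (int i = 0; i < size / 16; i++) {
#pragma HLS PIPELINE
        wide_stream << in[i];
    }
}

// Unpack each 512-bit word into sixteen 32-bit elements for the compute stage.
static void unpack(hls::stream<uint512_dt> &wide_stream, hls::stream<int> &in_stream, int size) {
    unpack_outer: for (int i = 0; i < size / 16; i++) {
        uint512_dt word = wide_stream.read();
        unpack_inner: for (int j = 0; j < 16; j++) {
#pragma HLS PIPELINE
            ap_uint<32> v = word.range(32 * j + 31, 32 * j);
            in_stream << v.to_int();
        }
    }
}

The write side would do the mirror image: pack sixteen results into a 512-bit word before bursting it out to DDR.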
Regards,
Cheng Liu