07-31-2017 06:40 PM - edited 07-31-2017 06:43 PM
I have a stream design that works fine.
To further improve bandwidth utilization, I tried to
extend the design to a wider data width. The extended design
works in sw_emu mode, but fails to complete in hw_emu and hw modes;
it appears to be stuck in an infinite loop.
In hw_emu mode, the report shows that the kernel fails to get any data from gmem, and I got the following
information. I also checked the timeline file, but it is empty.
INFO: [SDx-EM 22] [Wall clock time: 00:21, Emulation time: 12.12 ms] Data transfer between kernel(s) and global memory(s)
BANK0 RD = 0.000 KB WR = 0.000 KB
The stream design consists of three functions: read_input(), compute(), and write_back(). Both read_input() and write_back() have been used in a similar design. I would really appreciate any suggestions on this problem.
The kernel code is attached here.
//Includes
#include <hls_stream.h>
#include <ap_int.h>
#include <stdio.h>

typedef ap_uint<1> uint1_dt;
typedef ap_int<512> int512_dt;

#define BUFFER_SIZE 128
#define WORD_NUM 16 //# of integer in a wide data

static void read_input(
        int512_dt *in,
        hls::stream<int512_dt> &in_stream,
        int len)
{
    seq_depth: for (int i = 0; i < len; i++){
#pragma HLS pipeline
        in_stream << in[i];
    }
}

static void compute(
        hls::stream<int512_dt> &in_stream,
        hls::stream<uint1_dt> &done_stream,
        hls::stream<int> &out_stream,
        int len)
{
    int512_dt data;
    int item;
    compute: for (int i = 0; i < len; i++){
#pragma HLS pipeline
        data = in_stream.read();
        for(int j = 0; j < WORD_NUM; j++){
            item = data.range((j+1)*32-1, j*32);
            if(item > 10){
                out_stream << (item + 10);
            }
            if((i == len - 1) && (j == WORD_NUM -1)){
                done_stream << 1;
            }
        }
    }
}

static void write_back(
        hls::stream<int> &out_stream,
        hls::stream<uint1_dt> &done_stream,
        int *out
        )
{
    int idx = 0;
    int count = 0;
    uint1_dt done = 0;
    uint1_dt done_empty = 0;
    uint1_dt stream_empty = 0;
    int buffer[BUFFER_SIZE];

    while((stream_empty != 1) || (done != 1)){
        stream_empty = out_stream.empty();
        done_empty = done_stream.empty();

        if(stream_empty != 1){
            buffer[count++] = out_stream.read();
        }

        if(done_empty != 1){
            done = done_stream.read();
        }

        if((count == BUFFER_SIZE) || ((count > 0) && (count < BUFFER_SIZE) && (done == 1))){
            for(int i = 0; i < count; i++){
#pragma HLS pipeline
                out[idx + i] = buffer[i];
            }
            idx += count;
            count = 0;
        }
    }
}

extern "C" {
void cnd_stream(int512_dt *in, int *out, int size){
#pragma HLS INTERFACE m_axi port=in offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=in bundle=control
#pragma HLS INTERFACE s_axilite port=out bundle=control
#pragma HLS INTERFACE s_axilite port=len bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    hls::stream<int512_dt> in_stream;
    hls::stream<int> out_stream;
    hls::stream<uint1_dt> done_stream;
    int len = size / WORD_NUM;

#pragma HLS STREAM variable=in_stream depth=16
#pragma HLS STREAM variable=out_stream depth=64
#pragma HLS STREAM variable=done_stream depth=16

#pragma HLS dataflow
    //dataflow pragma instruct compiler to run following three APIs in parallel
    read_input(in, in_stream, len);
    compute(in_stream, done_stream, out_stream, len);
    write_back(out_stream, done_stream, out);
}
}
08-01-2017 09:49 PM
Although using arguments of different bit widths is common in software design, it currently does not work in hardware if those arguments are bundled into the same AXI master interface. In this case, port in (ap_int<512>) and port out (32-bit int) are bundled together. As a result, neither port works: no data is read in and hw_emu stalls.
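A minimal sketch of that fix, assuming the rest of the kernel stays unchanged: each m_axi port gets its own bundle, so the 512-bit input and the 32-bit output no longer share one AXI master. The bundle names gmem0 and gmem1 are arbitrary labels chosen for this example. To keep the snippet buildable without the Xilinx headers, word512 is a plain C++ stand-in for ap_int<512> and the body is a simple loop rather than the dataflow pipeline from the post; a standard compiler just ignores the HLS pragmas. The s_axilite pragma for the scalar uses port=size here because that is the argument's name in the function signature.

```cpp
#include <cstdint>

#define WORD_NUM 16  // 32-bit integers per 512-bit word

// Stand-in for ap_int<512> (assumption: lets the sketch build without ap_int.h).
struct word512 {
    int32_t w[WORD_NUM];
};

extern "C" {
void cnd_stream(const word512 *in, int *out, int size) {
// One AXI master per data width: 'in' on gmem0, 'out' on gmem1.
#pragma HLS INTERFACE m_axi port=in  offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=in     bundle=control
#pragma HLS INTERFACE s_axilite port=out    bundle=control
#pragma HLS INTERFACE s_axilite port=size   bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    int len = size / WORD_NUM;  // number of 512-bit words
    int idx = 0;                // next free slot in 'out'

    // Simplified body: same filter as compute(), without streams or dataflow.
    for (int i = 0; i < len; i++) {
        for (int j = 0; j < WORD_NUM; j++) {
            int item = in[i].w[j];
            if (item > 10) {
                out[idx++] = item + 10;
            }
        }
    }
}
}
```

Only the two m_axi lines differ from the posted pragmas in a way that matters for the stall; the host code does not need to change, since the kernel arguments keep the same order and types.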
Another problem is the stream depth. With the wider data width, the read_input module reads far more data per call than before, so an inadequate stream depth will also stall the function once the stream is full.
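To illustrate the depth point, here is a toy cycle-level model in plain C++. It is not an HLS simulation: the function producer_stall_cycles and its parameters are hypothetical, and it simply assumes a producer that tries to push one word per cycle into a bounded FIFO while a consumer pops one word every consume_period cycles. It counts the cycles in which the producer is blocked because the FIFO is full.

```cpp
// Toy model (assumption): not how HLS schedules the kernel, only a way to
// see how FIFO depth and consumer rate affect producer back-pressure.
// Returns the number of cycles in which the producer could not push.
int producer_stall_cycles(int depth, int words, int consume_period) {
    int occupancy = 0;  // words currently held in the FIFO
    int pushed = 0;     // words the producer has written so far
    int stalls = 0;     // cycles the producer spent blocked

    for (int cycle = 0; pushed < words; cycle++) {
        // Consumer side: drain one word on its active cycles.
        if (occupancy > 0 && cycle % consume_period == 0) {
            occupancy--;
        }
        // Producer side: push if there is room, otherwise stall.
        if (occupancy < depth) {
            occupancy++;
            pushed++;
        } else {
            stalls++;
        }
    }
    return stalls;
}
```

With a consumer that keeps up (consume_period of 1) the producer never stalls, while a slower consumer makes a shallow FIFO fill up quickly. The numbers only show the trend; actual depths should be chosen from the HLS reports and hw_emu results for the real design.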
07-31-2017 11:40 PM
Hi Liucheng,
Can you also attach the host code and makefile?
Regards,
Sean
08-01-2017 01:51 AM
Hi @seanz,
Thanks for the reply.
Here are the host code and makefile.
//host code
#include <iostream>
#include <cstring>
#include <cstdlib>

//OpenCL utility layer include
#include "xcl.h"

#define DATA_SIZE (16*1024*1024)
#define INC 10

int main(int argc, char** argv)
{
    //Allocate Memory in Host Memory
    size_t vector_size_bytes = sizeof(int) * DATA_SIZE;

    int *source_input = (int *) malloc(vector_size_bytes);
    int *source_hw_results = (int *) malloc(vector_size_bytes);
    int *source_sw_results = (int *) malloc(vector_size_bytes);

    // Create the test data and Software Result
    for(int i = 0 ; i < DATA_SIZE ; i++){
        source_input[i] = rand()%100;
        source_sw_results[i] = -1;
        source_hw_results[i] = -1;
    }

    int idx = 0;
    for(int i = 0; i < DATA_SIZE; i++){
        if(source_input[i] > INC){
            source_sw_results[idx] = source_input[i] + INC;
            idx++;
        }
    }
    std::cout << "# of write back: " << idx << std::endl;

    //OPENCL HOST CODE AREA START
    //Create Program and Kernel
    xcl_world world = xcl_world_single();
    cl_program program = xcl_import_binary(world, "cnd_stream");
    cl_kernel krnl_cnd_stream = xcl_get_kernel(program, "cnd_stream");

    //Allocate Buffer in Global Memory
    cl_mem buffer_input = xcl_malloc(world, CL_MEM_READ_ONLY, vector_size_bytes);
    cl_mem buffer_output = xcl_malloc(world, CL_MEM_READ_WRITE, vector_size_bytes);

    //Copy input data to device global memory
    xcl_memcpy_to_device(world,buffer_input,source_input,vector_size_bytes);
    xcl_memcpy_to_device(world,buffer_output,source_hw_results,vector_size_bytes);

    int size = DATA_SIZE;

    //Set the Kernel Arguments
    xcl_set_kernel_arg(krnl_cnd_stream,0,sizeof(cl_mem),&buffer_input);
    xcl_set_kernel_arg(krnl_cnd_stream,1,sizeof(cl_mem),&buffer_output);
    xcl_set_kernel_arg(krnl_cnd_stream,2,sizeof(int),&size);

    std::cout << "start launching the program." << std::endl;

    //Launch the Kernel
    unsigned long duration = xcl_run_kernel3d(world,krnl_cnd_stream,1,1,1);

    //Copy Result from Device Global Memory to Host Local Memory
    xcl_memcpy_from_device(world, source_hw_results, buffer_output,vector_size_bytes);
    clFinish(world.command_queue);

    double bandwidth = ((DATA_SIZE + idx) * sizeof(int) / 1024.0 / 1024.0) / (duration * 1.0 / 1000000000);
    std::cout << "Measured bandwidth is " << bandwidth << " MB/s" << std::endl;

    //Release Device Memories and Kernels
    clReleaseMemObject(buffer_input);
    clReleaseMemObject(buffer_output);
    clReleaseKernel(krnl_cnd_stream);
    clReleaseProgram(program);
    xcl_release_world(world);
    //OPENCL HOST CODE AREA END

    // Compare the results of the Device to the simulation
    int match = 0;
    for (int i = 0 ; i < DATA_SIZE ; i++){
        if (source_hw_results[i] != source_sw_results[i]){
            std::cout << "Error: Result mismatch" << std::endl;
            std::cout << "i = " << i << " CPU result = " << source_sw_results[i]
                      << " Device result = " << source_hw_results[i] << std::endl;
            match = 1;
            break;
        }
    }

    /* Release Memory from Host Memory*/
    free(source_input);
    free(source_hw_results);
    free(source_sw_results);

    if (match){
        std::cout << "TEST FAILED." << std::endl;
        return EXIT_FAILURE;
    }
    std::cout << "TEST PASSED." << std::endl;
    return EXIT_SUCCESS;
}
Here is the Makefile
COMMON_REPO := ../../../

include $(COMMON_REPO)/utility/boards.mk
include $(COMMON_REPO)/libs/xcl/xcl.mk
include $(COMMON_REPO)/libs/opencl/opencl.mk

# Host Application
host_SRCS=./src/host.cpp $(xcl_SRCS)
host_HDRS=$(xcl_HDRS)
host_CXXFLAGS=-I./src/ $(xcl_CXXFLAGS) $(opencl_CXXFLAGS) --debug
host_LDFLAGS=$(opencl_LDFLAGS)

EXES=host

# Kernel
cnd_stream_SRCS=./src/cnd_stream.cpp
cnd_stream_CLFLAGS= --kernel cnd_stream

XOS=cnd_stream

# xclbin
cnd_stream_XOS=cnd_stream

XCLBINS=cnd_stream

# check
check_EXE=host
check_XCLBINS=cnd_stream

DEVICES=xilinx:adm-pcie-7v3:1ddr:3.0
TARGETS=sw_emu

CHECKS=check

include $(COMMON_REPO)/utility/rules.mk
Regards,
Cheng Liu
08-01-2017 10:22 PM
Hi @seanz,
Thank you very much for the help.
Yes, the problem was caused by arguments of different data widths being bundled into the same AXI master.
After bundling the two arguments into separate AXI masters, the design works as expected.
I will also be more careful when setting the stream FIFO depth. In this design, the current setup
works just fine.
Regards,
Cheng Liu