Clarification on High Performance Matrix Multiplication Example

Newbie
Posts: 1
Registered: ‎04-15-2018


As a newbie to SDAccel, I was hoping to clarify some parts of the High Performance Matrix Multiplication example that I was browsing through. Specifically, I'd appreciate elaboration on sections of https://github.com/Xilinx/SDAccel_Examples/blob/51416734fd694773a2ab4991f027e5c78e09c9a8/acceleration/high_perf_mat_mult/src/high_perf_mat_mult.cpp.


I wasn't too sure how the parallel_rows, parallel_columns, and depth variables factor into the overall multiplication process, or how they enable optimization. In addition, it seems like the file I linked to is the testbench for kernelSgemm_0, which would contain the actual CL code. Is that true? And if so, how can I access the CL-specific code?


Thanks,

Peter

Xilinx Employee
Posts: 107
Registered: ‎09-08-2011

Re: Clarification on High Performance Matrix Multiplication Example

Hi boydun,

The host code in this example both calls the kernel and acts as a bit of a test bench to compare the results.

So the kernel is created (loaded):

xcl_world world = xcl_world_single();
cl_program program = xcl_import_binary(world, "high_perf_mat_mult0"); // compute programs
cl_kernel kernelSgemm_0 = xcl_get_kernel(program, "kernelSgemm_0"); // compute kernel

Then, based on the inputs to the program, memory in the host is set up for the matrices, and that memory is filled with values.
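A minimal sketch of that setup (the h_a/h_b/h_d/h_c names and sizes match the buffers created below; the fill pattern, and plain malloc rather than the aligned allocation the real code likely uses, are assumptions):

short *h_a = (short *)malloc(sizeof(short) * num_of_rows * depth);        // input matrix A
short *h_b = (short *)malloc(sizeof(short) * (num_of_cols/2) * depth);    // first half of matrix B
short *h_d = (short *)malloc(sizeof(short) * (num_of_cols/2) * depth);    // second half of matrix B
short *h_c = (short *)malloc(sizeof(short) * num_of_rows * num_of_cols);  // result matrix C

for (int i = 0; i < num_of_rows * depth; i++)
    h_a[i] = (short)(rand() % 10);           // assumed fill pattern: small test values
for (int i = 0; i < (num_of_cols/2) * depth; i++) {
    h_b[i] = (short)(rand() % 10);
    h_d[i] = (short)(rand() % 10);
}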


Extended pointers to these memories are set up:

cl_mem_ext_ptr_t d_a_ext;
cl_mem_ext_ptr_t d_b_ext;
cl_mem_ext_ptr_t d_d_ext;
cl_mem_ext_ptr_t d_c_ext;

Some work is then done to point each of these at the appropriate DDR bank for the specific card.
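Roughly along these lines (a sketch; the XCL_MEM_DDR_BANKn macros come from Xilinx's CL/cl_ext.h, and which buffer lands in which bank depends on the card):

d_a_ext.flags = XCL_MEM_DDR_BANK0;  // route A to DDR bank 0 (bank choice is card-specific)
d_a_ext.obj   = NULL;
d_a_ext.param = 0;
d_b_ext.flags = XCL_MEM_DDR_BANK1;  // B and D in separate banks so they can be read in parallel
d_b_ext.obj   = NULL;
d_b_ext.param = 0;
d_d_ext.flags = XCL_MEM_DDR_BANK2;
d_d_ext.obj   = NULL;
d_d_ext.param = 0;
d_c_ext.flags = XCL_MEM_DDR_BANK3;  // result C in its own bank
d_c_ext.obj   = NULL;
d_c_ext.param = 0;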

The buffers are then created and copied to the device:


d_a = clCreateBuffer(world.context, CL_MEM_READ_ONLY | CL_MEM_EXT_PTR_XILINX, sizeof(short) * num_of_rows * depth, &d_a_ext, &err);
assert(err == CL_SUCCESS);
d_b = clCreateBuffer(world.context, CL_MEM_READ_ONLY | CL_MEM_EXT_PTR_XILINX, sizeof(short) * (num_of_cols/2) * depth , &d_b_ext, &err);
assert(err == CL_SUCCESS);
d_d = clCreateBuffer(world.context, CL_MEM_READ_ONLY | CL_MEM_EXT_PTR_XILINX, sizeof(short) * (num_of_cols/2) * depth , &d_d_ext, &err);
assert(err == CL_SUCCESS);
d_c = clCreateBuffer(world.context, CL_MEM_WRITE_ONLY | CL_MEM_EXT_PTR_XILINX, sizeof(short) * num_of_rows * num_of_cols, &d_c_ext, &err);
assert(err == CL_SUCCESS);

std::cout << "Copying Buffers to device...." << std::endl;
xcl_memcpy_to_device(world,d_a,h_a,sizeof(short) * num_of_rows * depth);
xcl_memcpy_to_device(world,d_b,h_b,sizeof(short) * (num_of_cols/2) * depth);
xcl_memcpy_to_device(world,d_d,h_d,sizeof(short) * (num_of_cols/2) * depth);
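One step the snippets above skip: the kernel arguments have to be set before the launch. A sketch (the argument order and indices here are an assumption; check the actual source):

err  = clSetKernelArg(kernelSgemm_0, 0, sizeof(cl_mem), &d_a);  // matrix A
err |= clSetKernelArg(kernelSgemm_0, 1, sizeof(cl_mem), &d_b);  // first half of B
err |= clSetKernelArg(kernelSgemm_0, 2, sizeof(cl_mem), &d_d);  // second half of B
err |= clSetKernelArg(kernelSgemm_0, 3, sizeof(cl_mem), &d_c);  // output C
assert(err == CL_SUCCESS);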

And finally the kernel is enqueued and run, and the results are waited on:

err = clEnqueueTask(world.command_queue, kernelSgemm_0, 0, NULL, &ks_0_event);
err = clFinish(world.command_queue);

Then the results are read back and checked, and some figures on how fast it ran are generated.
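In sketch form (xcl_memcpy_from_device is the counterpart of the copy-in helper above; the timing part assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, which I believe xcl_world_single() takes care of in these examples):

xcl_memcpy_from_device(world, h_c, d_c, sizeof(short) * num_of_rows * num_of_cols);

cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(ks_0_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(ks_0_event, CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &end,   NULL);
double seconds = (end - start) * 1e-9;  // profiling counters are in nanoseconds
double gops = (2.0 * num_of_rows * num_of_cols * depth) / seconds * 1e-9;  // 1 mult + 1 add per MAC
std::cout << "Kernel time: " << seconds << " s (~" << gops << " GOPS)" << std::endl;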

The .xo file that is also in the directory is the actual (precompiled) kernel; everything above is host code. For the high performance example, as far as I can tell, we don't have the CL source on GitHub.


Does that answer your question, or was there a section of this example you wanted to know more about?

Regards,

Evan

If at first you don't succeed, try redefining success?