adel_chaker
Visitor

Writing and reading in the same array

Hello,
I'm a student working on a project, and I've run into the following problem:
My goal is to store a certain number of coefficients in an array and use them in a calculation function. After a few tries I realized that the maximum number of calculation function instances I can run in parallel is 8, so I fixed my code with a #pragma to limit the number of parallel function calls.
The problem I have now is that even though I have limited the number of calls to the parallel calculation function, the design still reads and stores all of the coefficients, whereas I only want to store the coefficients needed by the 8 calculation functions currently running.

void top_level_function(char coef_array[TOTAL_NUMBER_OF_ARRAY])
{
    // Code
    // ...

main_loop: for (...)
    {
#pragma HLS PIPELINE

    interne_loop: for (/* all coefs */)
        {
        partiel_loading_of_array_loop:
            {
#pragma HLS array_partition variable=partiel_coefs complete
                // partial loading of partiel_coefs from coef_array
            }
#pragma HLS allocation instances=calcul limit=8 function
            calcul();
        }
    }
}

I also tried creating a function to partially load the coefficients, with a pragma like the one for the calculation function, but it doesn't work: it creates dependencies between the select operation and the zext operation.
In other words, once I have loaded the coefficients needed for the current calculation step, I would like to reuse that same storage space for the next coefficients.

Thank you in advance for your help.

Accepted Solution

u4223374
Advisor

Could we possibly get the complete code? It's sort of hard to see what's going on with such a small example.

 

There are basically two ways to achieve what you're requesting. One is to have a loop containing two functions, one preparing the buffer and the other using it:

for (int i = 0; i < 1000; i++) {
	prepare_buffer(buff_A);
	use_buffer(buff_B);
	// Swap buffers
	buff_C = buff_A;
	buff_A = buff_B;
	buff_B = buff_C;
}

Except that HLS doesn't appreciate this method of swapping buffers, and you need a bunch more control code to handle the start/end cases. Nevertheless, this is the approach I always end up using.
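For illustration, here is a minimal, self-contained sketch (not from the original post) of the kind of control code that makes the ping-pong pattern acceptable to HLS: instead of swapping pointers it toggles between two statically allocated buffers, skips the consumer on the first iteration and runs one extra iteration so the last buffer is still consumed. The names prepare_buffer, use_buffer, BUF_SIZE and N_ITER are assumptions carried over from the snippet above, not code from the thread.

#define BUF_SIZE 32
#define N_ITER   1000

void prepare_buffer(char buff[BUF_SIZE]);        // fills one buffer (assumed signature)
void use_buffer(const char buff[BUF_SIZE]);      // consumes one buffer (assumed signature)

void ping_pong_top() {
    char buff_0[BUF_SIZE];
    char buff_1[BUF_SIZE];

    // One extra iteration so the buffer prepared last is still consumed.
    for (int i = 0; i < N_ITER + 1; i++) {
        bool even = ((i & 1) == 0);
        if (i < N_ITER) {
            // Producer fills the buffer selected for this iteration.
            if (even) prepare_buffer(buff_0);
            else      prepare_buffer(buff_1);
        }
        if (i > 0) {
            // Consumer reads the buffer that was filled on the previous iteration.
            if (even) use_buffer(buff_1);
            else      use_buffer(buff_0);
        }
    }
}

Whether the producer and consumer calls actually overlap depends on how the loop is scheduled; the sketch is only meant to show the buffer selection and the first/last-iteration handling mentioned above.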

 

The other approach is to use dataflow:

{
#pragma HLS DATAFLOW
#pragma HLS STREAM variable=buff depth=10
prepare_buffer(buff);
use_buffer(buff);
}

This tells HLS to just stream the data from prepare_buffer to use_buffer, with only a small number of elements held at any time. Streaming puts a number of limitations on how the elements can be accessed (e.g. they must be accessed in order, and every element must be read/written exactly once), but it should give both better performance and fewer resources. However, the latest version of HLS that I regularly use is 2016.2, and in that version the whole dataflow system is pretty fragile. Hopefully it's better in newer releases.
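For context, a minimal self-contained sketch of that dataflow pattern might look like the code below. The element type, the sizes, and the two function bodies are assumptions for illustration, not code from the thread; the important property is that both functions touch every element exactly once and strictly in order, which is what keeps the streaming channel legal.

#define N_COEFS 256

// Producer: writes every element of the channel exactly once, in order.
static void prepare_buffer(const char coef_in[N_COEFS], char buff[N_COEFS]) {
    for (int i = 0; i < N_COEFS; i++) {
#pragma HLS PIPELINE II=1
        buff[i] = coef_in[i];
    }
}

// Consumer: reads every element exactly once, in the same order.
static void use_buffer(const char buff[N_COEFS], char result[N_COEFS]) {
    for (int i = 0; i < N_COEFS; i++) {
#pragma HLS PIPELINE II=1
        result[i] = buff[i] * 2;   // placeholder calculation
    }
}

void dataflow_top(const char coef_in[N_COEFS], char result[N_COEFS]) {
#pragma HLS DATAFLOW
    char buff[N_COEFS];
#pragma HLS STREAM variable=buff depth=10
    prepare_buffer(coef_in, buff);
    use_buffer(buff, result);
}

The depth=10 simply mirrors the value used above; with strictly in-order access a small FIFO depth like this is normally enough.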



adel_chaker
Visitor

Thank you very much.
After reading some of the dataflow examples and the questions on the forum, I was able to answer a lot of my questions and move forward. After synthesizing my project, I get the following for the coefficient port:

// KERNEL_BUS
// 0x0100 ~
// 0x01ff : Memory 'kernel_0' (256 * 8b)
//          Word n : bit [ 7: 0] - kernel_0[4n]
//                   bit [15: 8] - kernel_0[4n+1]
//                   bit [23:16] - kernel_0[4n+2]
//                   bit [31:24] - kernel_0[4n+3]
// 0x0200 ~
// 0x02ff : Memory 'kernel_1' (256 * 8b)
//          Word n : bit [ 7: 0] - kernel_1[4n]
//                   bit [15: 8] - kernel_1[4n+1]
//                   bit [23:16] - kernel_1[4n+2]
//                   bit [31:24] - kernel_1[4n+3]
// 0x0300 ~
// 0x03ff : Memory 'kernel_2' (256 * 8b)
//          Word n : bit [ 7: 0] - kernel_2[4n]
//                   bit [15: 8] - kernel_2[4n+1]
//                   bit [23:16] - kernel_2[4n+2]
//                   bit [31:24] - kernel_2[4n+3]
// 0x0400 ~
...
#define XCONV_3D_PARALLEL_KERNEL_BUS_ADDR_KERNEL_0_BASE   0x0100
#define XCONV_3D_PARALLEL_KERNEL_BUS_ADDR_KERNEL_0_HIGH   0x01ff
#define XCONV_3D_PARALLEL_KERNEL_BUS_WIDTH_KERNEL_0       8
#define XCONV_3D_PARALLEL_KERNEL_BUS_DEPTH_KERNEL_0       256
#define XCONV_3D_PARALLEL_KERNEL_BUS_ADDR_KERNEL_1_BASE   0x0200
#define XCONV_3D_PARALLEL_KERNEL_BUS_ADDR_KERNEL_1_HIGH   0x02ff
#define XCONV_3D_PARALLEL_KERNEL_BUS_WIDTH_KERNEL_1       8
#define XCONV_3D_PARALLEL_KERNEL_BUS_DEPTH_KERNEL_1       256
#define XCONV_3D_PARALLEL_KERNEL_BUS_ADDR_KERNEL_2_BASE   0x0300
#define XCONV_3D_PARALLEL_KERNEL_BUS_ADDR_KERNEL_2_HIGH   0x03ff
...

After thinking about it, I decided to synthesize 4 calculation modules in parallel, so I added the following pragma directive to my port:
#pragma HLS array_partition variable=kernel cyclic factor=108
Each kernel has 27 values, so to give all 4 modules access to their elements I used 27 * 4 = 108. I would now like to know how to deposit my data, i.e. which address corresponds to what.

My first idea was the following:
at 0x100 I have 4 bytes: byte 0 = value 0 of kernel 0
byte 1 = value 0 of kernel 4
byte 2 = value 0 of kernel 8
byte 3 = value 0 of kernel 12
at 0x104 I have 4 bytes: byte 0 = value 0 of kernel 16
...
But it doesn't work the way I expected. It does work for all 27 values of kernels 0 to 3 (byte 0 of addresses 0x100, 0x200, 0x300, ... 0x6x00).

But when I write byte 1 of address 0x100, for example, it is kernel 5 that receives the value and not kernel 4. Why?

The instruction I use in Python:
conv_kernel.write(0x200, 0x100)  # example where I write 1 into byte 1
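For reference, here is a small sketch (mine, not from the post) of how a flat element index of the partitioned 8-bit array is usually expected to map to a register-map offset, assuming the standard behaviour of a cyclic partition (element i goes to bank i % factor, at local index i / factor) and the 4-elements-per-32-bit-word packing shown in the generated header above. The factor 108 and the 0x100 spacing between the kernel_N banks are taken from this thread; whether the flat index matches the (kernel, value) numbering used in the post is exactly the open question here.

#include <cstdio>

// Assumed layout parameters, taken from the thread and the generated header.
const int      PART_FACTOR = 108;     // cyclic partition factor
const unsigned BANK_BASE   = 0x0100;  // base address of kernel_0
const unsigned BANK_STRIDE = 0x0100;  // kernel_1 at 0x0200, kernel_2 at 0x0300, ...

// For flat element index i, return the 32-bit-aligned offset of the word
// holding that element and the byte lane inside the word.
void locate(int i, unsigned *word_offset, int *byte_lane) {
    int bank  = i % PART_FACTOR;                 // which kernel_N memory
    int local = i / PART_FACTOR;                 // index inside that memory
    unsigned bank_base = BANK_BASE + (unsigned)bank * BANK_STRIDE;
    *word_offset = bank_base + 4u * (unsigned)(local / 4);  // word n holds elements 4n..4n+3
    *byte_lane   = local % 4;                                // element sits in bits [8*lane+7 : 8*lane]
}

int main() {
    unsigned off; int lane;
    locate(1, &off, &lane);   // flat element 1 -> bank 1 (kernel_1), local index 0
    std::printf("offset 0x%04x, byte lane %d\n", off, lane);  // prints: offset 0x0200, byte lane 0
    return 0;
}

Under these assumptions, a host write such as conv_kernel.write(word_offset, value << (8 * byte_lane)) targets that element (but overwrites the other byte lanes of the word with zero unless you read-modify-write), which may help check whether the kernel-4-versus-kernel-5 behaviour comes from the addressing or from how the flat index is built.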

 

adel_chaker
Visitor

After rereading my Python execution code, I discovered a small synchronization problem between the data of the instantiated modules, which I was able to solve.

So what I thought I had understood about the layout of the data in the BRAM was essentially correct.

Now I have a new problem creating an output array with the Xlnk library. The command output_buffer = xlnk.cma_array(shape=(out_size,), dtype=np.uint64) returns "Failed to allocate Memory!". I read that this happens when the requested size is too big for what is defined in Linux, and that I would have to edit the img file. Am I on the right track?

Anyway, thanks again @u4223374 for your help.
