benedetto73
Adventurer
1,049 Views
Registered: 09-30-2019

DATAFLOW: multiple circular buffers read-only for consumers, write-only for producers

I understand that this might not be feasible in HLS...

So... imagine a normal DATAFLOW scenario where the "refillers" read from independent memory banks and ports.
Each refiller writes data as fast as it can into a local "circular buffer".
I would like these circular buffers to be "normal" buffers in BRAM or URAM.
The items in the buffers are ap_uint<128>, and I would like each buffer to hold around 32, 64, or 128 items.

The consumer is a fast function in a tight loop that produces an output every 3-4 cycles.
At each iteration the consumer advances the read position of each buffer by an amount that depends on the data.
So there must be some kind of feedback mechanism in here.
I know... DATAFLOW doesn't support feedback.

I wonder if there's any way of achieving this.

 

[Attached image: multiple cb.png]

PS: the algorithm works fine when placed inside a single loop. However, it runs much slower than our performance targets. There are lots of dependencies in such a big loop and the compiler, I suppose, schedules much of the work sequentially.

9 Replies
dishlamian
Observer
994 Views
Registered: 12-03-2019

Not an expert on Xilinx's toolchain [yet], but I think what you are looking for is non-blocking streams. Search for "Non-Blocking Reads and Writes" in UG902. What you probably want is multiple producers, each writing to an independent stream, and a consumer that loops over all the streams and checks them one by one to see whether any of them has valid data. The true/false return of the read operation on the consumer side can be used to decide whether to exit the polling loop. The whole consumer code is then wrapped inside a while loop, so the consumer goes back to the polling loop once it has processed the data it already read. I am not sure whether Xilinx's compiler allows it, but I would also try unrolling the polling loop completely.
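
A minimal sketch of that polling pattern, assuming four hypothetical input streams and a placeholder for the processing step; hls::stream::read_nb() is the non-blocking read described in UG902:

#include <ap_int.h>
#include <hls_stream.h>

// Poll four independent streams; read_nb() returns true only when the FIFO
// had valid data and the word was actually read (non-blocking).
void consumer(hls::stream<ap_uint<128> > &s0, hls::stream<ap_uint<128> > &s1,
              hls::stream<ap_uint<128> > &s2, hls::stream<ap_uint<128> > &s3,
              int n_outputs) {
    int produced = 0;
    while (produced < n_outputs) {
        ap_uint<128> word;
        bool got = false;
        if      (s0.read_nb(word)) got = true;
        else if (s1.read_nb(word)) got = true;
        else if (s2.read_nb(word)) got = true;
        else if (s3.read_nb(word)) got = true;
        if (got) {
            // process(word);   // placeholder for the real consumer work
            ++produced;
        }
    }
}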

benedetto73
Adventurer
953 Views
Registered: 09-30-2019

Hi @dishlamian, thanks for looking at my issue.

I am not sure your suggestion can work here. In my design, the consumer is a very fast kernel that produces an output every 3-4 cycles.
At every iteration it needs to consume a variable amount of data from the various incoming sources. Hence I cannot use streams here, because they would be too slow.
I would like to use local BRAM/URAM caches, one per incoming data source.

The ideal scenario would be that the consumer (only!) reads the N local caches while the N producers fill up those same caches (each writing its own cache).

At this point I am not even sure that this is feasible in HLS.

PS: I am going to edit the post so it is more clear.

dishlamian
Observer
907 Views
Registered: 12-03-2019

HLS streams, depending on their depth, are implemented either as shift registers (a chain of registers) or as FIFOs built from block RAM, and they allow single-cycle reads and writes, with the possibility of stalling when the stream is full or empty. I am not sure why you think HLS streams are slow. Since streams have a configurable depth, you can adjust the depth to minimize stalls for the producers. The consumer can potentially poll all the streams in one cycle if you unroll the polling loop and go back to it as soon as the data received from any of the streams has been processed. If the rate of data generation is too high for one consumer to keep up with, you can use more than one consumer kernel.

P.S. Note that here I am talking about using the hls::stream class (UG902, "HLS Stream Library") to stream data between kernels in a design via FIFOs, not about streaming data to/from external memory or the host through the kernel's input/output arrays.
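
For concreteness, a minimal sketch of that pattern: a DATAFLOW region connecting a producer and a consumer through an hls::stream FIFO with an explicitly set depth. The refill/compute stage names and the depth of 64 are placeholders, not taken from this thread:

#include <ap_int.h>
#include <hls_stream.h>

// Producer: writes one 128-bit word per cycle; blocks only when the FIFO is full.
static void refill(const ap_uint<128> *src, int n, hls::stream<ap_uint<128> > &out) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out.write(src[i]);
    }
}

// Consumer: reads one 128-bit word per cycle; blocks only when the FIFO is empty.
static void compute(hls::stream<ap_uint<128> > &in, ap_uint<128> *dst, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        dst[i] = in.read();
    }
}

void top(const ap_uint<128> *src, ap_uint<128> *dst, int n) {
    static hls::stream<ap_uint<128> > fifo("fifo");
#pragma HLS STREAM variable=fifo depth=64
#pragma HLS DATAFLOW
    refill(src, n, fifo);
    compute(fifo, dst, n);
}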


benedetto73
Adventurer
880 Views
Registered: 09-30-2019

Thanks again for your help, @dishlamian. I get your point.

I am trying to do what you suggest: passing entire buffers via dataflow streams, like this:

struct Cache { ap_uint<128> x[64]; };
hls::stream<Cache>

What I see in the HW emulator is that the consumer, when it receives new data, copies the entire buffer before processing it.
I thought it would be a simple swap instead.

If I can use something like this, where the consumer processes several buffers while the producers fill up the next ones... I would be happy, I guess.

 

dishlamian
Observer
872 Views
Registered: 12-03-2019

That will give you a FIFO with a depth of 2 (the default) and a width of 128 x 64 bits. That is probably not what you want. What you probably want is hls::stream<ap_uint<128>>, followed by #pragma HLS STREAM variable=<variable name> depth=64. This way, only 128 bits will be read/written in each attempt, and the FIFO will not stall unless 64 such values are sitting in it unread by the consumer. I would recommend taking a look at the following as a typical example of how to use the hls::stream class:

https://github.com/Xilinx/Vitis_Accel_Examples/blob/master/cpp_kernels/dataflow_stream/src/adder.cpp

benedetto73
Adventurer
857 Views
Registered: 09-30-2019

@dishlamian, with what you propose (hls::stream<ap_uint<128>> with a depth of 64)... doesn't the consumer get a single word per read?

I need the entire buffer to be swapped at once.

 

dishlamian
Observer
852 Views
Registered: 12-03-2019

It will get one word per read, but will each of your producers generate 64 words in one go? If the producers generate words one by one, you might as well read and process them one by one in the consumer, too, rather than have the consumer wait until 64 words have been produced by [at least] one producer and then consume them in one go (which I doubt would be doable in the 3-4 cycles you mentioned earlier, either). If you really want 64 words to be transferred from each producer to the consumer with each write to the stream, then the struct you mentioned earlier is exactly what you want.
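
A minimal sketch of that struct-per-transaction option, reusing the Cache struct from the earlier post; the fill/drain loops and the function names are placeholders:

#include <ap_int.h>
#include <hls_stream.h>

struct Cache { ap_uint<128> x[64]; };

// Producer: fill a whole 64-entry buffer locally, then hand it over in a
// single stream transaction.
void producer(const ap_uint<128> *src, hls::stream<Cache> &out) {
    Cache c;
    for (int i = 0; i < 64; ++i) {
#pragma HLS PIPELINE II=1
        c.x[i] = src[i];
    }
    out.write(c);   // one write transfers all 64 words at once
}

// Consumer: one read makes the whole buffer visible; no word-by-word polling.
void consumer(hls::stream<Cache> &in, ap_uint<128> *dst) {
    Cache c = in.read();
    for (int i = 0; i < 64; ++i) {
#pragma HLS PIPELINE II=1
        dst[i] = c.x[i];
    }
}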

benedetto73
Adventurer
847 Views
Registered: 09-30-2019

@dishlamian thanks again for your help!
The producers simply read from memory.
But I cannot afford to have the consumer read words one by one, for each channel.
So swapping the entire buffer is the only alternative in HLS, I suppose (the real goal would be circular buffers with reads from the consumer and writes from the producers).

Anyhow, in the struct scenario the HW emulator shows a 250 ns delay each time the buffer is swapped, which indicates that data is actually being copied from the "stream" into the function.

I would like to avoid that.
Is it possible to have just a "swap" when I read the struct in the consumer?

dishlamian
Observer
820 Views
Registered: 12-03-2019

What is your target operating frequency? My understanding is that writes to and reads from FIFOs should happen in one clock cycle, regardless of FIFO width/depth. Is that 250 ns just for one producer write followed by one consumer read, or does it also include the latency of the read from external memory? Also, are the loops in your kernels pipelined with an II of one?

The only way to avoid the data movement completely is to have the producer and the consumer in the same kernel, which you have already tried and found to prevent the compiler from achieving an II of one. On-chip buffers in one FPGA kernel are not accessible from other kernels, so there is no choice but to move the data from one kernel to another. However, on-chip data movement on FPGAs is extremely fast and would not normally become the bottleneck in a design, or at least not before off-chip memory does.

One work I know of that has attempted something similar is this:

https://wangzeke.github.io/doc/multikernelpartitioning-tvlsi-17.pdf

That work targets Intel FPGAs and uses OpenCL channels, but OpenCL channels are implemented in the same way as HLS streams. Maybe you can replicate the same design pattern and achieve what you want.
