e_ensafi
Explorer
809 Views
Registered: ‎08-13-2020

AIE - Waiting For Results Using PL GMIO in Graph vs. XRT


One limitation of the new PL GMIO API in the Vitis 2020.2 AIE Tools is that there is no GMIO::wait() counterpart for GMIO::pl_gm(). For example, if you want to run N iterations using adf::graph::run(N) and wait for the results of each iteration, it's quite easy to accomplish this with GMIO::gm2aie_nb() and GMIO::aie2gm_nb() followed by GMIO::wait() in the PS host application. However, if you call GMIO::pl_gm() there is no corresponding wait function, so you need to wait for the graph to finish executing a finite number of iterations before accessing the final results, which could be all N iterations' worth of data buffered up in DDR memory. This is probably acceptable if you have enough memory to hold N iterations' worth of data and if near-real-time processing is not important in your application.

However, if you want to obtain the results of each iteration as soon as they are available, the only solution I can think of is to call adf::graph::run(1) in a loop. There must be a performance hit associated with running a graph in this manner, isn't there? It sounds more efficient to kick off an infinite AIE run and let the PS acquire results in another thread. If calling run(1) in a loop 1000 times is as efficient, within a negligible margin, as run(1000), then I stand corrected.

Otherwise, how would one obtain intermediate iteration results using GMIO::pl_gm(), or, if this is not technically feasible, perhaps by using XRT library calls? Can we invoke the mm2s/s2mm PL kernels in a loop on the PS while the graph is running, where each mm2s/s2mm invocation only processes enough data for a single iteration, so that we essentially emulate the GMIO-style DMA-and-wait logic using XRT?
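
For reference, the per-iteration pattern that already works with plain GMIO looks roughly like this (a minimal sketch; the graph instance "gr", the GMIO names "gmioIn"/"gmioOut", and the sizes are placeholders, not from an actual design):

```cpp
// Minimal sketch of the per-iteration pattern with plain (non-PL) GMIO.
#include <adf.h>
#include "graph.h" // assumed to instantiate the graph "gr" and GMIOs "gmioIn"/"gmioOut"

static const size_t ITER_BYTES = 1024; // assumed data moved per iteration
static const int N = 1000;             // total iterations

int main() {
    char* in  = (char*)adf::GMIO::malloc(N * ITER_BYTES);
    char* out = (char*)adf::GMIO::malloc(N * ITER_BYTES);
    // ... fill "in" with N iterations of input data ...

    gr.init();
    gr.run(N); // a single graph invocation covers all N iterations

    for (int i = 0; i < N; i++) {
        gmioIn.gm2aie_nb(in + i * ITER_BYTES, ITER_BYTES);   // queue input DMA
        gmioOut.aie2gm_nb(out + i * ITER_BYTES, ITER_BYTES); // queue output DMA
        gmioOut.wait(); // blocks until iteration i's output has landed in DDR
        // ... consume iteration i's results here, while the graph keeps running ...
    }

    gr.end();
    adf::GMIO::free(in);
    adf::GMIO::free(out);
    return 0;
}
```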

1 Solution

Accepted Solutions
derekh
Xilinx Employee
772 Views
Registered: ‎08-06-2018

Hi @e_ensafi 

The relation between responsiveness and throughput is one of the important design decisions you need to make for your system. In many cases they contradict each other, and when shared resources are involved, such as DDR accessed through GMIO while PS applications compete for processor threads, it can be tricky to schedule optimally.

Invoking kernels or functions is almost always associated with some overhead, so the longer the run per iteration, the less significant this overhead becomes. But with a longer run per iteration, responsiveness suffers.

If latency and responsiveness are your driving constraints, I would look into having the AI Engine graph run infinitely and using PL kernels for the low-level frame buffering and iteration result analysis.
The PS can work with the high-level frames/buffers, feeding data from DDR memory to PL buffers if your data source starting point is memory.
This allows you to keep the AI Engine busy, with the PS interfering only when needed from the high-level perspective.

If your data originates from transceivers or a similar transport source, then depending on the data structure it may be possible to arrange the frames/buffers directly in PL without intermediate storage in DDR memory.

From an API perspective, you can treat the mm2s/s2mm PL kernels as separate queues feeding/extracting data from an infinitely running graph. As long as data is written to and read from the queues, the AIE graph is running. When the mm2s queue is empty or the s2mm queue is full, the graph will stall.
Note that in this state the AI Engine graph is still primed to calculate once you populate the write queue and empty the read queue.
Care should be taken that there is sufficient data in the input queue to produce a complete window (frame) of output data.
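
In code, the host-side loop around such an infinitely running graph might look roughly like this (a hypothetical sketch using the XRT native C++ API; the xclbin name, the kernel argument order of pointer then length-in-128-bit-words, and the sizes are all assumptions):

```cpp
// Hypothetical sketch of the "queues around an infinitely running graph" idea.
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>
#include <xrt/xrt_bo.h>

int main() {
    auto device = xrt::device(0);
    auto uuid   = device.load_xclbin("app.xclbin");
    auto mm2s   = xrt::kernel(device, uuid, "mm2s");
    auto s2mm   = xrt::kernel(device, uuid, "s2mm");

    const size_t FRAME_BYTES = 4096; // assumed data per graph iteration
    auto in_bo  = xrt::bo(device, FRAME_BYTES, mm2s.group_id(0));
    auto out_bo = xrt::bo(device, FRAME_BYTES, s2mm.group_id(0));

    // The AIE graph is assumed to have been launched "forever" elsewhere,
    // e.g. gr.run(-1) in the PS application.
    for (int i = 0; i < 1000; i++) {
        // ... fill in_bo with iteration i's input ...
        in_bo.sync(XCL_BO_SYNC_BO_TO_DEVICE);
        auto r_in  = mm2s(in_bo, FRAME_BYTES / 16);  // push one iteration of input
        auto r_out = s2mm(out_bo, FRAME_BYTES / 16); // pull one iteration of output
        r_out.wait(); // returns when iteration i's results are in out_bo
        r_in.wait();
        out_bo.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
        // ... consume iteration i's results ...
    }
    return 0;
}
```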

 

Derek
SAE DSP and AI Engine, Xilinx Sweden/EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**


11 Replies

florentw
Moderator
767 Views
Registered: ‎11-09-2015

Hi @e_ensafi 

Just to add to @derekh's reply, there is maybe something you forgot to consider: it is possible to generate interrupts from the HLS IP.

If you use interrupts to know when the mm2s and s2mm kernels are done executing, that gives you the flexibility you are looking for: starting a new transaction as soon as the previous one is complete.
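
On the host side, one way to approximate that interrupt-driven flow without hand-rolling an interrupt handler is to block a worker thread on each s2mm completion (xrt::run::wait() is itself event-driven on Linux) and signal the consumer. A minimal sketch; the xrt::run handle is assumed to come from an s2mm invocation like the one above, and all names are placeholders:

```cpp
// Worker thread blocks on an s2mm invocation and signals the consumer when
// that iteration's results have landed in DDR.
#include <thread>
#include <mutex>
#include <condition_variable>
#include <queue>
#include <xrt/xrt_kernel.h>

static std::mutex m;
static std::condition_variable cv;
static std::queue<int> ready; // iterations whose output is available

void notify_when_done(xrt::run r, int iter) {
    r.wait(); // returns once this s2mm invocation completes
    {
        std::lock_guard<std::mutex> lk(m);
        ready.push(iter);
    }
    cv.notify_one();
}

// Consumer: sleeps until some iteration's results become available.
// Usage after dispatching s2mm for iteration i:
//   std::thread(notify_when_done, r_out, i).detach();
int next_ready_iteration() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return !ready.empty(); });
    int iter = ready.front();
    ready.pop();
    return iter;
}
```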


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
e_ensafi
Explorer
742 Views
Registered: ‎08-13-2020

@florentw @derekh Is GMIO::wait() implemented using interrupts internally? There is usually a performance hit when processing interrupts. If there is a more efficient way, I'd like to look into it, but if that's what GMIO::wait() is already doing, then perhaps that's what I will do as well.

florentw
Moderator
701 Views
Registered: ‎11-09-2015

Hi @e_ensafi 

I am not sure if there is an interrupt or if the PS is just looking at the completion signal. But there needs to be a mechanism for the PS to synchronize with the NoC. I do not think this happens by "magic". 


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
e_ensafi
Explorer
687 Views
Registered: ‎08-13-2020

@florentw  I was thinking more along the lines of synchronization events, semaphores, etc.  But those are probably not things that exist in bare metal FPGA land, only on the ARM under PetaLinux.  I will have to read up more on handling and generating interrupts between the PS and PL.  Not sure where to start in the documentation.

e_ensafi
Explorer
325 Views
Registered: ‎08-13-2020

@derekh Besides ease of use, is there a difference in terms of available resources between using mm2s/s2mm PL kernels via XRT outside of the graph and using PL GMIO inside the graph? I have migrated an application from Vitis 2020.1 using GMIO to 2020.2 using PL GMIO. The graph application is still stalling (and eventually deadlocking, actually). The kernel expects two stream inputs from two mm2s kernels. I set the PL clock frequency to 250 MHz, and both the PL (HLS) and AIE kernels use 128-bit reads. For HLS, I use ap_int<128> memory pointers and hls::stream<ap_axis<128, 0, 0, 0>> streams. For AIE, I use readincr_v8/writeincr_v8 with v8int16 and input/output_stream<int16>. I even tried PLIO with properly sized text files, and it behaved the same way.

The only thing I can think of is p. 63 of UG1076, Figure 5: Producer and Consumer Kernels with Reconvergent Streams. I have two mm2s PL kernels attached to one AIE kernel with two inputs and one output. Indeed, I am likely pushing data from the PL onto one stream while the AIE is reading from the other stream. Is the expectation that data is read consistently from both streams? The amount of data being written to one stream is significantly larger than the other: for every N bytes I read from the smaller stream, I read M >> N bytes from the larger stream. Is it preferable to interleave the data and use only one stream? In that case, the mm2s PL kernel would have to take two memory pointers from the PS, read the proper amount of data from each, and output the interleaved result to a single stream destined for the AIE, as in the sketch below. I am not sure whether this would really alleviate any potential back pressure that might be causing the stall/deadlock issue.
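
For concreteness, the interleaving mm2s I have in mind would look something like this (a hypothetical sketch; the kernel name, N_SMALL, and M_LARGE are made up, and the AIE kernel would read the single stream in the same fixed N:M pattern):

```cpp
// Hypothetical mm2s that interleaves two DDR buffers onto one AXI stream.
#include <ap_int.h>
#include <hls_stream.h>
#include <ap_axi_sdata.h>

static const int N_SMALL = 1; // assumed 128-bit words per burst from the small buffer
static const int M_LARGE = 8; // assumed 128-bit words per burst from the large buffer

extern "C" void mm2s_interleave(const ap_int<128>* small, const ap_int<128>* large,
                                hls::stream<ap_axis<128, 0, 0, 0>>& s, int bursts) {
#pragma HLS interface m_axi port=small offset=slave bundle=gmem0
#pragma HLS interface m_axi port=large offset=slave bundle=gmem1
#pragma HLS interface axis port=s
#pragma HLS interface s_axilite port=bursts
#pragma HLS interface s_axilite port=return
    int si = 0, li = 0;
    for (int b = 0; b < bursts; b++) {
        for (int i = 0; i < N_SMALL; i++) { // words from the smaller buffer
#pragma HLS pipeline II=1
            ap_axis<128, 0, 0, 0> w;
            w.data = small[si++];
            w.keep = -1;
            w.last = 0;
            s.write(w);
        }
        for (int i = 0; i < M_LARGE; i++) { // words from the larger buffer
#pragma HLS pipeline II=1
            ap_axis<128, 0, 0, 0> w;
            w.data = large[li++];
            w.keep = -1;
            w.last = (b == bursts - 1) && (i == M_LARGE - 1);
            s.write(w);
        }
    }
}
```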

florentw
Moderator
554 Views
Registered: ‎11-09-2015

Hi @e_ensafi 

Sorry, I missed your reply on this topic.

Have you been able to make progress on having interrupts between the PS and PL?

Note that this might be more of a question for the Processor System Design and AXI (PS/PL interrupts) or High-Level Synthesis (HLS) (generating interrupts from HLS IP) boards.


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
e_ensafi
Explorer
548 Views
Registered: ‎08-13-2020

@florentw No problem, thank you for the links.  That's what I needed.  Due to bigger problems with deadlocking, I have not even pursued this.  First I need to run one iteration to completion without deadlocking, and then I need to move on to multiple iterations, interrupts, etc.

derekh
Xilinx Employee
320 Views
Registered: ‎08-06-2018

@e_ensafi If you experience deadlock or stalling on a design using stream input/output, you may need to consider adding a few FIFOs to loosen the dependencies between your kernels.

Check out the Stream FIFO Depth chapter of UG1076.
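
For reference, the mechanism from that chapter is applied on an ADF connection like this (a minimal sketch; the kernel functions, runtime ratios, and source file names are placeholders):

```cpp
// Sketch of the UG1076 Stream FIFO Depth mechanism on a graph connection.
#include <adf.h>
using namespace adf;

void prod_func(output_stream_int16* out); // assumed kernel signatures
void cons_func(input_stream_int16* in);

class FifoGraph : public graph {
public:
    kernel producer, consumer;
    FifoGraph() {
        producer = kernel::create(prod_func);
        consumer = kernel::create(cons_func);
        source(producer) = "prod.cc";
        source(consumer) = "cons.cc";
        runtime<ratio>(producer) = 0.9;
        runtime<ratio>(consumer) = 0.9;
        connect<stream> net0(producer.out[0], consumer.in[0]);
        fifo_depth(net0) = 32; // 32-word FIFO loosens the producer/consumer coupling
    }
};
```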

Derek
SAE DSP and AI Engine, Xilinx Sweden/EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
e_ensafi
Explorer
282 Views
Registered: ‎08-13-2020

@derekh We tried increasing fifo_depth but it didn't seem to help.

@florentw @derekh Regarding interrupts generated in HLS, will these be simulated by aiesimulator and/or x86simulator? Another option is to poll the first value of the memory pointer (e.g. the ap_int<128>* mem argument of the mm2s function) for an update. Using this value as a frame counter, as soon as it increments, we read the rest of the data. I'm not sure which approach is more efficient. Also, would a tight polling loop in the PL consume more power?
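
The polling variant described would look roughly like this in HLS (a hypothetical sketch; the layout with the counter in the first 128-bit word and the frame payload reusing the words after it, as well as FRAME_WORDS, are assumptions):

```cpp
// Hypothetical mm2s that polls a frame counter at mem[0] and forwards the
// frame payload (mem[1] onward, reused for each frame) when it increments.
#include <ap_int.h>
#include <hls_stream.h>
#include <ap_axi_sdata.h>

static const int FRAME_WORDS = 256; // assumed frame size in 128-bit words

extern "C" void mm2s_poll(volatile ap_int<128>* mem,
                          hls::stream<ap_axis<128, 0, 0, 0>>& s, int frames) {
#pragma HLS interface m_axi port=mem offset=slave bundle=gmem
#pragma HLS interface axis port=s
#pragma HLS interface s_axilite port=frames
#pragma HLS interface s_axilite port=return
    ap_int<128> last = mem[0];
    for (int f = 0; f < frames; ) {
        ap_int<128> cnt = mem[0];  // poll the frame counter
        if (cnt != last) {         // counter advanced: a new frame is ready
            last = cnt;
            for (int i = 0; i < FRAME_WORDS; i++) {
#pragma HLS pipeline II=1
                ap_axis<128, 0, 0, 0> w;
                w.data = mem[1 + i]; // payload follows the counter word
                w.keep = -1;
                w.last = (i == FRAME_WORDS - 1);
                s.write(w);
            }
            f++;
        }
    }
}
```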

derekh
Xilinx Employee
240 Views
Registered: ‎08-06-2018

Hi @e_ensafi 

Can you please ask the new questions in a separate thread? This topic has been marked as solved, and we will not be able to mark your new questions as resolved in the same thread.

Thanks!

Derek
SAE DSP and AI Engine, Xilinx Sweden/EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**