12-24-2019 04:47 AM
I am currently working on a 2D FFT of a 256x256 image on a Zynq SoC. For this purpose, I transfer 256*4 consecutive bytes from DDR to BRAM and transfer them back to DDR successfully. However, after calculating along the first axis, I need to transfer 256*4 bytes from discontiguous locations that are separated by 256*4 bytes from each other.
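To make the access pattern concrete, here is a minimal sketch of the byte addresses involved, assuming a row-major 256x256 matrix of 4-byte elements. The base address and function names are made up for illustration.

```python
N = 256          # matrix is N x N
ELEM_BYTES = 4   # one 4-byte sample per element

def row_addresses(base, row):
    """Byte addresses of one row: N elements, back to back."""
    start = base + row * N * ELEM_BYTES
    return [start + i * ELEM_BYTES for i in range(N)]

def column_addresses(base, col):
    """Byte addresses of one column: N elements, one per row."""
    stride = N * ELEM_BYTES  # 1024 bytes between vertical neighbours
    return [base + col * ELEM_BYTES + r * stride for r in range(N)]

BASE = 0x10000000  # hypothetical DDR buffer address
row = row_addresses(BASE, 0)
col = column_addresses(BASE, 0)
print(row[1] - row[0])  # gap between neighbours in a row
print(col[1] - col[0])  # gap between neighbours in a column
```

A row is a single 1024-byte contiguous block, while a column is 256 separate 4-byte pieces spaced 1024 bytes apart.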
So far I have used CDMA in simple mode, which is sufficient for the consecutive reads but not for the strided ones. I tried CDMA in scatter-gather mode for the transfers that require strides in memory, but its overhead makes the transfer very slow. Since each piece of data is only 4 bytes, every transfer initiation results in a large overhead.
During my research I came across the AXI Video DMA. This IP core's Product Guide mentions "2D transfers"; however, I couldn't find any further information on the topic in that document. Moreover, I have seen the term "AXI DMA with strides" on the internet, yet I couldn't find additional information on this either.
I am trying to access the transposed elements of a 256x256 2D matrix, and the methods mentioned above either failed or I couldn't find enough information on them. I would be very happy if you could give me some guidance on 2D transfers using DMA or VDMA. References to any other technique are also welcome.
Thank you for your kind help.
12-24-2019 06:23 AM
The short answer is that there is not a good (fast or efficient) solution for transferring discontiguous data from a 2D matrix that is stored in DDR memory. It's a side effect of how physical DDR memory is structured.
The Video DMA (VDMA) does handle 2D transfers, but it still operates in a row-wise fashion. Its advantage is that it allows the user to simply specify the start location of the buffer, the length of each line, the number of lines and the amount of "padding" between the end of a line and the start of the next line. The "stride" is the length of the line + the amount of padding between the lines. It does not provide an easy way to read the 2D array in a column-wise fashion.
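As a rough illustration of those four parameters, here is a small software model of the line start addresses such a 2D transfer would touch. This only models the arithmetic described above; the function and parameter names are hypothetical and are not the VDMA driver API.

```python
def lines_of_2d_transfer(start, line_bytes, num_lines, padding_bytes):
    """Return (line_start_address, line_bytes) for each line of the transfer."""
    if line_bytes <= 0 or num_lines <= 0 or padding_bytes < 0:
        raise ValueError("invalid 2D transfer description")
    stride = line_bytes + padding_bytes  # distance between line starts
    return [(start + n * stride, line_bytes) for n in range(num_lines)]

# Example: an 8-column-wide strip of a 256x256 matrix of 4-byte elements.
# Each line is 8 * 4 = 32 bytes, and the rest of the 1024-byte row is padding.
strip = lines_of_2d_transfer(start=0, line_bytes=32, num_lines=256,
                             padding_bytes=1024 - 32)
print(strip[0], strip[1])
```

In this model a single column could be written as 256 lines of 4 bytes with 1020 bytes of padding, but every line would then be its own tiny access, which is the same inefficiency as before. Whether the real core accepts lines that short is not something this sketch establishes.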
You could improve the column-wise reads by reading several pieces of row-wise data and then storing the extra pieces in BRAM until needed in a later round of FFT processing. That way you wouldn't have to access the DDR as often. The first column-wise FFT will be slow as it pulls data from the DDR. The next couple of column-wise FFTs will be fast as they can pull their data from the temporary BRAM storage. Something like this will likely require some custom logic.
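Here is a minimal software sketch of that idea, assuming a strip width of 8 columns (an arbitrary choice for illustration). The "DDR" and "BRAM" are plain Python lists, and the burst counter only counts row-wise reads; it says nothing about real timing.

```python
N = 256
STRIP = 8  # columns fetched per pass; hypothetical choice

def columns_via_strips(ddr, n=N, strip=STRIP):
    """Return all columns of `ddr` plus the number of row-wise DDR bursts used."""
    bursts = 0
    columns = []
    for c0 in range(0, n, strip):
        # One short row-wise burst per row fills the temporary buffer.
        bram = [ddr[r][c0:c0 + strip] for r in range(n)]
        bursts += n
        # Every column in the strip is now served from the buffer.
        for k in range(len(bram[0])):
            columns.append([bram[r][k] for r in range(n)])
    return columns, bursts

ddr = [[r * N + c for c in range(N)] for r in range(N)]
columns, bursts = columns_via_strips(ddr)
print(bursts, "bursts instead of", N * N, "single-element reads")
```

With these numbers the strip buffer holds 256 * 8 * 4 = 8192 bytes, and the number of DDR accesses drops by the strip width.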
If you have enough BRAM, you might want to consider keeping the results of the first round of FFT processing in a BRAM buffer. Discontiguous accesses in BRAM are fast. A 256x256 buffer should be manageable in BRAM.
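As a back-of-the-envelope check on "manageable", the sketch below sizes the full buffer. It assumes 4-byte elements (as in the original question) and that each 36 Kb block RAM is used as 1024 x 32 bits with the parity bits unused, so 4096 data bytes per block. Both figures are assumptions, and how many blocks are actually available depends on the specific part.

```python
N = 256
ELEM_BYTES = 4            # assumption: one 4-byte value per element
BRAM36_DATA_BYTES = 4096  # assumption: 36 Kb block used as 1024 x 32 bits

def buffer_bytes(n=N, elem_bytes=ELEM_BYTES):
    """Bytes needed to hold the whole n x n intermediate result."""
    return n * n * elem_bytes

def bram36_blocks(total_bytes, block_bytes=BRAM36_DATA_BYTES):
    """Number of blocks needed, rounded up."""
    return -(-total_bytes // block_bytes)

total = buffer_bytes()
print(total, "bytes,", bram36_blocks(total), "blocks")
```

Under these assumptions the buffer is 262144 bytes (256 KiB), or 64 such blocks.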
12-24-2019 08:10 AM
03-18-2020 11:24 PM
Hi, I'm doing something similar on a much larger array. Did you ever work out the quickest way to access data by column? I'm using a custom AXI4 memory-mapped interface to access the PS DDR with a burst length of one. I understand that accessing DDR this way is slow, but I'm finding there is also a lot of overhead on the AXI bus from having to acknowledge each transaction. Is there a faster way? Would it be faster if I used PL DDR (which I don't currently have, but could for a future version)?
03-20-2020 05:14 AM
Yes, the Xilinx tools add a lot of overhead to using the AXI bus. Some of this can be controlled through things like interconnect settings, some of it you might be stuck with.
Yes, if you use a PL SDRAM you can control how much overhead you struggle with in your design. Thankfully the MIG controller has a minimal overhead, although it does struggle with significant lag.
03-20-2020 06:06 PM
Hi @dgisselq, thanks for your reply.
I see you have responded to my other question on this subject on another post so I'll remain on this thread now.
1) Are the AXI interconnect settings you mention accessible by double-clicking on the Zynq pin? I see that it brings up a number of settings; however, I can't change any of them. How are these changed?
2) I'm currently getting a 60-clock-cycle delay between consecutive reads, so I hope to be able to significantly reduce this on the current board using PS DDR. Can you recommend any particular settings that will improve this?
3) As an aside, I attempted to use HP3_FPD and received an error message that its clock was not connected (even though it was), so I just don't use it. Any ideas on this?
03-20-2020 07:22 PM
The interconnect settings I am familiar with are not the settings within the Zynq.
My experience with the DDR SDRAM is with using the MIG as a DDR3 SDRAM controller with an AXI interface. This controller works quite well, unlike some of the other Xilinx AXI cores I've examined. I would be disappointed, although perhaps not as surprised as I'd like to be, if the Zynq couldn't do better than a 60 clock round trip. Double check your implementation, though. Are you issuing read request upon read request before ever getting an acknowledgment? You will need to do so if you want any kind of performance. While it's easy to build a simple AXI master, building the kind of master I'm discussing takes a bit more work. You'll want to make certain that ARVALID && ARLEN == 0 on every clock cycle until you've had ARREADY high once per horizontal row. You might improve performance if you could use a larger ARLEN, but for the sake of discussion I'm assuming ARLEN == 0.
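To put rough numbers on why back-to-back requests matter, here is an idealized cycle-count model. It assumes a fixed round-trip latency, a slave that accepts one request per clock, and responses that come back at one per clock once the pipeline is full. Real DDR behaviour will be worse than the pipelined figure, so treat this as a bound and not a prediction.

```python
def serialized_cycles(n_reads, latency):
    """Each request waits for the previous response before it is issued."""
    return n_reads * latency

def pipelined_cycles(n_reads, latency):
    """One request per clock; the first response arrives after `latency`
    cycles and the rest follow one per clock (idealized)."""
    if n_reads == 0:
        return 0
    return latency + n_reads - 1

N_READS = 256  # one column of a 256-row matrix, ARLEN == 0
LATENCY = 60   # round trip reported earlier in the thread

print(serialized_cycles(N_READS, LATENCY))
print(pipelined_cycles(N_READS, LATENCY))
```

For one 256-element column at a 60-cycle round trip, the model gives 15360 cycles when requests are serialized and 315 cycles when they are issued on every clock.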
I have no ideas regarding HP3_FPD. I have yet to try personally using that interface myself.
If it helps at all, I can offer some AXI examples. Here, for example, is a simple AXI master. It issues one request at a time, and waits for the response from that request before issuing a second request. Yes, you could code your master that way. No, you wouldn't be likely to get the performance you are looking for by doing so. The type of master you would want might look more like this one, but with ARLEN set to zero. (That core requires an unused clock cycle to recycle ARVALID for the next cycle; you'd want to remove that in your implementation.) There's also a wbm2axip master in the same repository as the last one that might be valuable to you. That one can issue one request per clock with ARLEN == 0, as I'm recommending to you.
While I'd love to offer more, that's sort of the extent of my own experience. I've learned a lot about AXI--just not so much about Zynq (yet).