saltedfishlz
Observer
2,209 Views
Registered: ‎09-01-2017

Why does the 'copy' pragma only transfer less than 8 MB of data?

Hello, I am working on a project involving a Convolutional Neural Network.
I used 'copy', 'AXI_DMA simple', and 'sequential'. I think that instructs the IDE to generate a FIFO-based interface.
However, according to document UG1027, this kind of interface can only transfer less than 8 MB of data. That means I cannot finish processing a single CNN layer in a single call of the hardware function, even though the data can 'flow' through the hardware function.
I wonder why this restriction on data size is necessary, since a FIFO-like interface can work nonstop.
I also want to work around the problem. How can I transfer 8 MB or more of data in a single call of the hardware function?
I know that the 'ZERO_COPY' pragma can deal with this. However, it seems too slow to move data from the ZC706's shared memory on the PS side to the PL side, and its latency is too high.
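
For readers who have not used these pragmas, here is a minimal sketch of how the three settings mentioned in the question are typically attached to a hardware function. The function `scale_stream` and its arguments are hypothetical. The data mover is spelled `AXIDMA_SIMPLE` on the assumption of a 2017-era SDSoC release; the spelling may differ between releases, so check UG1027 for your version. A regular C++ compiler ignores the SDS pragmas, so the function also runs as a plain software reference.

```cpp
#include <cassert>

// Hypothetical streaming stage: multiplies every input element by `gain`.
// The SDS pragmas below are hints for the SDSoC compiler only.
// ASSUMPTION: the pragma argument spellings match your SDSoC release.
#pragma SDS data copy(in[0:len], out[0:len])
#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data data_mover(in:AXIDMA_SIMPLE, out:AXIDMA_SIMPLE)
void scale_stream(const float *in, float *out, int len, float gain) {
    assert(len >= 0);
    // Elements are read and written strictly in order, which is what the
    // SEQUENTIAL access pattern promises to the tool.
    for (int i = 0; i < len; ++i) {
        out[i] = in[i] * gain;
    }
}
```

With the simple DMA selected, the number of bytes implied by `len` is what has to stay under the transfer limit discussed in this thread.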

捕获.PNG
Accepted Solutions
hbucher
Scholar
2,852 Views
Registered: ‎03-22-2016

@saltedfishlz

SDSoC uses Vivado inside. What you perhaps want to do is create a custom platform.

Check page 146 of UG1146

https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_2/ug1146-sdsoc-platform-development.pdf

 

vitorian.com --- We do this for fun. Always give kudos. Accept as solution if your question was answered.
I will not answer to personal messages - use the forums instead.

16 Replies
hbucher
Scholar
2,199 Views
Registered: ‎03-22-2016

@saltedfishlz

8M is the hardware limit. 

I think it is defined somewhere in xaxidma.h or xaxidma_hw.h
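
As a back-of-the-envelope check, an 8 MB ceiling is what a 23-bit byte-count field would give. The sketch below assumes that width (it is an assumption, not a value read from xaxidma.h or xaxidma_hw.h) and shows why a buffer of exactly 8 MB already needs two transfers. The helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Largest byte count that fits in a length field of `bits` bits (bits < 64).
constexpr std::uint64_t max_transfer_bytes(unsigned bits) {
    return (std::uint64_t{1} << bits) - 1;
}

// Transfers needed to move `total` bytes when each moves at most `limit` bytes.
constexpr std::uint64_t transfers_needed(std::uint64_t total, std::uint64_t limit) {
    return (total + limit - 1) / limit;  // ceiling division
}

// ASSUMPTION: 23-bit length field, chosen only because it matches "8 MB".
constexpr unsigned kAssumedLengthBits = 23;
constexpr std::uint64_t kLimit = max_transfer_bytes(kAssumedLengthBits);

static_assert(kLimit == 8388607, "one byte short of 8 MiB");
static_assert(transfers_needed(8u * 1024 * 1024, kLimit) == 2,
              "a buffer of exactly 8 MiB no longer fits in one transfer");
```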

 

jhwang
Xilinx Employee
2,197 Views
Registered: ‎07-13-2009

Precisely as Henry explains.  And to be clear, the SG-DMA does not have this transfer size restriction.

saltedfishlz
Observer
2,189 Views
Registered: ‎09-01-2017

Can I break this limit manually?

Because I think DMA_SG is too slow.

By the way, may I ask the reason why ZC706's DRAM memory on PL side seems faster than that on the PS side?

Also, I wonder why ZC706 uses SODIMM DRAM for FPGA and component DRAM for ARM CPU, while ZCU102 es2 uses SODIMM DRAM for ARM CPU and component DRAM for FPGA.

 

If I want to realize the MAXIMUM bandwidth for data transfer, what should I do?

hbucher
Scholar
2,189 Views
Registered: ‎03-22-2016

@jhwang 

SGDMA - I believe the individual BDs are still subject to the limit, is that correct?
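
If each BD is indeed capped the same way (the question above is not answered in this thread), a large buffer would have to be described as a chain of BDs. Here is a minimal sketch of that splitting, with a hypothetical `Descriptor` struct and an assumed cap of 2^23 - 1 bytes. It ignores any alignment rules a real DMA engine may impose and does not use the xaxidma driver API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical descriptor, NOT the driver's BD layout.
struct Descriptor {
    std::uint64_t offset;  // byte offset into the source buffer
    std::uint64_t length;  // bytes covered by this descriptor
};

// Split `total_bytes` into consecutive pieces of at most `max_per_bd` bytes.
std::vector<Descriptor> plan_descriptors(std::uint64_t total_bytes,
                                         std::uint64_t max_per_bd) {
    assert(max_per_bd > 0);
    std::vector<Descriptor> chain;
    for (std::uint64_t off = 0; off < total_bytes; off += max_per_bd) {
        const std::uint64_t remaining = total_bytes - off;
        chain.push_back({off, remaining < max_per_bd ? remaining : max_per_bd});
    }
    return chain;
}
```

For a 20 MiB buffer and the assumed cap, `plan_descriptors(20971520, 8388607)` yields three descriptors, the last one shorter than the other two.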

 

hbucher
Scholar
2,188 Views
Registered: ‎03-22-2016

@saltedfishlz The GP/HP ports are much slower than what DDR3/4 can deliver. If I remember correctly, each HP port is 256 Mbps while DDR can reach 5 Gbps.

So try to keep everything "fast" and streaming on the PL side
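
To put numbers like these side by side, a rough transfer-time helper can be handy. This is only a sketch: the function name is made up, it assumes the link sustains its nominal rate, and the figures plugged in afterwards are the ones recalled from memory in the post above, not datasheet values.

```cpp
#include <cassert>

// Seconds needed to move `bytes` over a link sustaining `bits_per_second`.
// Ignores setup latency, burst efficiency, and contention.
double transfer_seconds(double bytes, double bits_per_second) {
    assert(bits_per_second > 0.0);
    return bytes * 8.0 / bits_per_second;
}
```

With the quoted figures, `transfer_seconds(8.0 * 1024 * 1024, 256e6)` comes to roughly 0.26 s for an 8 MiB buffer, while `transfer_seconds(8.0 * 1024 * 1024, 5e9)` comes to roughly 13 ms.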

saltedfishlz
Observer
2,183 Views
Registered: ‎09-01-2017

Did you mean that I should use 'MIG' if I have large 'static' data (data that does not change while the application is running, for example, the weight matrices in CNN forward inference)?

If I use MIG, how do I initialize the data in the PL-side memory? With zero_copy?

saltedfishlz
Observer
2,183 Views
Registered: ‎09-01-2017

I wonder how the most experienced engineers use SoCs with ARM processors in CNN applications.

Should I use the ARM CPU to continuously copy data to the PL? I think this method can reduce latency as well as FPGA BRAM consumption compared with ZERO_COPY. Is that right?

 

hbucher
Scholar
2,163 Views
Registered: ‎03-22-2016

@saltedfishlz

Yes, that is why you have DDR on the PL side. The Zynq is primarily for very high-level, serial algos that benefit from running at a much higher frequency.

With HBM (high bandwidth memory) this will all be in the past, though.

https://forums.xilinx.com/t5/Xcell-Daily-Blog/Xilinx-Virtex-UltraScale-FPGAs-incorporate-32-or-64Gbits-of-HBM/ba-p/732029

 

hbucher
Scholar
2,150 Views
Registered: ‎03-22-2016

@saltedfishlz Initialization: I have never really had the need to do it, especially with zeros, because when you use the memory you usually overwrite it anyway.

But if you need to initialize data structures, you can have a global register (static bool initialized = false) and check it when the HLS function is executed. If it has not been initialized yet, you run a loop that sets the values to zero.

Or you might create another HLS component "MemoryInitializer" and call it from the PS side when the program starts.

Or you can just set everything to zero straight from the PS, but even a loop on the PL side will take several seconds, depending on the size.
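
The first option (a static flag checked on every call) could look roughly like this. It is a minimal sketch in plain C++: the function name `accumulate_hw` and its arguments are hypothetical, and it relies on a static variable keeping its value between calls of the function.

```cpp
#include <cassert>
#include <cstddef>

// Set once the memory has been cleared; persists across calls.
static bool initialized = false;

// Hypothetical hardware function: adds `value` to every element of `mem`,
// clearing the memory first on the very first invocation only.
void accumulate_hw(int *mem, std::size_t n, int value) {
    assert(mem != nullptr || n == 0);
    if (!initialized) {
        for (std::size_t i = 0; i < n; ++i) {
            mem[i] = 0;  // one-time initialization pass
        }
        initialized = true;
    }
    for (std::size_t i = 0; i < n; ++i) {
        mem[i] += value;
    }
}
```

The cost is one extra branch per call plus a single full pass over the memory on the first call, which is the pass that can take a while for a large region.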

 

hbucher
Scholar
1,849 Views
Registered: ‎03-22-2016

@saltedfishlz

If you take away all the hype, CNNs are very very simple algorithms that can be easily coded in HLS. We do this all the time.

But typically we do the training outside on large clusters (Amazon etc) and transfer the weights to the PL DDR at boot time. 

So we only have the predict part of the algo in the device, which is just 20% of the entire AI codebase.

 

saltedfishlz
Observer
1,825 Views
Registered: ‎09-01-2017

Thank you for your kindness. I am a sophomore undergraduate in China. Some of my description may be inexact and ambiguous. I am sorry for that.

Actually, 'Initialization' is to load filter data in CNN from external storage or memory (for example, the PS side DRAM) to PL side DRAM.

By the way, when reading data from the PL-side DRAM, is a hierarchical memory system beneficial?

Thanks a lot. (I was sleeping before)

hbucher
Scholar
1,817 Views
Registered: ‎03-22-2016

@saltedfishlz Beneficial in what terms? Speed? Yes.

You can read more about it here

https://www.xilinx.com/support/documentation/white_papers/wp485-hbm.pdf

 

hbmbenefits.PNG
saltedfishlz
Observer
1,807 Views
Registered: ‎09-01-2017

Senior, thank you very much.

Are there any tools in the SDSoC and SDAccel IDEs to generate a MIG port for the PL side?

I haven't seen any instructions for this in Xilinx's SDSoC documentation.

Thanks a lot

Zheng Liang, Peking University, China

hbucher
Scholar
1,775 Views
Registered: ‎03-22-2016

@saltedfishlz It depends on your board. Some boards do not have any DDR on the PL side - I think the Zybo is an example.

On both the zc706 and zcu102 you can see the DDR memory on the Board panel (in Vivado). 

You can just drag it to your board design.

ddr4drag.PNG
saltedfishlz
Observer
1,770 Views
Registered: ‎09-01-2017

Oh, thanks a lot.

Actually, I haven't used Vivado for managing the CPU. I learned digital logic last semester and have only used Vivado for Verilog / VHDL design.

This Vivado feature seems user friendly (I hope so). Is there any control protocol that needs to be specified before using this IP core? How can I migrate the SDSoC project to Vivado?

I am deeply grateful.
