
vcu1525 maximum dram bandwidth

Visitor
Posts: 6
Registered: 06-12-2018

vcu1525 maximum dram bandwidth

Hi guys,

I was using the VCU1525 for a memory-intensive app and found a performance issue with DRAM bandwidth (kernel access to global memory). I then tried the GitHub example kernel_global_bandwidth. With multi-bank disabled, the aggregate bandwidth (RD plus WR) is around 11 GB/s, and according to the profile summary the utilization is close to 100%. When I enable 4 DDR banks, the bandwidth per bank (actually per channel) is even worse, around 4 GB/s. I wonder why that is the case. Does it mean the single-direction bandwidth (RD or WR) per bank is limited to 5-6 GB/s?

For the VCU1525 hardware, the ideal DRAM bandwidth per channel should be 2400 MT/s × 64 bits ≈ 153.6 Gb/s, which is around 19.2 GB/s. The GitHub example design uses a burst length of 16, which seems to be the maximum with a 512-bit data width, and the average latency for each burst is around 200-300 ns. This is a purely sequential memory access pattern.
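For reference, this is roughly what such a sequential-copy kernel looks like as HLS C++ with a 512-bit AXI master (a minimal sketch, not the actual kernel_global_bandwidth source; all names are illustrative, and the burst length is capped by the interface default of 16 beats unless overridden):

// Minimal sketch, assuming an HLS C++ kernel with a 512-bit AXI master.
// Not the actual kernel_global_bandwidth source; names are illustrative.
#include <ap_int.h>

typedef ap_uint<512> bus_t;   // one 64-byte AXI beat per element

extern "C" void seq_copy(const bus_t *in, bus_t *out, unsigned int num_beats) {
// Default max_read_burst_length / max_write_burst_length is 16 beats,
// i.e. 16 x 64 B = 1 KB per burst, unless overridden on the interface.
#pragma HLS INTERFACE m_axi port=in  offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=in        bundle=control
#pragma HLS INTERFACE s_axilite port=out       bundle=control
#pragma HLS INTERFACE s_axilite port=num_beats bundle=control
#pragma HLS INTERFACE s_axilite port=return    bundle=control

    // Purely sequential access: HLS infers AXI read/write bursts from
    // this pipelined loop.
    for (unsigned int i = 0; i < num_beats; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];
    }
}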

Currently our board is not available yet, so I haven't done an on-board test, but from my previous experience the hardware simulation should be close to the real performance. I just want to confirm the numbers. It seems that either the memory controller or the AXI bridge isn't doing a good job, and the read/write data paths are somehow hard-separated. Would multiplexing more AXI bridges perform better? I guess I will try that first; roughly what I have in mind is sketched below.
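A sketch of the "more AXI bridges" idea, assuming an HLS C++ kernel: give the read and write paths separate m_axi bundles so each drives its own AXI master, with the bundle-to-bank mapping decided at link time. All names are illustrative.

// Sketch only: separate m_axi bundles for the read and write paths, so the
// kernel drives two AXI masters instead of one. Names are illustrative.
#include <ap_int.h>

typedef ap_uint<512> bus_t;

extern "C" void copy_two_masters(const bus_t *src, bus_t *dst, unsigned int n) {
#pragma HLS INTERFACE m_axi port=src offset=slave bundle=gmem_rd
#pragma HLS INTERFACE m_axi port=dst offset=slave bundle=gmem_wr
#pragma HLS INTERFACE s_axilite port=src bundle=control
#pragma HLS INTERFACE s_axilite port=dst bundle=control
#pragma HLS INTERFACE s_axilite port=n   bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    for (unsigned int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        dst[i] = src[i];   // reads and writes issue on different AXI masters
    }
}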

 

Has anybody had the same experience?

Thanks for any comments and feedback.

Mian

Visitor
Posts: 6
Registered: 06-12-2018

Re: vcu1525 maximum dram bandwidth

[ Edited ]

Some updates from my further tests:

1. I have done several bandwidth tests with multiple kernels to try to reach the maximum bandwidth. The best overall throughput I can get is around 40 GB/s (aggregate RD and WR), with a structure like this (one copy kernel per DDR bank):

 ______          ______        ______          ______        ______          ______        ______          ______
|      | -----> |      |      |      | -----> |      |      |      | -----> |      |      |      | -----> |      |
| BANK |        | FIFO |      | BANK |        | FIFO |      | BANK |        | FIFO |      | BANK |        | FIFO |
|  0   | <----- |      |      |  1   | <----- |      |      |  2   | <----- |      |      |  3   | <----- |      |
|______|        |______|      |______|        |______|      |______|        |______|      |______|        |______|

2. Since the workload is basically a memory copy, there is a read-write dependency: in each dataflow pipeline the write bandwidth is determined by the read bandwidth, so we only look at reads here. Each per-bank copy kernel is structured roughly as in the sketch below.
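A sketch of one per-bank copy kernel, assuming an HLS C++ dataflow region: a read stage streams 512-bit words from the bank into an on-chip FIFO, and a write stage drains the FIFO back to the same bank. Names and the FIFO depth are illustrative.

// Sketch of one per-bank copy kernel with a dataflow region.
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<512> bus_t;

static void read_stage(const bus_t *src, hls::stream<bus_t> &fifo, unsigned int n) {
    for (unsigned int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        fifo.write(src[i]);
    }
}

static void write_stage(bus_t *dst, hls::stream<bus_t> &fifo, unsigned int n) {
    for (unsigned int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        dst[i] = fifo.read();
    }
}

extern "C" void copy_bank(const bus_t *src, bus_t *dst, unsigned int n) {
#pragma HLS INTERFACE m_axi port=src offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=dst offset=slave bundle=gmem0
#pragma HLS INTERFACE s_axilite port=src bundle=control
#pragma HLS INTERFACE s_axilite port=dst bundle=control
#pragma HLS INTERFACE s_axilite port=n   bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
#pragma HLS DATAFLOW
    hls::stream<bus_t> fifo("fifo");
#pragma HLS STREAM variable=fifo depth=64
    read_stage(src, fifo, n);    // bank -> FIFO
    write_stage(dst, fifo, n);   // FIFO -> bank
}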

According to Xilinx PG150, the memory controller performance looks like this:

[Screenshot: DDR4 memory controller efficiency table from PG150]

I suspect the sequential read/write figures assume long bursts (BL = 64, i.e. 4 KB), and that the burst length affects the achievable throughput.

3. Back to our SDAccel toolchain: the numbers I measured seem to make sense. The read bandwidth limit (4-5 GB/s) for a single channel seems to be caused by the default burst length in SDAccel, which is 16. If we can increase this BL to 64, we might get closer to the ideal DDR4 throughput.
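In HLS C/C++ that would mean raising the burst cap on the m_axi interface, roughly as below (a sketch based on the copy kernel above; 64 beats × 64 B = 4 KB, the AXI4 per-burst limit; the outstanding-transaction counts are illustrative):

// Sketch: same copy loop as before, but with the AXI master burst cap
// raised from the default 16 beats to 64 beats (64 x 64 B = 4 KB).
#include <ap_int.h>

typedef ap_uint<512> bus_t;

extern "C" void seq_copy_bl64(const bus_t *in, bus_t *out, unsigned int n) {
#pragma HLS INTERFACE m_axi port=in  offset=slave bundle=gmem0 max_read_burst_length=64 num_read_outstanding=16
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1 max_write_burst_length=64 num_write_outstanding=16
#pragma HLS INTERFACE s_axilite port=in     bundle=control
#pragma HLS INTERFACE s_axilite port=out    bundle=control
#pragma HLS INTERFACE s_axilite port=n      bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    for (unsigned int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];
    }
}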

 

By the way, it seems that FPGA DRAM bandwidth is not a big advantage compared to GPUs, or even CPUs with SIMD.

Any comments or feedback are welcome.

 

Thanks,

Mian

 

 

 

Visitor
Posts: 6
Registered: 06-12-2018

Re: vcu1525 maximum dram bandwidth

A few more updates.

 

Last time I was wrong that the read bandwidth is limited to 4-5 GB/s per channel because of the burst length. I did more tests of the memory bandwidth (with the SDAccel toolchain) using different memory access patterns. All of the following tests use a single channel.

1) 100% Read (sequential)

I varied BL from 2 to 64. The throughput scales from 9.15 to 12.2 GB/s.

2) 100% Write (sequential)

I also varied BL from 2 to 64. The throughput is flat at 2 GB/s.

3) 50% Read, 50% Write (sequential)

The aggregate throughput scales slightly, from 10.5 to 11.4 GB/s.

 

These results don't make sense to me. Why is the 100% write performance even lower than the mixed workload? For the AXI master, I only set max_read_burst_length/max_write_burst_length; the other parameters are defaults.
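For reference, the 100% write case is basically just this (a sketch, assuming an HLS C++ kernel; the read-only and mixed cases swap or combine the loop body, and all names are illustrative):

// Sketch of the 100% sequential-write test: a pipelined loop writing a
// constant pattern, from which HLS infers AXI write bursts up to the
// configured max_write_burst_length.
#include <ap_int.h>

typedef ap_uint<512> bus_t;

extern "C" void write_only(bus_t *dst, unsigned int n) {
#pragma HLS INTERFACE m_axi port=dst offset=slave bundle=gmem0 max_write_burst_length=64
#pragma HLS INTERFACE s_axilite port=dst    bundle=control
#pragma HLS INTERFACE s_axilite port=n      bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    const bus_t pattern = 0xA5;
    for (unsigned int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        dst[i] = pattern;   // purely sequential writes
    }
}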

Is this an AXI master scheduler problem, or a simulation tool problem?

 

Thanks,

Mian