06-12-2018 05:50 PM
I was using a VCU1525 for a memory-intensive app and found a performance issue with DRAM bandwidth (kernel access to global memory). I then tried the github example kernel_global_bandwidth. With multi-bank disabled, the aggregate bandwidth (RD plus WR) is around 11 GB/s, and according to the profile summary the utilization is close to 100%. When I enable 4 DDR banks, the bandwidth per bank (actually per channel) is even worse, around 4 GB/s. I wonder why that is. Does it mean the single-direction bandwidth (RD or WR) per bank is limited to 5-6 GB/s?
For the VCU1525 hardware, the ideal DRAM bandwidth per channel should be 2400 MT/s x 64 bits, which is around 19.2 GB/s. The github example design uses a burst length of 16, which seems to be the maximum with a 512-bit data width, and the average latency for each burst is around 200-300 ns. This is a purely sequential memory access pattern.
Our board is not available yet, so I haven't done an on-board test, but from my previous experience the hw simulation should be close to the real performance. I just want to confirm the numbers. It seems either the memory controller or the AXI bridge isn't doing a good job, and the read/write data paths are somehow hard-separated. Would multiplexing more AXI bridges perform better? I guess I will try that first.
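In HLS terms, the "more AXI bridges" idea would roughly mean giving each pointer argument its own m_axi bundle, so each becomes a separate AXI master port that the linker can map to a different bank. A hedged sketch (the kernel name copy_two_ports and bundle names gmem0/gmem1 are my own, not from the github example; the pragmas are ignored by a plain C++ compiler but show the intent):

```cpp
#include <cstdint>

// Hypothetical kernel with two independent AXI masters, one per
// pointer pair.  Each "bundle" becomes its own AXI interface.
extern "C" void copy_two_ports(const uint64_t *in0, uint64_t *out0,
                               const uint64_t *in1, uint64_t *out1,
                               int n) {
#pragma HLS INTERFACE m_axi port=in0  bundle=gmem0
#pragma HLS INTERFACE m_axi port=out0 bundle=gmem0
#pragma HLS INTERFACE m_axi port=in1  bundle=gmem1
#pragma HLS INTERFACE m_axi port=out1 bundle=gmem1
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out0[i] = in0[i];  // traffic on AXI master gmem0
        out1[i] = in1[i];  // traffic on AXI master gmem1, in parallel
    }
}
```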
Has anybody had the same experience?
Thanks for any comments and feedback.
06-13-2018 11:00 AM - edited 06-13-2018 11:05 AM
Some updates from my further tests.
1, I have done several bandwidth tests with multiple kernels to reach the maximum bandwidth. The best overall throughput I can get is around 40 GB/s (aggregate RD and WR). Something like this:
 ______        ______     ______        ______     ______        ______     ______        ______
|      |----->|      |   |      |----->|      |   |      |----->|      |   |      |----->|      |
| BANK |      | FIFO |   | BANK |      | FIFO |   | BANK |      | FIFO |   | BANK |      | FIFO |
|  0   |<-----|      |   |  1   |<-----|      |   |  2   |<-----|      |   |  3   |<-----|      |
|______|      |______|   |______|      |______|   |______|      |______|   |______|      |______|
2, Since the workload is basically a memory copy, it has a read-write dependency: in each dataflow pipeline the write bandwidth is determined by the read bandwidth, so we only look at reads here.
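Each per-bank pipeline above is essentially a dataflow memcpy: a read stage bursts from the bank into a FIFO and a write stage drains it back. A rough stand-in for illustration (function names are mine, and std::queue stands in for what would be an hls::stream under #pragma HLS DATAFLOW in real kernel code):

```cpp
#include <cstdint>
#include <queue>

// Read stage: burst-read from global memory into the FIFO.
static void read_stage(const uint64_t *src, std::queue<uint64_t> &fifo, int n) {
    for (int i = 0; i < n; ++i)
        fifo.push(src[i]);
}

// Write stage: drain the FIFO back to global memory.  Every write
// consumes a value the read stage produced, so write bandwidth can
// never exceed read bandwidth -- the dependency mentioned above.
static void write_stage(uint64_t *dst, std::queue<uint64_t> &fifo, int n) {
    for (int i = 0; i < n; ++i) {
        dst[i] = fifo.front();
        fifo.pop();
    }
}

extern "C" void bank_copy(const uint64_t *src, uint64_t *dst, int n) {
    // In HLS the two stages would run concurrently under DATAFLOW,
    // connected by a small hls::stream FIFO; here they run in sequence.
    std::queue<uint64_t> fifo;
    read_stage(src, fifo, n);
    write_stage(dst, fifo, n);
}
```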
According to Xilinx PG150, the memory controller performance looks like this:
I suspect the sequential read/write numbers there assume long bursts (BL=64, i.e. 4 KB per burst), and that the burst length affects the throughput.
3, Back to our SDAccel toolchain: the numbers I measured seem to make sense. The read bandwidth limit (4-5 GB/s) for a single channel seems to be caused by the default BL in SDAccel, which is 16. If we can increase this BL to 64, we might get closer to the ideal DDR4 throughput.
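As far as I can tell, raising the burst length is just the max_read_burst_length / max_write_burst_length options on the m_axi interface pragma. A hedged sketch (kernel name copy_bl64 is my own; the pragmas are ignored by a plain compiler):

```cpp
#include <cstdint>

// Hypothetical kernel asking the tool for BL=64 bursts instead of the
// default 16 (with a 512-bit port, 64 beats would be 4 KB per burst).
extern "C" void copy_bl64(const uint64_t *in, uint64_t *out, int n) {
#pragma HLS INTERFACE m_axi port=in  bundle=gmem max_read_burst_length=64
#pragma HLS INTERFACE m_axi port=out bundle=gmem max_write_burst_length=64
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];  // simple sequential access so bursts can be inferred
    }
}
```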
By the way, it seems that FPGA DRAM bandwidth is not a big advantage compared to GPUs, or even CPUs with SIMD.
Any comments or feedback are welcomed.
06-15-2018 09:12 AM
More following updates.
Last time I was wrong that read bandwidth is limited to 4-5 GB/s per channel by burst length. I did more tests on memory bandwidth (with the SDAccel toolchain) using different memory access patterns. All of the following tests use a single channel.
1) 100% read (sequential)
I varied BL from 2 to 64. The throughput scales from 9.15 to 12.2 GB/s.
2) 100% write (sequential)
I also varied BL from 2 to 64. The throughput is flat at 2 GB/s.
3) 50% read, 50% write (sequential)
The aggregate throughput scales slightly, from 10.5 to 11.4 GB/s.
These results don't make sense to me. Why is 100% write performance even lower than the mixed workload? For the AXI master, I only set max_read_burst_length/max_write_burst_length; the other parameters are defaults.
Is this an AXI master scheduler problem, or a simulation tool problem?