Adventurer

Describe Off-chip Memory with Specific Bandwidth


Hi,

In the implementation of a module in Vivado HLS using C++, the input data of my top-level module is stored in off-chip memory (e.g. some sort of DDR). I load the off-chip data part by part into on-chip memories implemented as BRAMs. After processing the data in BRAM, the result, which is already in BRAM, is stored back to off-chip memory, then another part of the off-chip input data is loaded, and so on.

void my_top_module(
	float din[10][20][100],
	float dout[10][20][100],
	...
)
{
	float A[20][100], B[20][100];
	#pragma HLS array_partition variable=A   complete dim=0
	#pragma HLS array_partition variable=B complete dim=0
	#pragma HLS RESOURCE variable=A core=RAM_2P_BRAM
	#pragma HLS RESOURCE variable=B core=RAM_2P_BRAM

	for (int i = 0; i < 10; i++) {
		// load din[i] into A
		// process the data in A, writing results into B
		// store B into dout[i]
	}
}

My module frequency is 100 MHz.

I expect the off-chip memory bandwidth to be near (and no more than) a specific amount, for example 1 GB/s or 1.5 GB/s.

How can I do this? Which pragmas and interfaces must be used? Is there any need for a specific loop structure to perform the load(A) and store(B) operations?

Thanks in advance,

Ali


Accepted Solutions
Advisor

HLS doesn't know or care about external memory bandwidth. It just knows that there's something on the AXI bus that it can read from (or write to). For throughput calculations it assumes that it'll always be able to read/write immediately, although the synthesized hardware is happy to wait if that's required.

 

For loading data into A, the standard approach would be either memcpy or a pipelined loop:

for (int i = 0; i < LENGTH_OF_A; i++) {
#pragma HLS PIPELINE II=1
	A[i] = din[x][y][i];
}
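The memcpy alternative mentioned above could look like the sketch below. This is plain C++ (runnable outside HLS); the function name and the fixed row length of 100 are illustrative placeholders taken from the question's array dimensions. In Vivado HLS, a memcpy from a top-level array argument mapped to an AXI master port is typically inferred as a burst transfer.

```cpp
#include <cstring>

// Burst-copy one 100-element row of the off-chip input into the
// on-chip buffer A. HLS usually turns a memcpy from an AXI master
// argument into a single burst read.
void load_row(float A[100], const float din_row[100]) {
    std::memcpy(A, din_row, 100 * sizeof(float));
}
```

A matching memcpy in the other direction would handle the store of B back to dout.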

If you really need throughput, you can potentially partition A and perform wider reads. For example, you might read four 32-bit elements at once, which at 100MHz will be 1.6GB/s. However, keep in mind that this increases hardware within the block (because A will require a wider BRAM) and also increases resources for all your AXI infrastructure (because you're now using a 128-bit AXI bus).
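As a rough illustration of the unpacking step this implies: in HLS the wide word would arrive as an ap_uint<128> on the AXI bus, but the same idea can be modeled in plain C++ as below. The Bus128 struct and unpack function are made-up names for this sketch, not HLS API.

```cpp
#include <cstdint>
#include <cstring>

// Model of one 128-bit bus word carrying four packed 32-bit floats.
struct Bus128 { uint32_t w[4]; };

// Unpack one wide word into four consecutive elements of A.
// In HLS, A would need partitioning (e.g. cyclic, factor=4) so that
// all four writes can land in the same cycle.
void unpack(const Bus128 &word, float A[4]) {
    for (int j = 0; j < 4; j++) {
        std::memcpy(&A[j], &word.w[j], sizeof(float)); // bit-reinterpret
    }
}
```

At 100 MHz, one such 16-byte word per cycle is the 1.6 GB/s figure above.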

 

Normally, the preferred option is to simply read one element per cycle. It's not the fastest, but it's normally fast enough - and it does greatly simplify the design.


5 Replies

Adventurer

Hi,

Thanks @u4223374 for your response.

But if I want to model a real off-chip memory (e.g., DDR2 with a specific bandwidth), what can I do?

I mean, how can I specify an exact bandwidth, or an upper bound on it?

In other words, I need off-chip communication with higher bandwidth, and as you said, I can achieve it by partitioning the on-chip memory arrays and performing more parallel load/store operations. But it is not unlimited, and I need an upper limit on bandwidth during my simulations and experiments.

Thanks,

Ali

Xilinx Employee

Hi Ali,

  If you are running just with HLS, I don't think we have a way to mimic the external interfaces. You can use the SDAccel flow, which might serve your purpose. Use HLS to generate the xo file and import it into the SDAccel environment. You can profile your code there to see how much latency/bandwidth the IO transactions and your kernel will take.

Thanks,

Nithin

Advisor

@akokha In HLS, all you can do is design your system so it will definitely stay within the available bandwidth. For example, with 1GB/s bandwidth and a 100MHz clock, you might choose to do 64-bit transfers - which will use 800MB/s.
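The arithmetic behind that suggestion can be sketched as a toy calculation (plain C++, not HLS code; the function name is made up): a bus that moves one beat per clock cycle sustains at most width-in-bytes times clock frequency.

```cpp
// Peak sustained bandwidth of a bus moving one beat per clock cycle:
// (bus width in bytes) * (clock frequency in Hz).
long long bandwidth_bytes_per_sec(int bus_width_bits, long long clock_hz) {
    return (long long)(bus_width_bits / 8) * clock_hz;
}
```

For example, bandwidth_bytes_per_sec(64, 100000000) gives 800000000, i.e. 800 MB/s, which stays under a 1 GB/s budget; a 128-bit bus at the same clock would give 1.6 GB/s and exceed it.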

Visitor

Hi,

I had 2 questions:

1) How do we connect the external DDR to our custom IP? I mean, when we import the custom IP into the Vivado Design Suite, how are the connections made? Do we need to add a DDR controller?

2) Let's say the memory is word-addressable, each word is 32 bits, and the memory bandwidth is 3 Gbps. How can we read four 32-bit words in one cycle? Do we need to partition the custom IP's memory port into 4 memory ports?

I am a bit confused with off-chip memory access.
