Visitor
366 Views
Registered: ‎06-10-2019

Forcing writing of an 8-bit array to a 32-bit AXI to be done 32-bits at a time


Let's assume I have an array of 8-bit values, and my kernel has a 32-bit AXI interface, like so:

void mythread(volatile uint32_t* axi) {
	#pragma HLS INTERFACE m_axi depth=50 port=axi
	#pragma HLS INTERFACE s_axilite port=return
	uint8_t arr[DATA_SZ];
}

According to the HLS user guide (p. 115), performing a burst transaction on the interface requires either memcpy or a pipelined for loop. Let's set memcpy aside: with Vivado HLS 2018.3 it does not seem to work when the data width differs from the interface width, and in my tests it otherwise behaves identically to a pipelined for loop.

 

The problem

Now, if we try to perform a burst write of the 8-bit array to the 32-bit AXI interface using a pipelined for loop, Vivado HLS does not read the array 32 bits at a time. Instead, it reads it 8 bits at a time and writes to the interface only after four reads. I can't see any reason why the array should be read byte by byte, even if it is an 8-bit array. For example, if the array is implemented as a BRAM, then doing a 32-bit read on it is perfectly equivalent to doing four consecutive 8-bit reads, and is perfectly synthesizable.

Note that I also tested reading from the AXI interface into the array, and the same problem occurs: 32 bits are read from the interface, then four 8-bit writes to the array take place.

 

The question

Assuming I know what I'm doing, is there a way to force Vivado HLS to implement a 32-bit read on the array? I've tried implementing the transfer inside the pipelined for loop using a C cast, a reinterpret_cast, as well as by moving the data to a temporary 32-bit buffer.

 

Test with Vivado HLS 2018.3

The minimal self-contained .cpp file at the end of this question copies the data from the array to the interface. I synthesized it for a Zynq-7000 (xc7z020clg484-1, to be exact). As the schedule viewer shows, Vivado HLS implements each 32-bit transfer as four 8-bit reads of the array followed by one 32-bit write on the interface. This is problematic: if the array is implemented as a BRAM, only two of these four reads can be done per cycle, since the BRAM has only two ports. The pipelined for loop therefore has an initiation interval of 2 cycles rather than 1, meaning the AXI interface is idle 50% of the time, which essentially doubles the transfer time.
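The initiation interval of 2 can be checked with a simple port-count bound. The snippet below is only a sketch of that arithmetic; `min_ii_from_ports` is a hypothetical helper written for illustration, not something provided by Vivado HLS, and the real scheduler takes more constraints into account than this one.

```cpp
// Lower bound on the initiation interval (II) imposed by memory ports:
// ceil(array accesses per loop iteration / ports usable per cycle).
// Hypothetical helper for illustration only; not a Vivado HLS API.
constexpr unsigned min_ii_from_ports(unsigned accesses_per_iter, unsigned ports) {
    return (accesses_per_iter + ports - 1) / ports;
}

// Byte-wide array: four 8-bit reads per 32-bit word, dual-port BRAM.
static_assert(min_ii_from_ports(4, 2) == 2, "byte-wide array gives II >= 2");

// Word-wide array: one 32-bit read per word, dual-port BRAM.
static_assert(min_ii_from_ports(1, 2) == 1, "word-wide array allows II = 1");
```

With four byte reads per iteration and two ports, the bound is 2, which matches the schedule reported above. A single 32-bit read per iteration would bring the bound down to 1.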

[Image: 8bx4w.jpg, schedule viewer screenshot showing four 8-bit array reads per 32-bit AXI write]

Here is the complete .cpp file I used for this test:

#include <inttypes.h>

typedef uint32_t axi_data_t;
typedef uint8_t data_t;
#define NUM_32b_ELEMS 128
#define DATA_SZ (NUM_32b_ELEMS * (sizeof(axi_data_t) / sizeof(data_t)))

struct Foo {
	uint32_t offset;
	data_t arr[DATA_SZ];
};

void mythread(volatile axi_data_t* axi) {
	#pragma HLS INTERFACE m_axi depth=50 port=axi
	#pragma HLS INTERFACE s_axilite port=return

	Foo foo = {};
	uint32_t addr_base = 0;

	// Initialize array from memory (initializing it with
	// e.g. the index makes Vivado HLS instantiate the
	// array as a ROM, when we really want a BRAM for this test).
	dummy: for (unsigned int i = 0; i < NUM_32b_ELEMS; ++i) {
		#pragma HLS pipeline
		((uint32_t*)foo.arr)[i] = axi[addr_base / sizeof(axi_data_t) + i];
	}

	thr: while (1) {
		wr: for (unsigned int i = 0; i < NUM_32b_ELEMS; ++i) {
			#pragma HLS pipeline
			axi_data_t data = ((uint32_t*)foo.arr)[i];
			axi[addr_base / sizeof(axi_data_t) + i] = data;
		}
		addr_base += sizeof(axi_data_t) * (sizeof(foo.arr) / sizeof(axi_data_t));
	}
}
Accepted Solutions
Scholar
279 Views
Registered: ‎04-26-2015

Re: Forcing writing of an 8-bit array to a 32-bit AXI to be done 32-bits at a time


I'm not sure I understand this claim:

"For example, if the array is implemented as a BRAM, then doing a 32-bit read on it is perfectly equivalent to doing four consecutive 8-bit reads, and is perfectly synthesizable."

If the array is implemented as a block RAM, then HLS will have used a 2Kx9 block RAM unless you have explicitly told it otherwise (e.g. via array_reshape, which comes with its own limitations). HLS does not use different widths for different ports; if it's 2Kx9, then you get two 9-bit ports. Two 9-bit ports can deliver at most two bytes per clock cycle, so this configuration cannot provide one 32-bit element per clock cycle.

 

The standard solutions are to use array_reshape (and accept that writing one byte to the RAM will now require a read-modify-write cycle) or use a ~16x32-bit buffer for the AXI burst (which doesn't get you any more throughput, but does mean that you're not occupying the AXI bus for so long).
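As a rough sketch of the array_reshape option, the code below reshapes the byte array cyclically by a factor of 4, so that four consecutive bytes share one 32-bit memory word, and then packs them into a single AXI word per iteration. The names `pack4` and `write_words`, the constants, and the placement of the pragma on a function argument are assumptions made for illustration; in the question's kernel the pragma would go on the local array instead. Whether the loop actually reaches an initiation interval of 1 has to be confirmed in the synthesis report. A normal C++ compiler ignores the HLS pragmas, so the packing logic can be checked in software.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed sizes, mirroring NUM_32b_ELEMS and DATA_SZ from the question.
constexpr std::size_t NUM_WORDS = 128;
constexpr std::size_t DATA_SZ = NUM_WORDS * sizeof(uint32_t);

// Pack four bytes into one 32-bit word, lowest byte first (little-endian),
// which matches what the uint32_t* cast in the question yields on a
// little-endian host. Hypothetical helper, not an HLS library function.
inline uint32_t pack4(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3) {
    return static_cast<uint32_t>(b0)
         | (static_cast<uint32_t>(b1) << 8)
         | (static_cast<uint32_t>(b2) << 16)
         | (static_cast<uint32_t>(b3) << 24);
}

// Copy the byte array to a 32-bit destination, one word per iteration.
void write_words(const uint8_t arr[DATA_SZ], volatile uint32_t* axi) {
    // Cyclic reshape by 4: bytes 4*i .. 4*i+3 end up in the same wide word,
    // so one memory read can supply all four bytes of an iteration.
#pragma HLS array_reshape variable=arr cyclic factor=4 dim=1
    for (std::size_t i = 0; i < NUM_WORDS; ++i) {
#pragma HLS pipeline
        axi[i] = pack4(arr[4 * i], arr[4 * i + 1], arr[4 * i + 2], arr[4 * i + 3]);
    }
}
```

The cost is the one already mentioned: once the array is stored as 32-bit words, writing a single byte elsewhere in the design becomes a read-modify-write of the whole word.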

 

 
