cferr
Visitor
1,355 Views
Registered: ‎02-21-2018

Cosimulation and memory aliasing

Hi,

I am encountering an issue with co-simulation. It seems that HLS co-simulation doesn't account for address aliasing that happens in the testbench. In other words, writing to an address through one (outbound) port, then reading back the same address through another (inbound) port, returns the original value at that address, not the updated one. The issue only seems to occur when different ports access the same address, not when the same port is used for both input and output.

Here are two pieces of code that show the issue more precisely. I define two functions: bidir_io_a (with two unidirectional ports for array b, one for input and one for output) and bidir_io_b (with a single bidirectional port for b). Synthesizing bidir_io_a and running co-simulation gives the erroneous behavior, whereas bidir_io_b gives the expected result.

  • io.cpp 
#include "io.h"

void fpga_memcpy(ap_uint<32>* dst, ap_uint<32>* src, unsigned length) {
#pragma HLS INLINE
	for(unsigned i = 0; i < length; i++) {
#pragma HLS PIPELINE
		dst[i] = src[i];
	}
}

void bidir_io_a(ap_uint<32> a[32], volatile ap_uint<32> b_in[32], volatile  ap_uint<32> b_out[32])
{
#pragma HLS INTERFACE m_axi port=b_in depth=8
#pragma HLS INTERFACE m_axi port=b_out depth=8

	ap_uint<32> a_local[8];
	ap_uint<32> b_local[8];

	for(unsigned int i = 0; i < 16; i++) {
		for(unsigned short j = 0; j < 4; j++) {
			fpga_memcpy(a_local, a + (j << 3), 8);
			fpga_memcpy(b_local, (ap_uint<32>*)(b_in) + (j << 3), 8);
			for(unsigned short jj = 0; jj < 8; jj++) {
				b_local[jj] += a_local[jj];
			}
			fpga_memcpy((ap_uint<32>*)(b_out) + (j << 3), b_local, 8);
		}
	}


}

void bidir_io_b(ap_uint<32> a[32], volatile ap_uint<32> b[32])
{
#pragma HLS INTERFACE m_axi port=b depth=8

	ap_uint<32> a_local[8];
	ap_uint<32> b_local[8];

	for(unsigned int i = 0; i < 16; i++) {
		for(unsigned short j = 0; j < 4; j++) {
			fpga_memcpy(a_local, a + (j << 3), 8);
			fpga_memcpy(b_local, (ap_uint<32>*)(b) + (j << 3), 8);
			for(unsigned short jj = 0; jj < 8; jj++) {
				b_local[jj] += a_local[jj];
			}
			fpga_memcpy((ap_uint<32>*)(b) + (j << 3), b_local, 8);
		}
	}


}
  • io.h 
#ifndef _IO_H
#define _IO_H

#include <ap_int.h>

void bidir_io_a(ap_uint<32> a[32], volatile ap_uint<32> b_in[32], volatile  ap_uint<32> b_out[32]);
void bidir_io_b(ap_uint<32> a[32], volatile ap_uint<32> b[32]);

#endif //_IO_H
  • Testbench : main.cpp
#include <iostream>
#include <cstdlib>  // malloc, free
#include <cstring>
#include <ap_int.h>
#include "io.h"

using namespace std;


int main(void)
{
	ap_uint<32>* a = (ap_uint<32>*)malloc(32 * sizeof(ap_uint<32>));
	ap_uint<32>* b = (ap_uint<32>*)malloc(32 * sizeof(ap_uint<32>));

	for(unsigned i = 0; i < 32; i++) {
		a[i] = i;
		b[i] = 0;
	}

	bidir_io_a(a, b, b);
	// bidir_io_b(a, b);

	int ret = 0;

	for(unsigned j = 0; j < 32; j++){
		if(b[j] != (a[j] << 4)) {
			std::cout << "Error at j = " << j << " b = " << b[j] << " a = " << a[j] << std::endl;
			ret = 1;
			break;
		}
	}

	free(a);
	free(b);

	return ret;
}

 

The issue is also visible in the resulting waveform:

[Figure: Waveform for erroneous test: reads and writes to array b are shown in red.]

 

As the above figure shows, the co-simulator doesn't take into account that the same array (defined in the testbench as b) gets read from and written to through those two arguments. Instead, reading from port b_in always gives the initial values contained in array b (i.e. zeros).
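To make the suspected behavior concrete, here is a minimal software-only sketch of the two cases. It is not the co-simulator's actual implementation: the "private copy per port" model is only an assumption that matches the waveform, and the names accumulate, run_aliased and run_private_copies are hypothetical. uint32_t stands in for ap_uint<32> so the sketch builds without the HLS headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain C++ stand-in for the kernel body: 16 passes of b_out = b_in + a.
void accumulate(const uint32_t* a, const uint32_t* b_in, uint32_t* b_out,
                std::size_t n) {
    for (unsigned pass = 0; pass < 16; ++pass) {
        for (std::size_t k = 0; k < n; ++k) {
            b_out[k] = b_in[k] + a[k];
        }
    }
}

// What C simulation does: both arguments point to the same storage, so each
// pass reads the values written by the previous pass.
std::vector<uint32_t> run_aliased(const std::vector<uint32_t>& a) {
    std::vector<uint32_t> b(a.size(), 0);
    accumulate(a.data(), b.data(), b.data(), a.size());
    return b;
}

// Assumed model of the observed co-simulation behavior: each port gets its own
// image of the array, so reads through b_in never see writes through b_out.
std::vector<uint32_t> run_private_copies(const std::vector<uint32_t>& a) {
    const std::vector<uint32_t> b_in_image(a.size(), 0);  // stays at initial values
    std::vector<uint32_t> b_out_image(a.size(), 0);
    accumulate(a.data(), b_in_image.data(), b_out_image.data(), a.size());
    return b_out_image;
}
```

With a[k] = k, run_aliased yields 16 * a[k] (what the testbench expects), while run_private_copies yields only a[k], which is the kind of mismatch the testbench reports.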

This issue matters for SDSoC users who wish to update an array and read it back later on. Function bidir_io_b wouldn't be synthesized through SDSoC, as in+out ports like argument b are not supported AFAIK; hence the use of two unidirectional ports, as in bidir_io_a, as a workaround.

Does this issue arise because of a limitation in the co-simulation engine, or because my SDSoC-compliant designs (which turn out to work on the board) violate some I/O design principle?

Thanks,

 - Corentin

0 Kudos
7 Replies
xilinxacct
Instructor
1,341 Views
Registered: ‎10-23-2018

@cferr

If you don't use malloc/free, do you get what you want?

Hope that helps

If so, please mark as solution accepted. Kudos also welcomed. :-)

0 Kudos
cferr
Visitor
1,338 Views
Registered: ‎02-21-2018

@xilinxacct, thanks for your response.

However, getting rid of malloc() / free() doesn't seem to help, as co-simulation still fails in the bidir_io_a case.

Could you please explain why not using dynamic allocation would help co-simulation?

In case it helps: I use sds_alloc() with SDSoC, which is not available in Vivado HLS simulation, so I replace it with malloc() there.

0 Kudos
xilinxacct
Instructor
1,328 Views
Registered: ‎10-23-2018

@cferr

I cannot say for sure it would actually do anything different, but as I am sure you know, there are certain things that are not synthesizable, and malloc is one of them. You can indeed use them on the testbench side, but since you are sending the pointer across the boundary, I just tend to avoid such things, by habit. :-) 'Maybe' it is ok.

0 Kudos
cferr
Visitor
1,321 Views
Registered: ‎02-21-2018

@xilinxacct You're right about the synthesis restrictions. Unfortunately, this doesn't look like the source of my issue.

FYI: Before I started with SDSoC, I used to pass pointers (allocated in a reserved memory space by a dedicated allocator) through a memory-mapped slave to the accelerator, that would then run its own DMA. I have never encountered any issues in doing that so far, whether I did it in HLS or RTL.

I don't think that co-simulation has issues with dynamic allocation in the testbench either, since the bidir_io_b case works well with co-simulation and dynamic allocation...

Any more ideas?

  - Corentin

0 Kudos
xilinxacct
Instructor
1,310 Views
Registered: ‎10-23-2018

@cferr

I vaguely recall some restrictions on aliasing... but I can't immediately find the reference, so take that with a grain of salt. At the moment, I can't break away to look into it further.

 

0 Kudos
xilinxacct
Instructor
1,279 Views
Registered: ‎10-23-2018

@cferr

Is there a reason you 'need' to do the aliasing? I just ran your code in the cosimulator, and sent the results back into b_in...

//fpga_memcpy((ap_uint<32>*)(b_out) + (j << 3), b_local, 8);
fpga_memcpy((ap_uint<32>*)(b_in) + (j << 3), b_local, 8);

And your testbench passed.

As you were saying, the 'b' version worked... one difference is the 2 vs 1 AXI ports... It seems like the alias has lots of chances of being fooled.

Hope that helps

If so, please mark as solution accepted. Kudos also welcomed. :-)

0 Kudos
cferr
Visitor
1,209 Views
Registered: ‎02-21-2018

Thanks @xilinxacct for your answers so far. Yes, I see at least two reasons to do that kind of aliasing:

1. SDSoC does not seem to support memory ports used both ways. Unless I'm doing it wrong, I need to set two unidirectional ports to read an array and write back into it.

2. Any iterative process where

  • input data is too big to fit into a local memory of the accelerator
  • data needs to be processed several times, iteratively
  • each iteration depends on the previous one being completed

will at some point have to use the same array for both reading and writing, and that array will be in an external memory large enough to hold it.

So I don't know how to implement an iterative process like the one in point 2 with SDSoC without using the aliasing hack to get past the single-direction port limitation. (Know a better way to achieve this? I'm open to suggestions :))

Another use of aliasing, though possibly hacky: I'm playing with the #pragma HLS DATAFLOW directive, and it turns out we can't write to an array if it's being read at some earlier point in the dataflow. My guess is that this is a conservative behavior that ensures no inter-iteration data dependence exists that could be broken by writing too late (after the next iteration has started), which is what makes pipelining possible. A hack to allow this anyway is to use an aliased pointer when calling the accelerator: HLS then does not notice the aliasing, and updating the array becomes possible.

 

I ended up writing a co-simulation-friendly version of the testbench, that acts as if the array was not actually written to between iterations, and I check my results in this degraded mode (I can only check one iteration, though).
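A minimal sketch of what such a degraded-mode check could look like is given below. It assumes that, under co-simulation, every read through b_in returns the initial contents of b, so the result equals a single accumulation. The helper name first_mismatch and its effective_passes parameter are hypothetical, and uint32_t stands in for ap_uint<32>.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the index of the first element of b_result that differs from
// b_initial[j] + effective_passes * a[j], or -1 if every element matches.
// effective_passes = 16 models C simulation (reads see earlier writes);
// effective_passes = 1 models the degraded co-simulation case, where reads
// always return the initial contents of b.
long first_mismatch(const uint32_t* a, const uint32_t* b_initial,
                    const uint32_t* b_result, std::size_t n,
                    unsigned effective_passes) {
    for (std::size_t j = 0; j < n; ++j) {
        const uint32_t expected = b_initial[j] + effective_passes * a[j];
        if (b_result[j] != expected) {
            return static_cast<long>(j);
        }
    }
    return -1;
}
```

The testbench would keep a copy of b taken before calling the accelerator and pass it as b_initial, choosing effective_passes according to whether it runs in C simulation or co-simulation.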

0 Kudos