gdeest (Visitor)
3,067 Views
Registered: 01-22-2015

Burst access latency

Hi,

 

I am trying to better understand the factors influencing burst access latency when using either the ACP or the AFI (HP) ports.

 

To this end, I have written two versions of an IP that performs a given number of read or write accesses, all of the same length, within a 2048-byte memory region. The code of the first version is:

 

#define ARR_LEN 256

#define K 677

data_t buff_read[256];
data_t buff_write[256];

 

int do_reads(data_t *arr_read, int len_read, int nread) {
    for (int i=0; i<nread; i++) {
        memcpy(buff_read, arr_read + ((K*i) % (ARR_LEN-256)), len_read*sizeof(data_t));
    }

    return buff_read[0];  // Read in buff_read to prevent dead code elimination
}

int do_writes(data_t *arr_write, int len_write, int nwrite) {
    for (int i=0; i<nwrite; i++) {
        memcpy(arr_write + ((K*i) % (ARR_LEN-256)), buff_write, len_write*sizeof(data_t));
    }

    return 0;
}

 

The second version is a slight variant of the first, in which the memory accesses are performed by a non-inlined function, to keep Vivado HLS from fusing everything and to make the schedule easier to understand:

 

#define ARR_LEN 256

#define K 677

data_t buff_read[256];
data_t buff_write[256];

 

void do_read(data_t *addr, size_t n) {
#pragma HLS INLINE off
    memcpy(buff_read, addr, n);
}

void do_write(data_t *addr, size_t n) {
#pragma HLS INLINE off
    memcpy(addr, buff_write, n);
}

int do_reads(data_t *arr_read, int len_read, int nread) {
    data_t *addr = arr_read;

 

    for (int i=0; i<nread; i++) {
        do_read(addr, len_read*sizeof(data_t));
        addr = arr_read + ((K*i) % (ARR_LEN - 256));
    }

    return buff_read[0];
}

int do_writes(data_t *arr_write, int len_write, int nwrite) {
    data_t *addr = arr_write;

 

    for (int i=0; i<nwrite; i++) {
        do_write(addr, len_write*sizeof(data_t));
        addr = arr_write + ((K*i) % (ARR_LEN - 256));
    }

    return 0;
}

 

The two IPs share the same top-level function, which lets me perform concurrent read/write accesses, although for the time being I am only doing reads with no writes, or the other way around.

 

int do_exp(data_t *arr_read, data_t *arr_write, int len_read, int len_write, int nread, int nwrite) {
#pragma HLS DATAFLOW

    for (int i=0; i<256; i++) {
        buff_write[i] = 1;
    }

    int ret_read = do_reads(arr_read, len_read, nread);
    int ret_write = do_writes(arr_write, len_write, nwrite);

    return ret_read + ret_write;
}

 

Using these two IPs, I have tried to estimate burst latencies. To do this, I run a large number of bursts of a given size, measure the average number of CPU cycles required to complete all the accesses, and compute the per-burst latency, in FPGA cycles, with the following formula:

 

FPGA_CYCLES = (CPU_CYCLES * FPGA_FREQ) / (CPU_FREQ * NUMBER_OF_BURSTS) - BURST_LENGTH.

 

(With CPU_FREQ = 800 MHz in my case, as I am running my experiments on a ZC706 board.)

 

Using this setup, I get the following results:

 

IP Version | Port | Freq (MHz) | Burst Length | Read Latency (cycles) | Write Latency (cycles)
-----------|------|------------|--------------|-----------------------|-----------------------
1          | ACP  | 142.86     | 2            | 6.9                   | 8.3
1          | ACP  | 142.86     | 256          | 27                    | 92.8
1          | ACP  | 166.67     | 2            | 7.1                   | 9.1
1          | ACP  | 166.67     | 256          | 28.3                  | 65.6
1          | HP   | 142.86     | 2            | 9.9                   | 7.8
1          | HP   | 142.86     | 256          | 45                    | 71.3
1          | HP   | 166.67     | 2            | 10.6                  | 8.3
1          | HP   | 166.67     | 256          | 40.8                  | 69.3
2          | ACP  | 142.86     | 2            | 37.3                  | 32.8
2          | ACP  | 142.86     | 256          | 38                    | 87.8
2          | ACP  | 166.67     | 2            | 39.4                  | 34.8
2          | ACP  | 166.67     | 256          | 40.1                  | 89.1
2          | HP   | 142.86     | 2            | 50.7                  | 38.58
2          | HP   | 142.86     | 256          | 52.2                  | 92.8
2          | HP   | 166.67     | 2            | 54.3                  | 41.1
2          | HP   | 166.67     | 256          | 57.5                  | 95.8

 

 

To summarize my findings:

 

- On ACP, writes are always slower than reads.

- On HP, small writes are faster than small reads, but large reads are faster than large writes.

- In the first version of the IP, latency for small reads and writes is much smaller than in the second version. I suppose this can be explained by the kind of optimizations Vivado HLS can perform when fusing all the memcpy()'s into a single loop, but I am at a loss to understand what exactly is going on.

- I would have expected burst latency, measured in FPGA cycles, to be roughly proportional to FPGA frequency, as most of it should come from external factors whose duration is fixed in wall-clock time. That does not seem to be fully the case (although we do observe higher latencies at higher frequencies).

 

I guess most of the phenomena I observe have a fairly simple explanation. I would be most grateful if someone could shed some light on this.

 

PS: I am using Vivado HLS through SDSoC. The data motion network is synthesized at the same frequency as the IP, although I wouldn't expect it to play a major role, since the IPs are connected directly to the ACP or HP port.

 

Thanks in advance,

 

Gaël Deest

 

 
