Observer varun.nagpal

Parallel memcpy

Is it possible to overlap two memory copies between the FPGA and DDR if they are done on two separate m_axi (HP) ports?

It seems like I need task parallelism here. I tried putting the DATAFLOW directive at the beginning of my top-level function, but it had no effect.

#include <string.h>  // memcpy
#include "ap_int.h"

#define UNROLL_FACTOR 40
#define DATA_SZ (UNROLL_FACTOR * 100)

void example(volatile ap_uint<16>* a, volatile ap_uint<16>* b) {
  const int unroll_factor = UNROLL_FACTOR;
  // depth should cover the full number of elements transferred per port
  const int data_sz = DATA_SZ;

#pragma HLS INTERFACE m_axi depth=data_sz port=a bundle=hp0
#pragma HLS INTERFACE m_axi depth=data_sz port=b bundle=hp1
#pragma HLS INTERFACE s_axilite port=return

  // Ports a and b are assigned to separate AXI4 master interfaces (hp0, hp1)
  int i;
  ap_uint<16> buffa[DATA_SZ];
#pragma HLS ARRAY_PARTITION variable=buffa cyclic factor=40 dim=1
  ap_uint<16> buffb[DATA_SZ];
#pragma HLS ARRAY_PARTITION variable=buffb cyclic factor=40 dim=1

  // memcpy creates a burst access to memory
  memcpy(buffa, (const ap_uint<16>*)a, DATA_SZ * sizeof(ap_uint<16>));
  memcpy(buffb, (const ap_uint<16>*)b, DATA_SZ * sizeof(ap_uint<16>));

  for (i = 0; i < DATA_SZ; i++) {
#pragma HLS PIPELINE
#pragma HLS UNROLL factor=unroll_factor
    buffa[i] = buffa[i] * 100;
    buffb[i] = buffb[i] * 100;
  }

  memcpy((ap_uint<16>*)a, buffa, DATA_SZ * sizeof(ap_uint<16>));
  memcpy((ap_uint<16>*)b, buffb, DATA_SZ * sizeof(ap_uint<16>));
}

[Attachment: memcpy_parallel.png]

Observer varun.nagpal

Re: Parallel memcpy

OK, so I figured out how to do it. Below is the modified code. I am now able to halve the overall design latency, from 16k cycles to 8k.

#include <string.h>  // memcpy
#include "ap_int.h"

#define UNROLL_FACTOR 40
#define DATA_SZ (UNROLL_FACTOR * 100)

void example(volatile ap_uint<16>* streamA, volatile ap_uint<16>* streamB) {
#pragma HLS DATAFLOW
  const int unroll_factor = UNROLL_FACTOR;
  // depth should cover the full number of elements transferred per port
  const int data_sz = DATA_SZ;

#pragma HLS INTERFACE m_axi depth=data_sz port=streamA bundle=hp0 offset=slave
#pragma HLS INTERFACE m_axi depth=data_sz port=streamB bundle=hp2 offset=slave
#pragma HLS INTERFACE s_axilite port=return

  int i;
  ap_uint<16> buffA[DATA_SZ];
#pragma HLS ARRAY_PARTITION variable=buffA cyclic factor=40 dim=1
  ap_uint<16> buffB[DATA_SZ];
#pragma HLS ARRAY_PARTITION variable=buffB cyclic factor=40 dim=1

  // First independent region: burst in, scale, burst out on hp0
  {
    memcpy(buffA, (const ap_uint<16>*)streamA, DATA_SZ * sizeof(ap_uint<16>));

  stream1_process_loop:
    for (i = 0; i < DATA_SZ; i++) {
#pragma HLS PIPELINE
#pragma HLS UNROLL factor=unroll_factor
      buffA[i] = buffA[i] * 100;
    }

    memcpy((ap_uint<16>*)streamA, buffA, DATA_SZ * sizeof(ap_uint<16>));
  }

  // Second independent region: burst in, scale, burst out on hp2
  {
    memcpy(buffB, (const ap_uint<16>*)streamB, DATA_SZ * sizeof(ap_uint<16>));

  stream2_process_loop:
    for (i = 0; i < DATA_SZ; i++) {
#pragma HLS PIPELINE
#pragma HLS UNROLL factor=unroll_factor
      buffB[i] = buffB[i] * 100;
    }

    memcpy((ap_uint<16>*)streamB, buffB, DATA_SZ * sizeof(ap_uint<16>));
  }
}

[Attachment: memcpy_parallel_better.png]

[Attachment: memcpy_parallel_better1.png]

The resource usage for FF and LUT has now doubled, which is expected.
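
For reference, the same task-level parallelism is often expressed by moving each read-compute-write region into its own sub-function, so that each call inside the DATAFLOW region becomes a separate process. Below is a minimal sketch of that style under the same assumptions as the code above; the names scale_channel and example_df are mine, not from the original design.

#include <string.h>  // memcpy
#include "ap_int.h"

#define UNROLL_FACTOR 40
#define DATA_SZ (UNROLL_FACTOR * 100)

// One complete burst-in -> scale -> burst-out task on a single m_axi port.
static void scale_channel(volatile ap_uint<16>* port) {
  ap_uint<16> buff[DATA_SZ];
#pragma HLS ARRAY_PARTITION variable=buff cyclic factor=40 dim=1

  memcpy(buff, (const ap_uint<16>*)port, DATA_SZ * sizeof(ap_uint<16>));

scale_loop:
  for (int i = 0; i < DATA_SZ; i++) {
#pragma HLS PIPELINE
#pragma HLS UNROLL factor=40
    buff[i] = buff[i] * 100;
  }

  memcpy((ap_uint<16>*)port, buff, DATA_SZ * sizeof(ap_uint<16>));
}

void example_df(volatile ap_uint<16>* streamA, volatile ap_uint<16>* streamB) {
  const int data_sz = DATA_SZ;
#pragma HLS INTERFACE m_axi depth=data_sz port=streamA bundle=hp0 offset=slave
#pragma HLS INTERFACE m_axi depth=data_sz port=streamB bundle=hp2 offset=slave
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS DATAFLOW
  // Each call becomes its own dataflow process; they can run concurrently
  // because they use different m_axi bundles and different local buffers.
  scale_channel(streamA);
  scale_channel(streamB);
}

Functionally this should match the block-scoped version above; which form the scheduler handles better can depend on the tool version, so compare the synthesis reports.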

Observer jonbho

Re: Parallel memcpy

Thanks for posting the solution yourself; it will be useful to anyone else looking. If this is more than a learning experiment, you might want to consider reading from DDR manually (not using memcpy, but reading individual elements) and multiplying the elements as they arrive. HLS synthesis should be able to generate the same burst-mode reads for that access pattern as it does for memcpy, the multiplication time can overlap with the DDR read time, and you can do away with the many BRAM blocks needed for a fully partitioned array, since reads and writes will be sequential and therefore fine with a regular dual-ported BRAM. Besides saving resources, you may end up with a faster calculation. Of course, make sure you check all of these assumptions as you make the changes!
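
A minimal sketch of that suggestion, using the same top-level signature as the posted code (example_inline is an assumed name, and whether the tool actually infers bursts for this in-place read-modify-write pattern needs to be verified in the synthesis report):

#include "ap_int.h"

#define DATA_SZ (40 * 100)  // same element count as the posted code

void example_inline(volatile ap_uint<16>* streamA, volatile ap_uint<16>* streamB) {
  const int data_sz = DATA_SZ;
#pragma HLS INTERFACE m_axi depth=data_sz port=streamA bundle=hp0 offset=slave
#pragma HLS INTERFACE m_axi depth=data_sz port=streamB bundle=hp2 offset=slave
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS DATAFLOW

scale_a:
  for (int i = 0; i < DATA_SZ; i++) {
#pragma HLS PIPELINE II=1
    // Read one element, scale it, write it back; no local BRAM buffer and
    // no array partitioning. Whether these accesses are merged into AXI
    // bursts or stay as single-beat transfers depends on how the tool
    // handles the read and write to the same port, so check the reports.
    streamA[i] = streamA[i] * 100;
  }

scale_b:
  for (int i = 0; i < DATA_SZ; i++) {
#pragma HLS PIPELINE II=1
    streamB[i] = streamB[i] * 100;
  }
}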
