Registered: ‎09-27-2018

No Kernel write transfer to global memory


I have written a C++ kernel that uses the Xilinx FFT IP core, modelled with the Vivado HLS libraries. It transfers 1024 32-bit samples in and 1024 32-bit samples out, each over a separate AXI master interface. In sw_emu, everything works fine. In hw_emu, the kernel starts and 4 KB is read from global memory via the first AXI interface, but even after running it for a very long time, no write to global memory on the second AXI interface ever occurs. In Vivado HLS, cosim never finishes and produces no output.


My kernel code:


#pragma once
#include "/opt/xilinx-2018/Vivado/2018.2/include/gmp.h" //included due to known workaround for cosimulation of Vivado HLS
#include "ap_fixed.h"
#include "hls_fft.h"
#include <complex>

#include "fft_common.h"


//redefinition as const chars/ints for pragma resolution
const char _NUM_BITS 		  = NUM_BITS; // precision for the output data
const int  _FFT_LOG 		  = FFT_LOG;
const int  _FFT_LENGTH 		  = (1 << FFT_LOG);

static const unsigned ORDERING_OPT = hls::ip_fft::natural_order; //choose hls::ip_fft::bit_reversed_order optionally

//pre-defined by xilinx fft ip core
const char CONFIG_WIDTH			= 16;

#define FORWARD 1  //compare Xilinx PG109, page 15
#define INVERSE 0

#define FFT_SCALING_SCHEDULE 682 // corresponds to the "most conservative" schedule from Xilinx PG109, page 46

const unsigned OUTPUT_MULTIPLIER = 1024; //value chosen due to scaling schedule.

using namespace std;

/* static configuration struct for the Xilinx FFT IP core */
struct config1 : hls::ip_fft::params_t {
    static const unsigned ordering_opt = ORDERING_OPT;
    static const unsigned config_width = CONFIG_WIDTH;
    static const unsigned input_width  = NUM_BITS;
    static const unsigned output_width = NUM_BITS;
    static const unsigned max_nfft     = FFT_LOG;
};

typedef hls::ip_fft::config_t<config1> config_t;
typedef hls::ip_fft::status_t<config1> status_t;

typedef ap_fixed<PRECISION> fft_type_t;

#include <cstdio>
#include <iostream>
#include <cstring>

#define NUM_INPUT_BLOCKS 128

int hann[_FFT_LENGTH]; //initialization left out for post length reduction

int local_buffer[_FFT_LENGTH];

/* prepare fft buffers */
complex<fft_type_t> xn_global[_FFT_LENGTH];
complex<fft_type_t> xk_global[_FFT_LENGTH];
config_t power_fft_config;

void fft_core(complex<fft_type_t> xn[_FFT_LENGTH],
              complex<fft_type_t> xk[_FFT_LENGTH],
              volatile int* a,
              int top_index){
#pragma HLS INTERFACE ap_fifo port=xn
#pragma HLS INTERFACE ap_fifo port=xk
#pragma HLS INTERFACE ap_fifo port=power_fft_config

    CONVERT_TO_COMPLEX_INPUT: for(int ii = 0; _FFT_LENGTH > ii; ++ii){
        fft_type_t x;
        x.range() = a[top_index + ii];
        //fft_type_t hann_coeff;
        //hann_coeff.range() = hann[ii];
        //x *= hann_coeff;
    }

    status_t power_fft_status;
    // apply actual fft
    hls::fft<config1>(xn, xk, &power_fft_status, &power_fft_config);
}


void fft_main(int * a,
              int * b,
              int num_blocks,
              int num_overlap_samples,
              long int threshold)
{
    // bundle all parameters and memory pointers into an AXI slave interface (for HW synthesis / production version)
    #pragma HLS INTERFACE m_axi depth=1024 port=a offset=slave bundle=gmem0
    #pragma HLS INTERFACE m_axi depth=1024 port=b offset=slave bundle=gmem1

    #pragma HLS INTERFACE ap_fifo port=xk_global
    #pragma HLS INTERFACE ap_fifo port=xn_global

    #pragma HLS INTERFACE s_axilite port=a bundle=control
    #pragma HLS INTERFACE s_axilite port=b bundle=control
    #pragma HLS INTERFACE s_axilite port=num_blocks bundle=control
    #pragma HLS INTERFACE s_axilite port=num_overlap_samples bundle=control
    #pragma HLS INTERFACE s_axilite port=threshold bundle=control
    #pragma HLS INTERFACE s_axilite port=return bundle=control

    long int accumulator = 0;

    unsigned int top_index = 0;
    //create num_input_blocks overlapped blocks and apply an fft to each
    WINDOWED_FFTS: for (unsigned int jj = 0; num_blocks > jj; ++jj){

        fft_core(xn_global, xk_global, a, top_index);
        //from fft results (I-Q), create a power spectrum
        CONVERT_TO_POWER: for(int kk = 0; FFT_LENGTH > kk; ++kk){
            fft_type_t real = xk_global[kk].real() * OUTPUT_MULTIPLIER; // shifting necessary due to scaling policy of fft core; squaring mostly yields 0 as result
            fft_type_t imag = xk_global[kk].imag() * OUTPUT_MULTIPLIER; // shifting necessary due to scaling policy of fft core
            fft_type_t power =  real*real + imag*imag;

            local_buffer[kk] = power.range();
            //b[kk + jj * _FFT_LENGTH] = power.range();
        }
        top_index += _FFT_LENGTH - num_overlap_samples;
        memcpy(&b[jj * _FFT_LENGTH], local_buffer, sizeof(int) * _FFT_LENGTH);
    }
}
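
A note on the constants used in the kernel: the following is a minimal sketch, assuming the layout described in PG109 in which each 2-bit field of the scaling schedule gives the right shift of one stage (or stage pair) for the Radix-4 and Pipelined Streaming architectures, lowest field first. Under that assumption, 682 (binary 10 10 10 10 10) shifts by 2 bits in each of the five fields of a 1024-point transform, 10 bits in total, which would be consistent with OUTPUT_MULTIPLIER = 1024. The helper name total_scaling_shift is hypothetical and not part of the HLS FFT library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: sums the per-stage right shifts of a scaling schedule.
// Assumption: one 2-bit field per stage (or stage pair), lowest field first,
// as for the Radix-4 and Pipelined Streaming architectures.
// num_fields would be 5 for a 1024-point transform under that assumption.
inline unsigned total_scaling_shift(std::uint32_t schedule, unsigned num_fields) {
    assert(num_fields <= 16);  // 16 fields of 2 bits fill a 32-bit word
    unsigned shift = 0;
    for (unsigned i = 0; i < num_fields; ++i) {
        shift += (schedule >> (2 * i)) & 0x3u;  // extract field i
    }
    return shift;
}
```

Calling total_scaling_shift(682, 5) gives the total number of bits by which the output is scaled down; shifting 1 left by that amount gives the factor needed to undo the scaling.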



When I call this kernel from the host with num_blocks=1 and num_overlap_samples=0, i.e. 1024 samples in (time domain) and 1024 samples out (frequency domain), the kernel never writes to global memory. Compiling with TARGETS=hw succeeds, but the program gets stuck loading the kernel, so I guess something must be wrong with the kernel code.
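
As a rough cross-check of the transfer sizes described above, here is a small sketch of how many samples the WINDOWED_FFTS loop would read and write, derived only from the way top_index and the memcpy destination advance in the posted code. The helper names input_samples_needed and output_samples_written are hypothetical and not part of the kernel. For num_blocks=1 and num_overlap_samples=0 both come out at 1024 samples, i.e. 4 KB of 32-bit words in each direction, which matches the depth=1024 values on the two m_axi ports.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of input samples touched when top_index
// advances by (fft_length - num_overlap_samples) after every block.
inline std::size_t input_samples_needed(std::size_t fft_length,
                                        std::size_t num_blocks,
                                        std::size_t num_overlap_samples) {
    assert(num_overlap_samples < fft_length);  // otherwise the window never advances
    if (num_blocks == 0) {
        return 0;
    }
    return (num_blocks - 1) * (fft_length - num_overlap_samples) + fft_length;
}

// Hypothetical helper: number of output samples written, one full
// spectrum of fft_length words per block.
inline std::size_t output_samples_written(std::size_t fft_length,
                                          std::size_t num_blocks) {
    return num_blocks * fft_length;
}
```

With more than one block, both numbers grow beyond 1024, so the host buffers (and, for cosim, presumably the depth settings) would need to grow with them.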


Thanks for your help


Toolchain: Vivado HLS 2018.2, SDAccel 2018.2

5 Replies
Xilinx Employee
Registered: ‎03-24-2010

Re: No Kernel write transfer to global memory

Please use the waveform viewer to debug your kernel, in either SDAccel or HLS.

If the kernel does not complete successfully, focus on kernel development first to make sure it functions correctly. At this stage, RTL cosim in HLS gives quicker iterations.

Registered: ‎09-27-2018

Re: No Kernel write transfer to global memory

Hi brucey,

Thanks for your answer. As I wrote in my first post, I am already trying to use HLS RTL cosim to get the kernel running, but it never makes progress. From the C/C++ side, I don't know what to change to get at least the cosim to finish with a result.



Registered: ‎09-27-2018

Re: No Kernel write transfer to global memory


So far there is no solution for my problem. The post has about 80 views. Is it possible to get more support from Xilinx here?


Edit: The problem also exists in the SDAccel GUI. After starting the hw_emu build, the compiler never returns.



Registered: ‎11-04-2010

Re: No Kernel write transfer to global memory

In SDAccel hw_emu, HLS is called to convert the C/C++ code to RTL, and the resulting design is then simulated.

You should first confirm that the kernel works properly in HLS.
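
One way to make that check concrete is a small testbench helper that detects whether the kernel wrote its output at all. The following is a minimal sketch, not a validated testbench: the function-pointer type mirrors the fft_main signature from the original post, while check_kernel_writes and the sentinel convention are hypothetical. It only detects a missing write and does not validate the spectrum.

```cpp
#include <cstddef>
#include <vector>

// Function-pointer type matching the fft_main signature from the original post.
typedef void (*kernel_fn)(int*, int*, int, int, long int);

// Hypothetical check: returns 0 if the kernel overwrote every output word,
// 1 if any word still holds the sentinel value.
// Assumption: a genuine result is very unlikely to equal the sentinel.
inline int check_kernel_writes(kernel_fn kernel, std::size_t fft_length) {
    const int sentinel = 0x5A5A5A5A;
    std::vector<int> in(fft_length);
    std::vector<int> out(fft_length, sentinel);
    for (std::size_t i = 0; i < fft_length; ++i) {
        in[i] = static_cast<int>(i % 64);  // arbitrary small test pattern
    }
    // One block, no overlap, as in the failing configuration.
    kernel(in.data(), out.data(), 1, 0, 0L);
    for (std::size_t i = 0; i < fft_length; ++i) {
        if (out[i] == sentinel) {
            return 1;  // this word was never written
        }
    }
    return 0;
}
```

In a testbench, main would return check_kernel_writes(fft_main, 1024), since Vivado HLS treats a non-zero return value from main as a failed simulation.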

Registered: ‎09-27-2018

Re: No Kernel write transfer to global memory



My statement regarding the SDAccel GUI was only an additional data point, to see whether it works in a slightly different environment (I know SDAccel hw_emu calls the HLS tools). The code works in C simulation in HLS 2018.2, but cosimulation fails to complete (it runs forever without output). There is still no progress here. Is it possible that I have encountered a bug, since no one seems to be able to give me an answer about the code?
