Participant stefanoribes
Registered: ‎07-25-2016

Post-addition of MAC operation not mapped to DSP (fixed point operands)

Hi everyone,

I'm unsuccessfully trying to map a multiply-accumulate (MAC) operation onto a DSP48E block in Vivado HLS 2018.2 (targeting a ZCU104 board). In detail, I'm unable to map the post-adder operation to the DSPs; it is always mapped to LUTs instead. I believe all the operands in my code follow this checklist, which is why I'd like some extra advice in order to achieve my goal.

The following code snippet illustrates my problem:

#include "ap_fixed.h"
#include "hls_stream.h"

typedef ap_fixed<16, 7, AP_RND_ZERO, AP_SAT_SYM> DataType;
typedef ap_fixed<18, 9, AP_RND_ZERO, AP_SAT_SYM> AccumType;

template <int NumIter, int VectLength>
void MAC(hls::stream<DataType> &x1,
         hls::stream<DataType> &x2,
         hls::stream<DataType> &w,
         hls::stream<DataType> &y1,
         hls::stream<DataType> &y2) {
  AccumType y1_mac = 0;
  AccumType y2_mac = 0;

  for (int i = 0; i < NumIter; ++i) {
    for (int j = 0; j < VectLength; ++j) {
#pragma HLS PIPELINE II=1
      if (j == 0) {
        y1_mac = 0;
        y2_mac = 0;
      }
      auto w_tmp = w.read();
      auto mac1 = x1.read() * w_tmp;
      auto mac2 = x2.read() * w_tmp;
      mac1 += y1_mac;
      mac2 += y2_mac;
#pragma HLS RESOURCE variable=mac1 core=DSP48
#pragma HLS RESOURCE variable=mac2 core=DSP48
      y1_mac = mac1;
      y2_mac = mac2;
      if (j == VectLength - 1) {
        y1.write(y1_mac);
        y2.write(y2_mac);
      }
    }
  }
}

I've already tried several approaches, but all of them failed (only the multiplication was mapped to the DSP, not the post-addition):

  • different accumulator bit widths, both narrower (16 bits) and wider (32 bits) (I still need those saturation and quantization modes, though);
  • avoiding the intermediate variables (mac1 and mac2) and the resource pragmas;
  • setting the output streams to AccumType (even though I'd prefer not to go with this solution, which doesn't work anyway);
  • this solution using a "Register" class.

Is there any coding guideline I haven't found/followed yet? Why do I only get the multiplication and not the full mac operation? Do the saturation and quantization modes prevent the mapping? Maybe the last if-statement is another reason why the mapping is not happening?

Any help or tip will be highly appreciated,

Best,

Stefano


Accepted Solutions
Contributor
Registered: ‎03-31-2017

Re: Post-addition of MAC operation not mapped to DSP (fixed point operands)

I think it has to do with the rounding operation. AP_TRN seems to work:

//typedef ap_fixed<16, 7, AP_RND_ZERO, AP_SAT_SYM> DataType;
//typedef ap_fixed<18, 9, AP_RND_ZERO, AP_SAT_SYM> AccumType;

typedef ap_fixed<16, 7, AP_TRN, AP_SAT_SYM> DataType;
typedef ap_fixed<18, 9, AP_TRN, AP_SAT_SYM> AccumType;

template <int NumIter, int VectLength>
void MAC(hls::stream<DataType> &x1,
         hls::stream<DataType> &x2,
         hls::stream<DataType> &w,
         hls::stream<AccumType> &y1,
         hls::stream<AccumType> &y2) {

  AccumType y1_mac;
  AccumType y2_mac;

  iLoop: for (int i = 0; i < NumIter; ++i) {
    jLoop: for (int j = 0; j < VectLength; ++j) {
#pragma HLS PIPELINE II=1
      auto w_tmp = w.read();
      auto x1_tmp = x1.read();
      auto x2_tmp = x2.read();
      if (j == 0) {
        y1_mac = 0;
        y2_mac = 0;
      }
      y1_mac += (x1_tmp * w_tmp);
      y2_mac += (x2_tmp * w_tmp);
    }
    y1.write(y1_mac);
    y2.write(y2_mac);
  }
}
7 Replies
Moderator
Registered: ‎05-27-2018

Re: Post-addition of MAC operation not mapped to DSP (fixed point operands)

Hi @stefanoribes ,

Can you provide your source code, including the .cpp and .h files?

Wen

Participant stefanoribes
Registered: ‎07-25-2016

Re: Post-addition of MAC operation not mapped to DSP (fixed point operands)

Hi,

There's not much else to add: the algorithm MACs VectLength 'new' elements, NumIter times. The if-statements let the loops flatten perfectly and generate a pipeline with II=1 (which is exactly what I wanted to achieve here; no concerns on this).

A top function could be as simple as the following:

void MAC_top(hls::stream<DataType> &x1,
             hls::stream<DataType> &x2,
             hls::stream<DataType> &w,
             hls::stream<DataType> &y1,
             hls::stream<DataType> &y2) {
  MAC<16, 512>(x1, x2, w, y1, y2);
}
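
For checking results in C simulation, the computation described above (VectLength products accumulated, repeated NumIter times, one result per outer iteration) could be sketched as a plain C++ reference model. This is only an illustration under stated assumptions: it uses double instead of ap_fixed, so it ignores quantization and saturation entirely, and the names MacResult and MacReference are made up for this sketch and are not part of the original project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type for this sketch: one accumulated value per
// outer iteration, for each of the two outputs.
struct MacResult {
  std::vector<double> y1;
  std::vector<double> y2;
};

// Floating-point reference for the MAC kernel: for each of num_iter outer
// iterations, accumulate vect_length products x1*w and x2*w.
// Inputs are flat arrays of num_iter * vect_length elements, in the order
// they would be pushed into the streams.
MacResult MacReference(const std::vector<double> &x1,
                       const std::vector<double> &x2,
                       const std::vector<double> &w,
                       std::size_t num_iter, std::size_t vect_length) {
  const std::size_t total = num_iter * vect_length;
  assert(x1.size() == total && x2.size() == total && w.size() == total);

  MacResult out;
  for (std::size_t i = 0; i < num_iter; ++i) {
    double acc1 = 0.0;  // mirrors the accumulator reset at j == 0
    double acc2 = 0.0;
    for (std::size_t j = 0; j < vect_length; ++j) {
      const std::size_t k = i * vect_length + j;
      acc1 += x1[k] * w[k];
      acc2 += x2[k] * w[k];
    }
    out.y1.push_back(acc1);  // mirrors the write at j == VectLength - 1
    out.y2.push_back(acc2);
  }
  return out;
}
```

A testbench could compare each value read from y1 and y2 against this model, allowing a tolerance that accounts for the fixed-point rounding and saturation the model leaves out.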

(shall I zip a small HLS project with the two files?)

BR,

Stefano

Xilinx Employee
Registered: ‎09-04-2017

Re: Post-addition of MAC operation not mapped to DSP (fixed point operands)

That's correct. The issue seems to be with the rounding operation. AP_RND, used in place of AP_RND_ZERO, also seems to work.
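
One possible reading of why AP_TRN and AP_RND map while AP_RND_ZERO does not (this is an assumption, not something confirmed in this thread or checked against the tool) is that truncation needs no extra logic and AP_RND only adds a constant half-LSB before truncating, whereas AP_RND_ZERO needs a correction that depends on the sign of the value being rounded. The plain C++ integer model below sketches the three modes as UG902 describes them. It does not use ap_fixed, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Floor division by 2^shift: drops `shift` low bits, rounding toward
// minus infinity. Written with / and % so it does not rely on how >>
// treats negative values.
int64_t FloorShift(int64_t raw, int shift) {
  assert(shift >= 0 && shift < 62);
  const int64_t d = int64_t{1} << shift;
  int64_t q = raw / d;
  if (raw % d < 0) --q;
  return q;
}

// AP_TRN: truncate toward minus infinity.
int64_t QuantTrn(int64_t raw, int shift) { return FloorShift(raw, shift); }

// AP_RND: round to nearest, ties toward plus infinity.
// Only a constant (half an output LSB) is added before truncating.
int64_t QuantRnd(int64_t raw, int shift) {
  assert(shift >= 1);
  const int64_t half = int64_t{1} << (shift - 1);
  return FloorShift(raw + half, shift);
}

// AP_RND_ZERO: round to nearest, ties toward zero.
// The added offset depends on the sign of the value being rounded.
int64_t QuantRndZero(int64_t raw, int shift) {
  assert(shift >= 1);
  const int64_t half = int64_t{1} << (shift - 1);
  const int64_t offset = (raw >= 0) ? half - 1 : half;
  return FloorShift(raw + offset, shift);
}
```

Here raw is the integer representation of the fixed-point value and shift is the number of fractional bits being dropped. For example, 1.25 with two fractional bits is raw = 5. Dropping one bit gives results in units of 0.5: AP_TRN yields 1.0 and -1.5 for ±1.25, AP_RND yields 1.5 and -1.0, and AP_RND_ZERO yields 1.0 and -1.0.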

 

Thanks,

Nithin

Moderator
Registered: ‎11-21-2018

Re: Post-addition of MAC operation not mapped to DSP (fixed point operands)

Hi @stefanoribes 

If your question is answered or your issue is solved, please mark the response that helped as the solution (click the "Accept as solution" button below the reply).

 

If this is not solved/answered, please reply in the topic giving more information on your current status.

 

Thanks and Regards,

Aoife
Product Application Engineer - Xilinx Technical Support EMEA
Participant stefanoribes
Registered: ‎07-25-2016

Re: Post-addition of MAC operation not mapped to DSP (fixed point operands)

Hi all,

Thank you all for the suggestions, in particular @p27803 and @nithink: it was indeed the rounding. I would have preferred not to touch the rounding mode because of the algorithm I'm implementing, but it really seems to be the only way to fully exploit the DSPs. Thanks again.

BR,

Stefano

Moderator
Registered: ‎05-27-2018

Re: Post-addition of MAC operation not mapped to DSP (fixed point operands)

Hi @stefanoribes ,

Happy to see that a workaround was found. When I dug into UG902, I found a tip saying:

    Quantization and overflow modes that do more than the default behavior of standard hardware arithmetic (wrap and truncate) result in operators with more associated hardware. It costs logic (LUTs) to implement the more advanced modes, such as round to minus infinity or saturate symmetrically.

In addition, the following example shows that the typedef statement for the accumulation variable uses the AP_TRN quantization mode. I guess the accumulation uses truncation mode in order to consume fewer resources.

#include "ap_fixed.h"

typedef ap_ufixed<10, 8, AP_RND, AP_SAT> din1_t;
typedef ap_fixed<6, 3, AP_RND, AP_WRAP> din2_t;
typedef ap_fixed<22, 17, AP_TRN, AP_SAT> dint_t;
typedef ap_fixed<36, 30> dout_t;

dout_t cpp_ap_fixed(din1_t d_in1, din2_t d_in2) {
  static dint_t sum;
  sum += d_in1;
  return sum * d_in2;
}
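
To make the overflow half of the quoted tip concrete, here is a minimal sketch in plain C++ (no ap_fixed; the function names are made up for illustration) of what the default AP_WRAP does compared with AP_SAT_SYM on the raw integer value of a signed type. Wrapping simply keeps the low bits, which is what an adder does anyway. Symmetric saturation needs range comparisons and a clamp on top of the addition, which is the extra logic the tip refers to. How the tool actually implements this is not shown in the thread, so treat the code as a behavioral model only.

```cpp
#include <cassert>
#include <cstdint>

// AP_WRAP on a signed `width`-bit value: keep the low `width` bits and
// reinterpret them as two's complement.
int64_t OverflowWrap(int64_t v, int width) {
  assert(width >= 2 && width <= 32);
  const int64_t modulus = int64_t{1} << width;
  int64_t r = v % modulus;             // in (-modulus, modulus)
  if (r < 0) r += modulus;             // now in [0, modulus)
  if (r >= modulus / 2) r -= modulus;  // upper half maps to negatives
  return r;
}

// AP_SAT_SYM on a signed `width`-bit value: clamp to the symmetric range
// [-(2^(width-1) - 1), 2^(width-1) - 1].
int64_t OverflowSatSym(int64_t v, int width) {
  assert(width >= 2 && width <= 32);
  const int64_t max = (int64_t{1} << (width - 1)) - 1;
  if (v > max) return max;
  if (v < -max) return -max;
  return v;
}
```

For an 18-bit type such as the AccumType discussed in this thread, a raw value of 131072 (one past the largest positive value) wraps to -131072 but saturates to 131071, and a raw value of -131072 saturates to -131071.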

 

Thanks,

Wen

 

 
