cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
bigbrett
Adventurer
Adventurer
5,642 Views
Registered: ‎07-08-2016

Can't find any directives to reduce LUT utilization

Hi there,

 

I'm trying to implement 2048-bit RSA in HLS, using the Tenca-koc multiple-word radix 2 montgomery multiplication algorithm, shown below:

mwr2mm.png

 

I wrote this algorithm in HLS and it works in simulation and in C/RTL cosim. My algorithm is here:

 

#define MWR2MM_m 2048  // Bit-length of operands
#define MWR2MM_w 8     // word size
#define MWR2MM_e 257   // number of words per operand

// Type definitions
typedef ap_uint<1> bit_t;             // 1-bit scan
typedef ap_uint< MWR2MM_w > word_t;     // 8-bit words
typedef ap_uint< MWR2MM_m > rsaSize_t;  // m-bit operand size
/* * Multiple-word radix 2 montgomery multiplication using carry-propagate adder */ void mwr2mm_cpa(rsaSize_t X, rsaSize_t Yin, rsaSize_t Min, rsaSize_t* out) { // extend operands to 2 extra words of 0 ap_uint<MWR2MM_m + 2*MWR2MM_w> Y = Yin; ap_uint<MWR2MM_m + 2*MWR2MM_w> M = Min; ap_uint<MWR2MM_m + 2*MWR2MM_w> S = 0; ap_uint<2> C = 0; // two carry bits bit_t qi = 0; // an intermediate result bit // Store concatenations in a temporary variable to eliminate HLS compiler warnings about shift count ap_uint<MWR2MM_w> temp_concat=0; // scan X bit-by bit for (int i=0; i<MWR2MM_m; i++) { qi = (X[i]*Y[0]) xor S[0]; // C gets top two bits of temp_concat, j'th word of S gets bottom 8 bits of temp_concat temp_concat = X[i]*Y.range(MWR2MM_w-1,0) + qi*M.range(MWR2MM_w-1,0) + S.range(MWR2MM_w-1,0); C = temp_concat.range(9,8); S.range(MWR2MM_w-1,0) = temp_concat.range(7,0); // scan Y and M word-by word, for each bit of X for (int j=1; j<=MWR2MM_e; j++) { temp_concat = C + X[i]*Y.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + qi*M.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j); C = temp_concat.range(9,8); S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) = temp_concat.range(7,0); S.range(MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)) = (S.bit(MWR2MM_w*j), S.range( MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)+1)); } S.range(S.length()-1, S.length()-MWR2MM_w) = 0; C=0; } // if final partial sum is greater than the modulus, bring it back to proper range if (S >= M) S -= M; *out = S; }

Unfortunately, the LUT utilization is huge (125%). This is problematic because I need to be able to fit multiple of these blocks in hardware as axi4-lite slaves.

 

Could someone please provide a few suggestions as to how I can reduce the LUT utilization? I've already tried various different word lengths, switching the top level inputs to arrays so they are BRAM, and applying all sorts of directives (compartmentalizing into multiple inline functions, dataflow architecture, resource limits on lshr, etc.) but nothing really drives the LUT utilization down in a meaningful way.  Is there a glaringly obvious way that I could reduce the utilization that is apparent to anyone?

 

In particular, I've seen papers on implementations of the mwr2mm algorithm that only use one DSP block and one BRAM (paper here). Is this even worth attempting to implement using HLS? Or is there no way that I can actually control the resources that the algorithm is mapped to without describing it in HDL?

 

More details on the algorithm can be found here: https://ece.gmu.edu/~kgaj/publications/journals/GWU_GMU_TOC_2011.pdf

 

Thanks for the help.

0 Kudos
12 Replies
hpoetzl
Voyager
Voyager
5,618 Views
Registered: ‎06-24-2013

As far as I can tell, the algorithm you implemented processes the full 2048bit in one go, while the Single DSP+BRAM version takes 123940864 cycles (278ms @447 MHz) to crunch through the data in 17bit steps.

 

So you probably need to adapt the algorithm to process smaller chunks in a pipelined way to reduce the amount of resources.

 

Is it worth the eford?

Given that there is an optimized and verified design for the DSP48E1+BRAM presented in the referenced paper, the answer is probably no.

 

Hope this helps,

Herbert

-------------- Yes, I do this for fun!
0 Kudos
bigbrett
Adventurer
Adventurer
5,613 Views
Registered: ‎07-08-2016

Thanks for the response @hpoetzl.

 

"As far as I can tell, the algorithm you implemented processes the full 2048bit in one go"

Well, in one function call, yes....but not in parallel if thats what you mean? It processes the multiplier bit-by bit, and the multiplicand/modulus word-by-word. I've tried making functions that handle these subroutines, and pipelining them but I still have through-the-roof LUT utilization.

 

"...process smaller chunks in a pipelined way to reduce the amount of resources."

Yes, the problem is that pipelining only increases FF utilization, and has no effect on LUTs. The LUTS are primarily implemented for the logical bitwise operations such as shrl and and/or gates. Even when X, Y, M, and S are all stored in BRAM (i.e. no large 2048-bit vectors) the LUT utilization is still over 100%

 

I would have thought that a tool like HLS would excel in a case like this (mapping a high-level algorithmic description into hardware), but I am having trouble getting the algorithm to line up with the hardware.

0 Kudos
hpoetzl
Voyager
Voyager
5,582 Views
Registered: ‎06-24-2013

The code you've provided has errors (first the #defined MWR1MM_e is not used anywhere, but MWR2MM_e is, then the temp_concat doesn't have the necessary bit-length to hold the carry), but anyway, here are a few ideas how to improve the HLS design:

 

  • change the arguments to arrays of word size (this will allow to put it in BRAMs)
  • add pragmas to make input and output a memory interface (ap_memory or bram)
  • limit the number of operation instances
  • increase the allowed latency
  • add pragmas to force the math into DSP slices

Hope it helps,

Herbert

-------------- Yes, I do this for fun!
u4223374
Advisor
Advisor
5,569 Views
Registered: ‎04-26-2015

What does the synthesis report say about where the resources are being used?

 

My guess is that it's all the variable-indexed ranges referenced in your "j" loop. Just unrolling that loop could well cut your resources and also boost speed.

bigbrett
Adventurer
Adventurer
5,542 Views
Registered: ‎07-08-2016

@hpoetzl good catch, but unfortunately those errors are artifacts of me copy-pasting and formatting into this post. Not present in my actual design.

 

I've already tried your suggestions of 1) changing arguments of arrays of word size, and 2) adding pragmas to make i/o ap_memory. This actually was the most help, as it brought the utilization down to ~90% LUTs. However, I'm still trying to get down to around 30-50%.

 

Regarding the other suggestions:

 

  • Limit the number of operation instances
    • I've already tried this, (limiting lshrl operator to 1....thought i'd try the extreme case first) and it only reduced LUT util by ~10% and pushed latency north of 20 billion clock cycles :)
  • increase allowed latency:
    • I didn't think of this, I'll give it a shot and report back
  • Add pragmas to force math into DSP slices
    • I couldn't figure out a way to do this. Every time I tried I got errors saying it wasn't possible. I will try rearranging the operands into temporary products and try again.
0 Kudos
bigbrett
Adventurer
Adventurer
5,541 Views
Registered: ‎07-08-2016

@u4223374 hmmmm ok I will try unrolling it. I didn't even try that before, since I assumed unrolling a loop would always ADD resources. And also, I thought that you couldn't unroll loops that had dependencies on prior loop iterations (which the partial sum variable S does....right?).

 

Nonetheless I'll try.

0 Kudos
bigbrett
Adventurer
Adventurer
5,528 Views
Registered: ‎07-08-2016

@u4223374 unrolling the inner loop didn't actually have any effect on area.

0 Kudos
u4223374
Advisor
Advisor
5,525 Views
Registered: ‎04-26-2015

@bigbrett Good point, I didn't notice the loop dependence.

 

You can still unroll it, and it might save a lot of resources because it won't be trying to build big variable-index multiplexers to handle your range() calls. On the other hand, the state machine it has to build to handle the dependency might make it bigger.

 

I've now had a better look at this and I can see two potential approaches.

One is to push Y/M/S into RAMs. On each cycle of the "j" loop you're essentially taking an 8-bit section of Y, and 8-bit section of M, and an 8-bit section of S. Then you update the same 8-bit section of S and the previous 8-bit section of S. Building this as a 256*8 RAM would make all that tricky indexing essentially free, although potentially a bit slow (probably a few cycles per iteration of the "j" loop). The comparison after the loop will also require more work this way, but it'll probably be much more efficient.

The second idea is that on each cycle you update index "j" and "j-1" (where each index refers to an 8-bit section). After that, you never touch the j-1 section again. This would fit pretty well with a shift register layout, which eliminates all those nasty variable-index range calls. Something along these lines to replace the "j" loop:

ap_uint<8> StmpNew = 0, StmpOld = temp_concat.range(7,0);

for (int j = 1; j <= MWR2MM_e; j++) {
    temp_concat = C + X[i] * Y.range(MWR2MM_w * 2 - 1, MWR2MM_w) + qi * M.range(MWR2MM_w * 2 - 1, MWR2MM_w) + StmpNew;
    C = temp_concat.range(9,8);
    StmpNew = temp_concat.range(7,0);

    StmpOld = (StmpNew.bit(0), StmpOld.range(7,1));

    // Now update all the arrays.

    // Shift S down to make space for StmpOld, and shift it down.
    S >>= MWR2MM_w;
    S.range(MWR2MM_m + 2*MWR2MM_w - 1,MWR2MM_m + MWR2MM_w) = StmpOld;

    // StmpOld for the next iteration is StmpNew from this iteration.
    StmpOld = StmpNew;

    // Shift Y and M down to expose a new section of them.
    Y >>= MWR2MM_w;
    M >>= MWR2MM_w;
}
bigbrett
Adventurer
Adventurer
5,468 Views
Registered: ‎07-08-2016

OK so I've made some good progress on this. I actually took a modified version of your advice @u4223374 and moved as much into BRAM as possible. This, combined with a new version of the algorithm that (supposedly, were I coding in HDL) uses Carry-save adders (CSA) instead of CPA. The pseudocode for my new algorithm, as well as the HLS code is as follows:

algo.png

 

void mwr2mm_csa_bram(ap_uint<MWR2MM_m> X,
		     ap_uint<MWR2MM_w> Y[MWR2MM_e+1],
		     ap_uint<MWR2MM_w> M[MWR2MM_e+1],
		     ap_uint<MWR2MM_w> S[MWR2MM_e+1])
{
	// temp variables
	ap_uint<MWR2MM_w> Y0 = Y[0];
	ap_uint<MWR2MM_w> M0 = M[0];

	// Two Carry bits
	ap_uint<2> Ca=0,Cb=0;

	for (int i=0; i<MWR2MM_m; i++)
	{
		(Ca,S[0]) = X[i]*Y0 + S[0];
		if (S[0][0] == 1) // if the 0th bit of the 0th word is 1
		{
			(Cb,S[0]) = S[0] + M0;

			mwr2mm_csa_bram_label0: for (int j=1; j <= MWR2MM_e; j++)
			{
				(Ca, S[j]) = Ca + X[i]*Y[j] + S[j];
				(Cb, S[j]) = Cb + M[j] + S[j];
				S[j-1] = ( S[j][0], S[j-1].range(MWR2MM_w-1,1) );
			}
		}
		else
		{
			mwr2mm_csa_bram_label1: for (int j=1; j<=MWR2MM_e; j++)
			{
				(Ca, S[j]) = Ca + X[i]*Y[j] + S[j];
				S[j-1] = ( S[j][0], S[j-1].range(MWR2MM_w-1,1) );
			}
		}
		S[MWR2MM_e] = 0;
		Ca=0;
		Cb=0;
	}
}

And the top level function, with axi-lite slave interfaces here:

 

void mwr2mm(ap_uint<MWR2MM_m> X,
			ap_uint<MWR2MM_w> Y[MWR2MM_e+1],
			ap_uint<MWR2MM_w> M[MWR2MM_e+1],
			ap_uint<MWR2MM_w> ans[MWR2MM_e+1])
{
	ap_uint<MWR2MM_w> S[MWR2MM_e+1] = {	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
						0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
						0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
						0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
						0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
						0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
						0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
						0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
						0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

        // do all montgomery multiplication steps except final subtraction of modulus
	mwr2mm_csa_bram(X,Y,M,S);

        // TODO now just need to check if S>=M and subtract modulus

        // copy results back to memory.....would this be a good place to do bitwise subtraction? 
	for (int i=0; i<=MWR2MM_e; i++)
		ans[i] = S[i];

}

And this results in the new-and-improved utilization numbers here (!!):

util.png

 

Now, the new problem is performing the check for whether S>=M and then subtracting the modulus.......both values are stored in BRAM arrays......crap!

 

I tried the naive way of just copying the values from BRAM into respective ap_uint<2048> vectors and performing the subtraction, but as expected, this blew up LUT utilization again. Another idea I had was performing bitwise subtraction, however all those reads and writes are sure to cause the design to fail timing, and cause absurd latency.

 

My question is now this: is there a way that HLS could impement this final subtraction on memory efficiently? Bitwise subtraction as I'm writing words back into PS memory over the AXI bus seems to me a potentially viable solution, however I can't think of the best way to properly pipeline this such that latency is reduced. Is there a way that I can perform this final subtraction without failing timing? I have been trying to use the DSP48 blocks to do the subtraction, but can't figure out a way to do that....

 

 

any and all thoughts and advice appreciated. And thanks all for the help so far!

 

 

0 Kudos
u4223374
Advisor
Advisor
4,152 Views
Registered: ‎04-26-2015

@bigbrett Good to hear that you're making progress. I'll have a better look a bit later, but as a first approach you could try copying them into uint2048 shift registers rather than simple registers. That is, rather than copying each RAM element to a different location in the output register, copy each element to the top location and then shift the register down. This eliminates the huge multiplexers that HLS is so fond of building.

bigbrett
Adventurer
Adventurer
4,133 Views
Registered: ‎07-08-2016

@u4223374 Okay, I'll try that and report back. Thanks for the help!

0 Kudos
bigbrett
Adventurer
Adventurer
4,074 Views
Registered: ‎07-08-2016

Hey @u4223374, so I was able to successfully implement what you suggested: copying the two operands into uint2048 shift registers, and then subtracting in one go. This does have a pretty hefty utilization cost (~50% LUT in HLS synthesis), but I'm able to fit it on chip alongside my other accelerators. However, looking forward, I'd like to have a bunch of these blocks in hardware at the same time, so I think I will have to find a better way to do this....

 

Any other ideas to make the final subtraction of the two operands (in BRAM) more resource-efficient? I tried to just do simple binary subtraction bit-by-bit (i.e. looping through the BRAM holding both operands and keeping track of the carry/borrows), but this resulted in unacceptable latency (~20billion cycles!). I imagine using the pipelined adders in the DSP48s could be a viable option? Unfortunately I've been struggling to get the operation to map to the DSPs without errors, so maybe this isn't the way to go...

 

A side note: every IP block generated from the various HLS designs I've tried have all failed timing during the Vivado Implementation stage. However, I'm not worrying about that right now as it is acceptable to just decrease the PL clock speed to 90 MHz (For the time being at least....). I would eventually like to run this at >100Mhz but thats another problem for another day :-)

0 Kudos