cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
fpgabob
Visitor
Visitor
1,556 Views
Registered: ‎05-07-2018

improve performance with loop pipelining

Jump to solution

Hello,

 

For a research project, I would like to compare hash cracking performances between cpu, gpu and fpga. The principle is quite straightforward: I provide as input a hash value and initial data, and I try to exhaustively find an input and nonce that collide to the provided hash, using a loop.

 

I have developed a kernel in OpenCL for the Sha256 hash function but the performances are not very good. I tried to use loop pipelining for the main loop, but the Initiation Interval (II) is  63 which is far from optimal.

 

For information, the SHA256 hash function uses a compression function which is applied 64 times, and the data uses a message scheduling transformation at each round.

 

The bad performance may be due to interdependences and intradependences in the search loop, but I don't see how it can be improved. I looked at the optimized SHA1 example which appears to reach II=1  (https://github.com/Xilinx/SDAccel_Examples/tree/51416734fd694773a2ab4991f027e5c78e09c9a8/security/sha1), but I fail to see how it solves the variable dependencies or use any particular optimization.

 

Here is the kernel code that I use:

#ifndef uint
#define uint unsigned int
#endif

static const uint KEY[64] = {
	0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5,
	0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,
	0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
	0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967,
	0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,
	0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
	0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3,
	0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
};

#define SHR(x,n) ((x & 0xFFFFFFFF) >> n)
#define ROTR(x,n) (SHR(x,n) | (x << (32 - n)))

#define S0(x) (ROTR(x, 7) ^ ROTR(x,18) ^  SHR(x, 3))
#define S1(x) (ROTR(x,17) ^ ROTR(x,19) ^  SHR(x,10))

#define S2(x) (ROTR(x, 2) ^ ROTR(x,13) ^ ROTR(x,22))
#define S3(x) (ROTR(x, 6) ^ ROTR(x,11) ^ ROTR(x,25))

#define F0(x,y,z) ((x & y) | (z & (x | y)))
#define F1(x,y,z) (z ^ (x & (y ^ z)))

inline void sha256_process(uint state[8], uint data[16])
{
	uint temp1, temp2, w;
	uint A, B, C, D, E, F, G, H;
	
	A = 0x6A09E667;
	B = 0xBB67AE85;
	C = 0x3C6EF372;
	D = 0xA54FF53A;
	E = 0x510E527F;
	F = 0x9B05688C;
	G = 0x1F83D9AB;
	H = 0x5BE0CD19;
	
	__attribute__((opencl_unroll_hint))
	for (int i = 0; i < 64; ++i) {

		//message schedule
		if(i < 16) {
			w = data[i];
		} else {
			w = data[(i-16) & 15] + S0(data[(i-15) & 15]) + data[(i-7) & 15] + S1(data[(i-2) & 15]); 
		}
		data[i & 15] = w;

		temp1 = H + S3(E) + F1(E,F,G) + KEY[i] + w;
		temp2 = S2(A) + F0(A,B,C);
		H = G;
		G = F;
		F = E;
		E = D + temp1;
		D = C;
		C = B;
		B = A;
		A = temp1 + temp2;
	}

	state[0] = A+0x6A09E667;
	state[1] = B+0xBB67AE85;
	state[2] = C+0x3C6EF372;
	state[3] = D+0xA54FF53A;
	state[4] = E+0x510E527F;
	state[5] = F+0x9B05688C;
	state[6] = G+0x1F83D9AB;
	state[7] = H+0x5BE0CD19;
}

# define NONCEMAX 0xfffffff
__attribute__((reqd_work_group_size(1, 1, 1))) 
kernel void sha256_kernel(
				global uint * out_data,
				global const uint * in_hash, 
				global const uint * in_data)
{	
	uint state[8] __attribute__((xcl_array_partition(complete, 1)));
	uint data[16] __attribute__((xcl_array_partition(complete, 1)));
	uint ldata[16];
uint hash0= in_hash[0];
uint hash1= in_hash[1]; // local copy __attribute__((xcl_pipeline_loop)) LoopInit: for (int i=0; i<16;i++){ldata[i]= in_data[i];} // loop for each hash candidate __attribute__((xcl_pipeline_loop)) LoopNonce: for (uint nonceSuffix = 0; nonceSuffix != NONCEMAX; nonceSuffix++) { __attribute__((xcl_pipeline_loop)) LoopCopyIn: for (int i=0; i<16;i++){ data[i]= ldata[i]; } // set nonce data[12]=nonceSuffix; sha256_process(state, data); // check hash result! if (state[0]==hash0 && state[1]==hash1) { __attribute__((xcl_pipeline_loop)) LoopCopyData: for (int i=0; i<16; i++){ out_data[i]=ldata[i]; } out_data[12]=nonceSuffix; return; } } return; }

The vivado logs provide this information during hw emulation: 

 

INFO: [SCHED 204-61] Pipelining loop 'LoopNonce'.
WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 1 cycles).
WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 2 cycles).
WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 3 cycles).
WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 4 cycles).
WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 35 cycles).
WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 51 cycles).
WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 59 cycles).
WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 61 cycles).
WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 62 cycles).
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 63, Depth: 63.

Thanks in advance for your help and suggestions!

 

0 Kudos
Reply
1 Solution

Accepted Solutions
seanz
Moderator
Moderator
1,778 Views
Registered: ‎03-27-2012

Hi,

 

I don't think there is a problem to pipeline sha256_process and reach II=1. However, you can't just unroll the entire loop. Because this will try to parallel all logics inside to create a complex combination network in order to achieve latency=1. That's why the estimated clock frequency fall.

 

As you mentioned, another problem is the data dependence in the search loop. The search loop stall to wait for sha256_process's output state as its input. To improve this, you may need to place the search loop in another kernel and use PIPE to transfer state between kernels.

For the use of PIPE, you can take a look at dataflow_pipes example:

https://github.com/Xilinx/SDAccel_Examples/tree/51416734fd694773a2ab4991f027e5c78e09c9a8/getting_started/dataflow/dataflow_pipes_ocl

 

Regards,

Sean

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------

View solution in original post

0 Kudos
Reply
4 Replies
fpgabob
Visitor
Visitor
1,525 Views
Registered: ‎05-07-2018

The 'inline' directive appears to be ignored by the compiler, so I replaced the method definition "inline void sha256_process(uint state[8], uint data[16])" by  "__attribute__((always_inline)) void sha256_process(uint state[8], uint data[16])".

 

In this case, the loop pipelining appears to work correctly with II=1; but then the estimated frequency falls from 340 to 9.9!

 

The following warning is provided in the compilation logs:

WARNING: [SCHED 204-21] Estimated clock period (100ns) exceeds the target (target clock period: 4ns, clock uncertainty: 1.08ns, effective delay budget: 2.92ns).

Is it possible to have a better trade-off between frequency and II?

0 Kudos
Reply
evant
Xilinx Employee
Xilinx Employee
1,487 Views
Registered: ‎09-08-2011
Hi fpgabob,

Could you attach the text file with the code? It's really hard to read with how the formatting on the forum post has it setup.

I suspect if the frequency is falling, it's because you have too large of a section of the design that you have set to complete between clock cycles. So likely a better placement of pipeline or dataflow locations might help. But I would like to be able to understand the structure a little better.
If at first you don't succeed, try redefining success?
0 Kudos
Reply
fpgabob
Visitor
Visitor
1,484 Views
Registered: ‎05-07-2018

Hello Evant,

 

The kernel code is in the file in attachment.

I removed the '__attribute__((always_inline))' attribute since it caused the estimated frequency to drop too much (and consequently, the code was not synthetizable).

 

This code has an Initiation Interval of 48 between consequent hashes which seems very high. By comparison, the sha1 implementation in example (https://github.com/Xilinx/SDAccel_Examples/tree/master/security/sha1) has an II of 1 despite having very similar structure.

 

Maybe the high II is caused by variable dependencies? In this case, is there a way to improve the II, maybe by dupicating some variables or computations.

 

Best regards,

 

Bob

 

 

0 Kudos
Reply
seanz
Moderator
Moderator
1,779 Views
Registered: ‎03-27-2012

Hi,

 

I don't think there is a problem to pipeline sha256_process and reach II=1. However, you can't just unroll the entire loop. Because this will try to parallel all logics inside to create a complex combination network in order to achieve latency=1. That's why the estimated clock frequency fall.

 

As you mentioned, another problem is the data dependence in the search loop. The search loop stall to wait for sha256_process's output state as its input. To improve this, you may need to place the search loop in another kernel and use PIPE to transfer state between kernels.

For the use of PIPE, you can take a look at dataflow_pipes example:

https://github.com/Xilinx/SDAccel_Examples/tree/51416734fd694773a2ab4991f027e5c78e09c9a8/getting_started/dataflow/dataflow_pipes_ocl

 

Regards,

Sean

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------

View solution in original post

0 Kudos
Reply