05-07-2018 01:44 AM - edited 05-07-2018 06:00 AM
Hello,
For a research project, I would like to compare hash cracking performances between cpu, gpu and fpga. The principle is quite straightforward: I provide as input a hash value and initial data, and I try to exhaustively find an input and nonce that collide to the provided hash, using a loop.
I have developed a kernel in OpenCL for the Sha256 hash function but the performances are not very good. I tried to use loop pipelining for the main loop, but the Initiation Interval (II) is 63 which is far from optimal.
For information, the SHA256 hash function uses a compression function which is applied 64 times, and the data uses a message scheduling transformation at each round.
The bad performance may be due to interdependences and intradependences in the search loop, but I don't see how it can be improved. I looked at the optimized SHA1 example which appears to reach II=1 (https://github.com/Xilinx/SDAccel_Examples/tree/51416734fd694773a2ab4991f027e5c78e09c9a8/security/sha1), but I fail to see how it solves the variable dependencies or use any particular optimization.
Here is the kernel code that I use:
#ifndef uint #define uint unsigned int #endif static const uint KEY[64] = { 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5, 0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174, 0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da, 0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967, 0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85, 0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070, 0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3, 0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 }; #define SHR(x,n) ((x & 0xFFFFFFFF) >> n) #define ROTR(x,n) (SHR(x,n) | (x << (32 - n))) #define S0(x) (ROTR(x, 7) ^ ROTR(x,18) ^ SHR(x, 3)) #define S1(x) (ROTR(x,17) ^ ROTR(x,19) ^ SHR(x,10)) #define S2(x) (ROTR(x, 2) ^ ROTR(x,13) ^ ROTR(x,22)) #define S3(x) (ROTR(x, 6) ^ ROTR(x,11) ^ ROTR(x,25)) #define F0(x,y,z) ((x & y) | (z & (x | y))) #define F1(x,y,z) (z ^ (x & (y ^ z))) inline void sha256_process(uint state[8], uint data[16]) { uint temp1, temp2, w; uint A, B, C, D, E, F, G, H; A = 0x6A09E667; B = 0xBB67AE85; C = 0x3C6EF372; D = 0xA54FF53A; E = 0x510E527F; F = 0x9B05688C; G = 0x1F83D9AB; H = 0x5BE0CD19; __attribute__((opencl_unroll_hint)) for (int i = 0; i < 64; ++i) { //message schedule if(i < 16) { w = data[i]; } else { w = data[(i-16) & 15] + S0(data[(i-15) & 15]) + data[(i-7) & 15] + S1(data[(i-2) & 15]); } data[i & 15] = w; temp1 = H + S3(E) + F1(E,F,G) + KEY[i] + w; temp2 = S2(A) + F0(A,B,C); H = G; G = F; F = E; E = D + temp1; D = C; C = B; B = A; A = temp1 + temp2; } state[0] = A+0x6A09E667; state[1] = B+0xBB67AE85; state[2] = C+0x3C6EF372; state[3] = D+0xA54FF53A; state[4] = E+0x510E527F; state[5] = F+0x9B05688C; state[6] = G+0x1F83D9AB; state[7] = H+0x5BE0CD19; } # define NONCEMAX 0xfffffff __attribute__((reqd_work_group_size(1, 1, 1))) kernel void sha256_kernel( global uint * out_data, global const uint * in_hash, global const uint * in_data) { uint state[8] __attribute__((xcl_array_partition(complete, 1))); uint data[16] __attribute__((xcl_array_partition(complete, 1))); uint ldata[16];
uint hash0= in_hash[0];
uint hash1= in_hash[1]; // local copy __attribute__((xcl_pipeline_loop)) LoopInit: for (int i=0; i<16;i++){ldata[i]= in_data[i];} // loop for each hash candidate __attribute__((xcl_pipeline_loop)) LoopNonce: for (uint nonceSuffix = 0; nonceSuffix != NONCEMAX; nonceSuffix++) { __attribute__((xcl_pipeline_loop)) LoopCopyIn: for (int i=0; i<16;i++){ data[i]= ldata[i]; } // set nonce data[12]=nonceSuffix; sha256_process(state, data); // check hash result! if (state[0]==hash0 && state[1]==hash1) { __attribute__((xcl_pipeline_loop)) LoopCopyData: for (int i=0; i<16; i++){ out_data[i]=ldata[i]; } out_data[12]=nonceSuffix; return; } } return; }
The vivado logs provide this information during hw emulation:
INFO: [SCHED 204-61] Pipelining loop 'LoopNonce'. WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 1 cycles). WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 2 cycles). WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 3 cycles). WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 4 cycles). WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 35 cycles). WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 51 cycles). WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 59 cycles). WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 61 cycles). WARNING: [SCHED 204-64] Unable to schedule the loop exit test ('call' operation ('tmp_s') to '__xlnx_cl__Z12check_resultPji') in the first pipeline iteration (II = 62 cycles). INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 63, Depth: 63.
Thanks in advance for your help and suggestions!
05-09-2018 01:39 AM
Hi,
I don't think there is a problem to pipeline sha256_process and reach II=1. However, you can't just unroll the entire loop. Because this will try to parallel all logics inside to create a complex combination network in order to achieve latency=1. That's why the estimated clock frequency fall.
As you mentioned, another problem is the data dependence in the search loop. The search loop stall to wait for sha256_process's output state as its input. To improve this, you may need to place the search loop in another kernel and use PIPE to transfer state between kernels.
For the use of PIPE, you can take a look at dataflow_pipes example:
Regards,
Sean
05-07-2018 05:59 AM
The 'inline' directive appears to be ignored by the compiler, so I replaced the method definition "inline void sha256_process(uint state[8], uint data[16])" by "__attribute__((always_inline)) void sha256_process(uint state[8], uint data[16])".
In this case, the loop pipelining appears to work correctly with II=1; but then the estimated frequency falls from 340 to 9.9!
The following warning is provided in the compilation logs:
WARNING: [SCHED 204-21] Estimated clock period (100ns) exceeds the target (target clock period: 4ns, clock uncertainty: 1.08ns, effective delay budget: 2.92ns).
Is it possible to have a better trade-off between frequency and II?
05-07-2018 01:06 PM
05-07-2018 01:39 PM
Hello Evant,
The kernel code is in the file in attachment.
I removed the '__attribute__((always_inline))' attribute since it caused the estimated frequency to drop too much (and consequently, the code was not synthetizable).
This code has an Initiation Interval of 48 between consequent hashes which seems very high. By comparison, the sha1 implementation in example (https://github.com/Xilinx/SDAccel_Examples/tree/master/security/sha1) has an II of 1 despite having very similar structure.
Maybe the high II is caused by variable dependencies? In this case, is there a way to improve the II, maybe by dupicating some variables or computations.
Best regards,
Bob
05-09-2018 01:39 AM
Hi,
I don't think there is a problem to pipeline sha256_process and reach II=1. However, you can't just unroll the entire loop. Because this will try to parallel all logics inside to create a complex combination network in order to achieve latency=1. That's why the estimated clock frequency fall.
As you mentioned, another problem is the data dependence in the search loop. The search loop stall to wait for sha256_process's output state as its input. To improve this, you may need to place the search loop in another kernel and use PIPE to transfer state between kernels.
For the use of PIPE, you can take a look at dataflow_pipes example:
Regards,
Sean