12-16-2016 05:20 AM
I have been trying to implement a bilateral filter for my video processing pipeline (pixel-rate processing). The function needs to take the exponential of values from the memory window; the formula is as follows:
F = e^( - ((X-Y)^2 ) / Z^2) * P
The function below takes 2-3 times more resources than are available on the chip. Is that reasonable? Is there a "correct" way to implement such a function?
unsigned char gaussianWeights(hls::Window<5,5,unsigned char> *I)
{
    unsigned char out;
    float FI, F;
    float sumF = 0;
    float sumFI = 0;
    int row, col;
    for (row = 0; row < 5; row++) {
        for (col = 0; col < 5; col++) {
            F = ( expf( -(float)( (I->getval(row,col) - I->getval(2,2))*(I->getval(row,col) - I->getval(2,2)) )/0.02 ) * G[row][col] );
            FI = F * I->getval(row,col);
            sumF = sumF + F;
            sumFI = sumFI + FI;
        }
    }
    out = (sumFI/sumF)*255;
    return out;
}
12-16-2016 02:57 PM
@Anonymous which chip are you using and what directives are you adding to the synthesis?
12-17-2016 06:09 AM
@muzaffer I am using a ZC702 board and trying to implement a 5x5 filter on a pixel stream. I have two for loops, with the inner loop pipelined to II=1, and the function above is called within the inner loop to perform the calculation on the window. As in the snippet above, this function has no directives; I assume it is automatically inlined to meet pixel-rate processing.
12-17-2016 07:48 AM
@Anonymous I tried the function you show by itself and its size is quite reasonable (roughly 20% of what you show). I think your outer loops are doing something funny. Can you show the function which is calling the code you instantiate?
12-17-2016 11:14 AM
Just to be sure: You want to calculate the weights for a gaussian filter with a Kernel size of 5x5?
12-18-2016 02:22 AM
If you've pipelined the outer loop, that's unrolled both of your inner loops - so it's doing 25 of those calculations per clock cycle, plus a uint8-to-float conversion, plus a bunch of multiplies, plus a floating-point divide.
It appears that the whole exponential calculation only requires a single input value:
I->getval(row,col) - I->getval(2,2)
Since the image is 8-bit, the difference is 9-bit. You could build a lookup table with 36-bit entries for this exponential in a single 18K block RAM (512 entries x 36 bits) - and you've got loads of those available. Even if you need to access 25 per cycle (as you do for fully unrolled loops), that's only 13 RAMs, since each dual-port block RAM can serve two reads per cycle.
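The sizing above can be checked with a few compile-time constants. This is only an arithmetic sketch; the constant names are made up for illustration, and the two-reads-per-RAM figure assumes both ports of a dual-port block RAM are used for lookups.

```cpp
// Sizing sketch for the lookup-table approach (all names are illustrative).
constexpr int kDiffBits     = 9;                          // signed difference of two 8-bit pixels
constexpr int kLutDepth     = 1 << kDiffBits;             // 512 addressable entries
constexpr int kLutWidthBits = 36;                         // word width of an 18K block RAM at depth 512
constexpr int kBitsPerLut   = kLutDepth * kLutWidthBits;  // 18432 bits

constexpr int kLookupsPerCycle = 25;  // 5x5 window with both loops fully unrolled
constexpr int kPortsPerRam     = 2;   // assumption: both RAM ports are used as read ports
constexpr int kRamsNeeded =
    (kLookupsPerCycle + kPortsPerRam - 1) / kPortsPerRam;  // ceil(25 / 2)

static_assert(kBitsPerLut == 18 * 1024, "one table copy fills one 18K block RAM");
static_assert(kRamsNeeded == 13, "25 lookups per cycle over 2 ports need 13 copies");
```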
12-19-2016 07:01 AM
Well, you could create a lookup table containing all possible output values of the exponential function. Then you could use this difference
I->getval(row,col) - I->getval(2,2)
as the "input" (with an offset added, so that it can be used as an array index).
I think this is what he meant - at least I would go this way :)
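A minimal sketch of that offset indexing in plain C++ (no HLS headers) is shown here. The names `buildSignedLut`, `lookupWeight` and the parameter `denom` are hypothetical; `denom` stands in for the Z^2 term of the formula in the first post.

```cpp
#include <array>
#include <cmath>

constexpr int kMaxDiff = 255;               // largest |a - b| for 8-bit pixels
constexpr int kLutSize = 2 * kMaxDiff + 1;  // 511 entries, one per signed difference

// Fill the table once. `denom` is a hypothetical tuning parameter (Z^2).
std::array<float, kLutSize> buildSignedLut(float denom) {
    std::array<float, kLutSize> lut{};
    for (int d = -kMaxDiff; d <= kMaxDiff; ++d) {
        // Shift by kMaxDiff so that d = -255 lands at index 0.
        lut[d + kMaxDiff] = std::exp(-static_cast<float>(d * d) / denom);
    }
    return lut;
}

// Replace the expf() call with one table read per window pixel.
float lookupWeight(const std::array<float, kLutSize>& lut,
                   unsigned char pixel, unsigned char centre) {
    const int diff = static_cast<int>(pixel) - static_cast<int>(centre);
    return lut[diff + kMaxDiff];
}
```

Because the weight depends only on the squared difference, a 256-entry table indexed by the absolute difference would work just as well and halves the table size.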
12-19-2016 10:45 AM
So, I created arrays H[256] and G[5][5] of precalculated exponential values for all cases and eliminated the expf() call, but my resource usage is almost as insane as before. Obviously, the problem was not in that function. Should there be specific directives for those arrays? Also, could someone please comment on the content of the arrays? I generated the values externally, but I am not sure if the format is correct.
unsigned char gaussianWeights(hls::Window<5,5,unsigned char> *w)
{
    unsigned char out;
    ap_ufixed<16,8,AP_TRN_ZERO,AP_SAT> Ic;
    float F, FI, If;
    int row, col;
    int indx;
    float sumF = 0;
    float sumFI = 0;
    for (row = 0; row < 5; row++) {
        for (col = 0; col < 5; col++) {
            indx = abs(w->getval(row,col) - w->getval(2,2));
            F = H[indx] * G[row][col];
            Ic = w->getval(row,col);
            If = (float)(Ic>>8);
            FI = F * If;
            sumF = sumF + F;
            sumFI = sumFI + FI;
        }
    }
    out = (sumFI/sumF)*255;
    return out;
}
const float G[5][5] = {
    { 0.6411804, 0.7574651, 0.8007374, 0.7574651, 0.6411804},
    { 0.7574651, 0.8948393, 0.9459595, 0.8948393, 0.7574651},
    { 0.8007374, 0.9459595, 1.0000000, 0.9459595, 0.8007374},
    { 0.7574651, 0.8948393, 0.9459595, 0.8948393, 0.7574651},
    { 0.6411804, 0.7574651, 0.8007374, 0.7574651, 0.6411804}};
const float H[256] = {1,0.99923,0.99693,0.9931,0.98777,0.98096,0.9727,0.96302,0.95198,0.93962,0.92599,0.91116,0.89518,0.87814, 0.8601,0.84113,0.82132,0.80074,0.77947,0.75761,0.73523,0.71241,0.68924,0.6658,0.64217,0.61842,0.59464, 0.57089,0.54725,0.52378,0.50055,0.47762,0.45503,0.43285,0.41111,0.38987,0.36915,0.349,0.32945,0.31051, 0.29221,0.27456,0.25759,0.24129,0.22568,0.21075,0.19651,0.18294,0.17006,0.15783,0.14626,0.13534,0.12503, 0.11533,0.10622,0.097683,0.089691,0.082227,0.075268,0.068792,0.062777,0.0572,0.052038,0.047269,0.042871, 0.038823,0.035103,0.03169,0.028565,0.025709,0.023103,0.020729,0.018571,0.016612,0.014836,0.01323,0.01178, 0.010472,0.0092957,0.0082386,0.0072905,0.0064416,0.0056828,0.0050056,0.0044024,0.0038659,0.0033896, 0.0029674,0.0025938,0.0022637,0.0019727,0.0017164,0.0014911,0.0012934,0.0011201,0.00096862,0.00083632, 0.00072097,0.00062058,0.00053335,0.00045768,0.00039213,0.00033546,0.00028654,0.00024438,0.0002081, 0.00017693,0.0001502,0.00012731,0.00010775,9.1049e-05,7.682e-05,6.4715e-05,5.4433e-05,4.5715e-05, 3.8334e-05,3.2096e-05,2.6831e-05,2.2395e-05,1.8664e-05,1.5531e-05,1.2904e-05,1.0705e-05,8.8666e-06, 7.3329e-06,6.0551e-06,4.9923e-06,4.1097e-06,3.378e-06,2.7723e-06,2.2717e-06,1.8586e-06,1.5183e-06, 1.2384e-06,1.0085e-06,8.201e-07,6.6584e-07,5.3976e-07,4.3688e-07,3.5307e-07,2.849e-07,2.2954e-07,1.8465e-07, 1.4831e-07,1.1894e-07,9.5241e-08,7.6146e-08,6.0785e-08,4.8449e-08,3.8557e-08,3.0638e-08,2.4307e-08,1.9255e-08, 1.523e-08,1.2028e-08,9.484e-09,7.4668e-09,5.8696e-09,4.607e-09,3.6104e-09,2.8251e-09,2.2071e-09,1.7217e-09, 1.341e-09,1.0429e-09,8.0978e-10,6.2782e-10,4.8599e-10,3.7563e-10,2.8988e-10,2.2336e-10,1.7184e-10,1.3201e-10, 1.0125e-10,7.7536e-11,5.9287e-11,4.5263e-11,3.4503e-11,2.6261e-11,1.9957e-11,1.5143e-11,1.1472e-11,8.6782e-12, 6.5545e-12,4.9429e-12,3.7218e-12,2.7981e-12,2.1004e-12,1.5743e-12,1.1781e-12,8.8026e-13,6.5672e-13,4.8919e-13, 
3.6384e-13,2.7019e-13,2.0034e-13,1.4832e-13,1.0964e-13,8.0919e-14,5.9631e-14,4.3876e-14,3.2234e-14,2.3645e-14, 1.7318e-14,1.2664e-14,9.2468e-15,6.7413e-15,4.9071e-15,3.5665e-15,2.5881e-15,1.8752e-15,1.3567e-15,9.7996e-16, 7.0678e-16,5.0897e-16,3.6595e-16,2.6272e-16,1.8832e-16,1.3478e-16,9.6316e-17,6.8722e-17,4.8959e-17,3.4825e-17, 2.4734e-17,1.7539e-17,1.2419e-17,8.7794e-18,6.1971e-18,4.3676e-18,3.0735e-18,2.1595e-18,1.515e-18,1.0612e-18, 7.4217e-19,5.1826e-19,3.6135e-19,2.5156e-19,1.7486e-19,1.2136e-19,8.4095e-20,5.8185e-20,4.0196e-20,2.7726e-20, 1.9095e-20,1.3131e-20,9.0157e-21,6.1806e-21,4.2305e-21,2.8913e-21,1.973e-21,1.3443e-21,9.1449e-22,6.2116e-22, 4.2127e-22,2.8527e-22,1.9287e-22};
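On the question about the table contents: one way to check them is to regenerate the table in software and compare. The sketch below assumes the pixel difference is normalised by 255 and divided by 0.02, as in the expf() call of the first post; under that assumption the first few entries and the last entry of the posted H appear to match (for example H[1] = 0.99923, H[30] = 0.50055, H[255] = 1.9287e-22). The function name `makeRangeTable` is hypothetical.

```cpp
#include <array>
#include <cmath>

// Regenerate the 256-entry range table indexed by |pixel - centre|.
// Assumption: differences are normalised to [0, 1] before squaring,
// and 0.02 is the denominator taken from the original expf() call.
std::array<float, 256> makeRangeTable() {
    std::array<float, 256> h{};
    for (int d = 0; d < 256; ++d) {
        const float x = static_cast<float>(d) / 255.0f;  // normalised difference
        h[d] = std::exp(-(x * x) / 0.02f);
    }
    return h;
}
```

Note that most of the upper half of the table is far below the resolution of an 8-bit output, so a coarser number format than float may be sufficient for these entries.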
12-19-2016 02:21 PM
@Anonymous Your problem has already been diagnosed by @u4223374: "If you've pipelined the outer loop, that's unrolled both of your inner loops - so it's doing 25 of those calculations per clock cycle, plus a uint8-to-float conversion, plus a bunch of multiplies, plus a floating-point divide."
It might help to partition the H and G arrays fully. You also need to simplify the gaussianWeights function in its fully unrolled form.
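A sketch of what full partitioning could look like with the Vivado HLS ARRAY_PARTITION pragma is given here. The function name `rangeSpatialSum` and its signature are made up for illustration, and passing the tables as arguments is an assumption; an ordinary C++ compiler ignores the pragmas, so the code also runs in C simulation. Whether full partitioning of a 256-entry table actually reduces resources in this design has to be checked in the synthesis report.

```cpp
#include <cstdlib>

// Hypothetical helper: sum of range * spatial weights over a 5x5 window.
float rangeSpatialSum(const unsigned char win[5][5],
                      const float H[256], const float G[5][5]) {
// Split the tables into individual registers so that all 25 reads
// can happen in the same cycle (dim=0 partitions every dimension).
#pragma HLS ARRAY_PARTITION variable=H complete dim=1
#pragma HLS ARRAY_PARTITION variable=G complete dim=0
    float sum = 0.0f;
    for (int r = 0; r < 5; ++r) {
#pragma HLS UNROLL
        for (int c = 0; c < 5; ++c) {
#pragma HLS UNROLL
            const int idx = std::abs(static_cast<int>(win[r][c]) -
                                     static_cast<int>(win[2][2]));
            sum += H[idx] * G[r][c];
        }
    }
    return sum;
}
```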
12-19-2016 02:43 PM
Following up on @muzaffer's post: I don't know what precision you need, but you could switch to integer math for your calculations and use float only at the end. One possibility is to map your arrays to a specific integer range (e.g. H to 16-bit integers, so that 1 => 65535 and 1/65535 = 1.526e-5 => 1).
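A minimal sketch of that integer mapping in plain C++ follows. The names `quantize16` and `bilateralFixed` are hypothetical, the 16-bit scaling of both tables is an assumption, and the final step here is an integer divide rather than a float one; the accumulator widths would need tuning for a real design.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>

// Map a weight in [0, 1] to a 16-bit integer (1.0 => 65535).
uint16_t quantize16(float w) {
    return static_cast<uint16_t>(std::lround(w * 65535.0f));
}

// Weighted average of a 5x5 window using integer arithmetic only.
// Hq and Gq are assumed to hold the quantised range and spatial weights.
unsigned char bilateralFixed(const unsigned char win[5][5],
                             const uint16_t Hq[256],
                             const uint16_t Gq[5][5]) {
    uint64_t sumW = 0;   // sum of weights
    uint64_t sumWI = 0;  // sum of weight * pixel
    for (int r = 0; r < 5; ++r) {
        for (int c = 0; c < 5; ++c) {
            const int idx = std::abs(static_cast<int>(win[r][c]) -
                                     static_cast<int>(win[2][2]));
            // Cast first so the product is computed in 32 bits, not int.
            const uint32_t w = static_cast<uint32_t>(Hq[idx]) * Gq[r][c];
            sumW += w;
            sumWI += static_cast<uint64_t>(w) * win[r][c];
        }
    }
    // The centre pixel has idx = 0, so sumW > 0 as long as Hq[0] and
    // Gq[2][2] are non-zero. Add sumW / 2 to round to nearest.
    return static_cast<unsigned char>((sumWI + sumW / 2) / sumW);
}
```

Since the output is already in the 0..255 pixel range, no final scaling by 255 is needed in this form.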