Anonymous
Not applicable
3,580 Views

expf() takes enormous amount of resources

I have been trying to implement a bilateral filter for my video processing pipeline (pixel-rate processing). The function needs to take the exponential of values from the memory window; the formula is as follows:

 

F = e^( - ((X-Y)^2 ) / Z^2) * P

 

The function below takes 2-3 times more resources than are available on the chip. Is that reasonable? Is there a "correct" way to implement such a function?

 

unsigned char gaussianWeights(hls::Window<5,5,unsigned char> *I)
{
	unsigned char out;
	float FI,F;
	float sumF=0;
	float sumFI=0;
	int row, col;

	for(row=0; row<5; row++){
		for(col=0; col<5; col++){
			F = ( expf( -(float)(  (I->getval(row,col) - I->getval(2,2))*(I->getval(row,col) - I->getval(2,2)) )/0.02 ) * G[row][col] );
			FI = F * I->getval(row,col);
			sumF = sumF+F;
			sumFI = sumFI + FI;
		}
	}

	out = (sumFI/sumF)*255;
	return out;
}
12 Replies
muzaffer
Teacher
3,554 Views
Registered: ‎03-31-2012

@Anonymous which chip are you using and what directives are you adding to the synthesis?

- Please mark the Answer as "Accept as solution" if information provided is helpful.
Give Kudos to a post which you think is helpful and reply oriented.
Anonymous
Not applicable
3,526 Views

@muzaffer I am using the ZC702 board. I am trying to implement a 5x5 pixel-stream filter with two for loops, the inner loop pipelined to II=1, and the function above is called within the inner loop to perform the calculation on the window. As in the snippet above, the function has no directives; I assume it is automatically inlined to meet pixel-rate processing.

 

[Attached image: Capture.PNG]

muzaffer
Teacher
3,516 Views
Registered: ‎03-31-2012

@Anonymous I tried the function you show by itself and its size is quite reasonable (roughly 20% of what you show). I think your outer loops are doing something funny. Can you show the function which is calling the code you instantiate?

 

martin-91x
Observer
3,506 Views
Registered: ‎10-02-2015

Just to be sure: you want to calculate the weights for a Gaussian filter with a kernel size of 5x5?

u4223374
Advisor
3,481 Views
Registered: ‎04-26-2015

If you've pipelined the outer loop, that's unrolled both of your inner loops - so it's doing 25 of those calculations per clock cycle, plus a uint8-to-float conversion, plus a bunch of multiplies, plus a floating-point divide.

 

It appears that the whole exponential calculation only requires a single input value:

I->getval(row,col) - I->getval(2,2)

Since the image is 8-bit, the difference is 9-bit. You could build a 36-bit lookup table for this exponential in a single 18K block RAM - and you've got loads of those available. Even if you need to access 25 per cycle (as you do for fully unrolled loops), that's only 13 RAMs.

Anonymous
Not applicable
3,452 Views

@martin-91x Yes, the window contains the neighboring pixels of the pixel of interest (just like in any video pipeline). I have my Gaussian coefficients precalculated and stored in the array float G[5][5]. What I calculate with the exponential function is the "photometric" weights.
Anonymous
Not applicable
3,446 Views

@u4223374 I don't quite understand your proposed solution. Could you provide more details, please?
martin-91x
Observer
3,423 Views
Registered: ‎10-02-2015

Well, you could create a lookup table containing all possible output values of the exponential function. Then you could use this difference

I->getval(row,col) - I->getval(2,2)

as "input" (with an offset as an array index).

I think this is what he meant; at least, I would go this way :)
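
To make the lookup-table idea concrete, here is a minimal plain C++ sketch of a table indexed by the signed pixel difference plus an offset. The names `PhotometricLut`, `kOffset`, and `kSigma2` are hypothetical, and the normalization of the difference to [-1, 1] before dividing by 0.02 is an assumption rather than something stated in the thread. In an HLS design the table would more likely be a constant array that the tool can map to a ROM; this version only illustrates the indexing.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical constants (assumptions, not taken from the thread):
// the difference is normalized by 255 and divided by 0.02.
constexpr int   kOffset  = 255;              // shifts -255..255 to 0..510
constexpr int   kLutSize = 2 * kOffset + 1;  // 511 possible differences
constexpr float kSigma2  = 0.02f;

struct PhotometricLut {
    float table[kLutSize];

    // Fill the table once; no exp() is evaluated per pixel afterwards.
    PhotometricLut() {
        for (int d = -kOffset; d <= kOffset; ++d) {
            float x = static_cast<float>(d) / 255.0f;
            table[d + kOffset] = std::exp(-(x * x) / kSigma2);
        }
    }

    // The signed difference plus the offset is the array index.
    float weight(uint8_t pixel, uint8_t center) const {
        int d = static_cast<int>(pixel) - static_cast<int>(center);
        return table[d + kOffset];
    }
};
```

Because the weight depends only on the square of the difference, indexing by the absolute difference instead would shrink the table to 256 entries.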

Anonymous
Not applicable
3,439 Views

Yep, I actually started doing it this way. I did not realize that there were only 256 different values. At this point, I still wonder whether there is a better implementation that would use far fewer resources if I did want to calculate the values in real time.
Anonymous
Not applicable
2,270 Views

So, I created H[256] and G[5][5] arrays of precalculated exponential values for all cases and eliminated the expf() function, but my resource usage is almost as insane as before. Obviously, the problem was not in that function. Should there be specific directives for those arrays? Also, could someone please comment on the content of the arrays? I generated the values externally, but I am not sure if the format is correct.

 

unsigned char gaussianWeights(hls::Window<5,5,unsigned char> *w)
{
	unsigned char out;
	ap_ufixed<16,8,AP_TRN_ZERO,AP_SAT> Ic;
	float F,FI,If;
	int row,col;
	int indx;
	float sumF=0;
	float sumFI=0;

	for(row=0; row<5; row++){
		for(col=0; col<5; col++){
			indx = abs(w->getval(row,col) - w->getval(2,2));
			F = H[indx] * G[row][col];
			Ic = w->getval(row,col);
			If = (float)(Ic>>8);
			FI = F * If;
			sumF = sumF+F;
			sumFI = sumFI + FI;
		}
	}

	out = (sumFI/sumF)*255;
	return out;
}

const float G[5][5] =	{	{ 0.6411804, 0.7574651, 0.8007374, 0.7574651, 0.6411804},
				{ 0.7574651, 0.8948393, 0.9459595, 0.8948393, 0.7574651},
				{ 0.8007374, 0.9459595, 1.0000000, 0.9459595, 0.8007374},
				{ 0.7574651, 0.8948393, 0.9459595, 0.8948393, 0.7574651},
				{ 0.6411804, 0.7574651, 0.8007374, 0.7574651, 0.6411804}};
const float H[256] = {1,0.99923,0.99693,0.9931,0.98777,0.98096,0.9727,0.96302,0.95198,0.93962,0.92599,0.91116,0.89518,0.87814,
		0.8601,0.84113,0.82132,0.80074,0.77947,0.75761,0.73523,0.71241,0.68924,0.6658,0.64217,0.61842,0.59464,
		0.57089,0.54725,0.52378,0.50055,0.47762,0.45503,0.43285,0.41111,0.38987,0.36915,0.349,0.32945,0.31051,
		0.29221,0.27456,0.25759,0.24129,0.22568,0.21075,0.19651,0.18294,0.17006,0.15783,0.14626,0.13534,0.12503,
		0.11533,0.10622,0.097683,0.089691,0.082227,0.075268,0.068792,0.062777,0.0572,0.052038,0.047269,0.042871,
		0.038823,0.035103,0.03169,0.028565,0.025709,0.023103,0.020729,0.018571,0.016612,0.014836,0.01323,0.01178,
		0.010472,0.0092957,0.0082386,0.0072905,0.0064416,0.0056828,0.0050056,0.0044024,0.0038659,0.0033896,
		0.0029674,0.0025938,0.0022637,0.0019727,0.0017164,0.0014911,0.0012934,0.0011201,0.00096862,0.00083632,
		0.00072097,0.00062058,0.00053335,0.00045768,0.00039213,0.00033546,0.00028654,0.00024438,0.0002081,
		0.00017693,0.0001502,0.00012731,0.00010775,9.1049e-05,7.682e-05,6.4715e-05,5.4433e-05,4.5715e-05,
		3.8334e-05,3.2096e-05,2.6831e-05,2.2395e-05,1.8664e-05,1.5531e-05,1.2904e-05,1.0705e-05,8.8666e-06,
		7.3329e-06,6.0551e-06,4.9923e-06,4.1097e-06,3.378e-06,2.7723e-06,2.2717e-06,1.8586e-06,1.5183e-06,
		1.2384e-06,1.0085e-06,8.201e-07,6.6584e-07,5.3976e-07,4.3688e-07,3.5307e-07,2.849e-07,2.2954e-07,1.8465e-07,
		1.4831e-07,1.1894e-07,9.5241e-08,7.6146e-08,6.0785e-08,4.8449e-08,3.8557e-08,3.0638e-08,2.4307e-08,1.9255e-08,
		1.523e-08,1.2028e-08,9.484e-09,7.4668e-09,5.8696e-09,4.607e-09,3.6104e-09,2.8251e-09,2.2071e-09,1.7217e-09,
		1.341e-09,1.0429e-09,8.0978e-10,6.2782e-10,4.8599e-10,3.7563e-10,2.8988e-10,2.2336e-10,1.7184e-10,1.3201e-10,
		1.0125e-10,7.7536e-11,5.9287e-11,4.5263e-11,3.4503e-11,2.6261e-11,1.9957e-11,1.5143e-11,1.1472e-11,8.6782e-12,
		6.5545e-12,4.9429e-12,3.7218e-12,2.7981e-12,2.1004e-12,1.5743e-12,1.1781e-12,8.8026e-13,6.5672e-13,4.8919e-13,
		3.6384e-13,2.7019e-13,2.0034e-13,1.4832e-13,1.0964e-13,8.0919e-14,5.9631e-14,4.3876e-14,3.2234e-14,2.3645e-14,
		1.7318e-14,1.2664e-14,9.2468e-15,6.7413e-15,4.9071e-15,3.5665e-15,2.5881e-15,1.8752e-15,1.3567e-15,9.7996e-16,
		7.0678e-16,5.0897e-16,3.6595e-16,2.6272e-16,1.8832e-16,1.3478e-16,9.6316e-17,6.8722e-17,4.8959e-17,3.4825e-17,
		2.4734e-17,1.7539e-17,1.2419e-17,8.7794e-18,6.1971e-18,4.3676e-18,3.0735e-18,2.1595e-18,1.515e-18,1.0612e-18,
		7.4217e-19,5.1826e-19,3.6135e-19,2.5156e-19,1.7486e-19,1.2136e-19,8.4095e-20,5.8185e-20,4.0196e-20,2.7726e-20,
		1.9095e-20,1.3131e-20,9.0157e-21,6.1806e-21,4.2305e-21,2.8913e-21,1.973e-21,1.3443e-21,9.1449e-22,6.2116e-22,
		4.2127e-22,2.8527e-22,1.9287e-22};
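
On the question of whether the array contents look right: the posted numbers appear consistent with two closed-form expressions, which the sketch below uses to regenerate individual entries for comparison. The formulas are inferred from the values themselves and are not stated anywhere in the thread, and the helper names (`photometricWeight`, `spatialWeight`, `closeTo`) are hypothetical.

```cpp
#include <cmath>

// Assumed formulas, inferred from the posted numbers:
//   H[d]    = exp(-(d / 255)^2 / 0.02)
//   G[r][c] = exp(-((r - 2)^2 + (c - 2)^2) / 18)   (i.e. sigma = 3)

// Photometric weight for an absolute pixel difference d in 0..255.
inline double photometricWeight(int d) {
    double x = d / 255.0;
    return std::exp(-(x * x) / 0.02);
}

// Spatial weight for a position in the 5x5 window (center at 2,2).
inline double spatialWeight(int row, int col) {
    double dr = row - 2;
    double dc = col - 2;
    return std::exp(-(dr * dr + dc * dc) / 18.0);
}

// True when a regenerated value agrees with a posted value to within
// a relative tolerance (the post shows about five significant digits).
inline bool closeTo(double computed, double posted, double relTol = 1e-4) {
    return std::fabs(computed - posted) <= relTol * std::fabs(posted);
}
```

For example, `closeTo(photometricWeight(10), 0.92599)` and `closeTo(spatialWeight(0, 0), 0.6411804)` both hold, and the last H entry matches exp(-50), so under these assumptions the tables look correctly generated.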

 

muzaffer
Teacher
2,248 Views
Registered: ‎03-31-2012

@Anonymous your problem is already diagnosed by @u4223374 "If you've pipelined the outer loop, that's unrolled both of your inner loops - so it's doing 25 of those calculations per clock cycle, plus a uint8-to-float conversion, plus a bunch of multiplies, plus a floating-point divide."

 

It might help if you partition the H and G arrays fully. You need to simplify the gaussianWeights function in its fully unrolled form.

martin-91x
Observer
2,235 Views
Registered: ‎10-02-2015

Following up on @muzaffer's post: I don't know what precision you need, but you could switch to integer math for your calculations and use float only at the end. One possibility is to map your arrays to a specific integer range (e.g. H to 16-bit integers: 1 => 65535, ..., 1/65535 = 1.526e-5 => 1).
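
A rough sketch of what that integer version could look like in plain C++, assuming both weight tables are pre-scaled to 16 bits. The names `toQ16` and `bilateralPixel`, the plain-array window, and the accumulator widths are all hypothetical choices for illustration, not code from the thread, and it has not been synthesized.

```cpp
#include <cstdint>
#include <cstdlib>

// Scale a weight in [0, 1] to a 16-bit integer (1.0 -> 65535),
// rounding to nearest. Weights below about 1/131070 become 0.
inline uint16_t toQ16(float w) {
    float scaled = w * 65535.0f + 0.5f;
    return static_cast<uint16_t>(scaled);
}

// Integer-only weighted average over a 5x5 window.
// Hq: photometric weights indexed by absolute difference (16-bit scaled).
// Gq: spatial weights (16-bit scaled).
uint8_t bilateralPixel(const uint8_t win[5][5],
                       const uint16_t Hq[256],
                       const uint16_t Gq[5][5]) {
    uint64_t sumF = 0;   // sum of weights
    uint64_t sumFI = 0;  // sum of weight * pixel
    const int center = win[2][2];

    for (int r = 0; r < 5; ++r) {
        for (int c = 0; c < 5; ++c) {
            int d = std::abs(static_cast<int>(win[r][c]) - center);
            // Product of two 16-bit weights fits in 32 bits.
            uint32_t f = static_cast<uint32_t>(Hq[d]) * Gq[r][c];
            sumF += f;
            sumFI += static_cast<uint64_t>(f) * win[r][c];
        }
    }

    // Guard against all-zero tables; fall back to the center pixel.
    if (sumF == 0) return win[2][2];

    // One rounded integer divide per output pixel.
    return static_cast<uint8_t>((sumFI + sumF / 2) / sumF);
}
```

The loop body then contains only integer multiplies and adds, with a single integer divide at the end. The trade-off is that very small table entries quantize to zero, which may or may not matter for an 8-bit output.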
