10-22-2019 10:03 AM - edited 10-22-2019 10:35 AM
I'm trying to implement a cube root function for floating-point numbers, which is not available in the IP catalog. I basically followed the Newton-Raphson method as proposed in the post here.
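For reference, this is the update I am computing. It is the textbook Newton-Raphson step applied to f(x) = x^3 - a, where a is the input and x_n is the current guess. The symbols a and x_n are my own notation; the two terms on the right correspond to part_1_out and part_2_out in the code comments.

```latex
f(x) = x^{3} - a, \qquad f'(x) = 3x^{2}

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
        = x_n - \frac{x_n^{3} - a}{3x_n^{2}}
        = \frac{2}{3}\,x_n + \frac{a}{3x_n^{2}}
```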
I used the single-precision FP IP on a VU9P to implement the FP add (14 cc latency), multiply (9 cc latency) and divide (31 cc latency) operations. However, because of the data dependencies between them, these operations cannot be executed in parallel. My Verilog code is attached.
Based on my simulation, I found that even with a proper lookup-table implementation for the initial guess, it still needs a few (two or more) iterations to get a relatively accurate value, which takes a few hundred cycles to finish.
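To make the iteration count concrete, here is a minimal software sketch of the same update with every intermediate result rounded to single precision. It is a behavioral model only, not the hardware: the helper names (f32, cbrt_step, cbrt_newton) are made up for illustration, and the starting guesses are arbitrary stand-ins for whatever the lookup table would return.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


TWO_THIRDS = f32(2.0 / 3.0)  # constant fed to the first multiplier
THREE = f32(3.0)


def cbrt_step(a: float, x: float) -> float:
    """One update x_next = (2/3)*x + a / (3*x*x), rounded after every operation."""
    part_1 = f32(TWO_THIRDS * x)   # multiply
    sq = f32(x * x)                # multiply
    denom = f32(sq * THREE)        # multiply
    part_2 = f32(a / denom)        # divide
    return f32(part_1 + part_2)    # add


def cbrt_newton(a: float, guess: float, iterations: int) -> float:
    """Apply the update `iterations` times, starting from `guess`."""
    a = f32(a)
    x = f32(guess)
    for _ in range(iterations):
        x = cbrt_step(a, x)
    return x


if __name__ == "__main__":
    exact = 2.0 ** (1.0 / 3.0)
    for n in range(1, 5):
        approx = cbrt_newton(2.0, 1.25, n)
        print(n, approx, abs(approx - exact))
```

Because the error roughly squares on each step, a guess that is good to about two decimal digits should reach single-precision accuracy in about two iterations, while a coarse guess needs three or four. That matches what I see in simulation.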
By comparison, IPs like FP exponential (32 cc latency), FP log (31 cc) and FP square root (31 cc) return much better accuracy with much shorter latency. Can anyone shed some light on how those IPs calculate their values? What kind of algorithm do they use? Is there a better way to accelerate my cube root implementation?
module cbrt_inner_loop_single #(
    parameter DATA_WIDTH = 32
) (
    input                   clk,
    input                   rstn,
    input  [DATA_WIDTH-1:0] in_cbrt_raw,
    input                   in_cbrt_raw_valid,
    input  [DATA_WIDTH-1:0] in_cbrt_guess,
    input                   in_cbrt_guess_valid,
    output [DATA_WIDTH-1:0] out_cbrt_raw,
    output                  out_cbrt_raw_valid
);

    // Stage 1: part_1_out = 2/3 * in_cbrt_guess
    // Latency: 9 cycles
    FP_Single_Mul FP_Part_1( .... );

    // Delay part_1_out from FP_Part_1 by 9+31 = 40 cycles
    // Wait for Part_2_tmp_2 to finish
    delay_register delay_part_1 .... );

    // Stage 1: part_2_tmp_1 = in_cbrt_guess ^ 2
    // Latency: 9 cycles
    FP_Single_Mul FP_Part_2_1( ... );

    // Stage 2: part_2_tmp_2 = part_2_tmp_1 * 3
    // Latency: 9 cycles
    FP_Single_Mul FP_Part_2_2( ... );

    // Delay cbrt_raw from input by 9*2 = 18 cycles
    // Wait for Part_2_tmp_1 and Part_2_tmp_2 to finish
    delay_register delay_cbrt_raw ( ... );

    // Stage 3: part_2_out = in_cbrt_raw / part_2_tmp_2
    // Latency: 31 cycles
    FP_Single_Div FP_Part_2_3( ... );

    // Stage 4: out_cbrt_out = part_1_out + part_2_out
    // Latency: 14 cycles
    FP_Single_Add FP_Part_3( ... );

    // Delay register, propagate cbrt raw input to output, as next stage's input
    // Latency: module latency 63 cycles
    delay_register delay_cbrt_raw_passthrough ( ... );

endmodule
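The cycle counts in the comments can be checked with a few lines of arithmetic. This sketch only models the dependency chain of one iteration using the IP latencies quoted above; the function names are hypothetical, and it assumes iterations are chained back to back with no extra cycles for the initial-guess lookup.

```python
# Latencies (in clock cycles) of the single-precision IP cores, as quoted in the post.
MUL_LATENCY = 9
DIV_LATENCY = 31
ADD_LATENCY = 14


def iteration_latency() -> int:
    """Critical-path latency of one Newton-Raphson iteration."""
    # Branch 1: part_1_out = 2/3 * guess (one multiply).
    part_1_ready = MUL_LATENCY
    # Branch 2: guess^2, then * 3, then raw / (3 * guess^2).
    part_2_ready = MUL_LATENCY + MUL_LATENCY + DIV_LATENCY
    # The final add can only start once both branches have finished.
    return max(part_1_ready, part_2_ready) + ADD_LATENCY


def part_1_delay() -> int:
    """Cycles part_1_out must be held so that it lines up with part_2_out."""
    return (MUL_LATENCY + MUL_LATENCY + DIV_LATENCY) - MUL_LATENCY


def total_latency(iterations: int) -> int:
    """Latency of `iterations` chained copies of the inner loop."""
    return iterations * iteration_latency()


if __name__ == "__main__":
    print("per iteration:", iteration_latency())
    for n in (2, 3, 4):
        print(n, "iterations:", total_latency(n))
```

One iteration comes out at 63 cycles, so two to four iterations cost 126 to 252 cycles. The two multiplies and the divide on the second branch make up 49 of the 63 cycles, so that serial chain is what dominates the latency.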
10-22-2019 02:38 PM