Visitor
221 Views
Registered: ‎10-08-2019

## How to accelerate the floating point implementation for cubic root on VU9P

Greetings!

I'm trying to implement a cubic root function for floating-point numbers, which is not available in the IP catalog. I basically followed the Newton-Raphson method as proposed in the post here.

I used the single-precision FP IP on the VU9P to implement the FP add (14 cc latency), multiply (9 cc latency), and divide (31 cc latency) functions, where cc means clock cycles. However, because of the data dependencies between them, these operations cannot all be executed in parallel. My Verilog code is attached.

Based on my simulation, I found that even with a proper table-lookup implementation for the initial guess, it still needs a few (two or more) iterations to get a relatively accurate value, which takes a few hundred cycles to finish.
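To make the iteration count concrete, here is a minimal software sketch of the update implied by the stage comments in the attached module, x ← (2/3)·x + a / (3·x²). It is a double-precision Python model, not the hardware itself, and `initial_guess` is a hypothetical exponent-based stand-in for the table lookup mentioned above (the actual table is not shown in the post).

```python
import math

def cbrt_newton(a, guess, iterations):
    """Apply x <- (2/3)*x + a / (3*x*x) a fixed number of times."""
    x = guess
    for _ in range(iterations):
        x = (2.0 / 3.0) * x + a / (3.0 * x * x)
    return x

def initial_guess(a):
    """Hypothetical coarse guess for a > 0: 2 ** round(exponent / 3).

    This is an assumption standing in for the post's table lookup.
    """
    _, exponent = math.frexp(a)  # a = m * 2**exponent with 0.5 <= m < 1
    return math.ldexp(1.0, round(exponent / 3))

def relative_error(a, iterations):
    """Relative error of the model against a double-precision reference."""
    exact = a ** (1.0 / 3.0)
    approx = cbrt_newton(a, initial_guess(a), iterations)
    return abs(approx - exact) / exact

if __name__ == "__main__":
    for n in range(5):
        print(n, relative_error(10.0, n))
```

Because Newton-Raphson converges quadratically once the guess is close, the relative error is roughly squared on each pass, so a more accurate starting table directly reduces the number of iterations needed to reach single-precision accuracy.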

In comparison, IPs like FP exponential (32 cc latency), FP log (31 cc), and FP square root (31 cc) return much better accuracy with shorter latency. Can anyone shed some light on how those IPs calculate their values? What kind of algorithm do they use? Is there a better way to accelerate my cubic root implementation?

Thank you!

```
module cbrt_inner_loop_single
#(
parameter DATA_WIDTH = 32
)
(
input clk,
input rstn,
input [DATA_WIDTH-1:0] in_cbrt_raw,
input in_cbrt_raw_valid,
input [DATA_WIDTH-1:0] in_cbrt_guess,
input in_cbrt_guess_valid,
output [DATA_WIDTH-1:0] out_cbrt_raw,
output out_cbrt_raw_valid
);
// Stage 1: part_1_out = 2/3 * in_cbrt_guess
// Latency: 9 cycles
FP_Single_Mul FP_Part_1(
....
);

// Delay part_1_out from FP_Part_1 by 9+31 = 40 cycles
// Wait for Part_2_tmp_2 to finish
delay_register delay_part_1
(
....
);

// Stage 1: part_2_tmp_1 = in_cbrt_guess ^ 2
// Latency: 9 cycles
FP_Single_Mul FP_Part_2_1(
...
);

// Stage 2: part_2_tmp_2 = part_2_tmp_1 * 3
// Latency: 9 cycles
FP_Single_Mul FP_Part_2_2(
...
);

// Delay cbrt_raw from input by 9*2 = 18 cycles
// Wait for Part_2_tmp_1 and Part_2_tmp_2 finish
delay_register delay_cbrt_raw
(
...
);

// Stage 3: part_2_out = in_cbrt_raw / part_2_tmp_2
// Latency: 31 cycles
FP_Single_Div FP_Part_2_3(
...
);

// Stage 4: out_cbrt_out = part_1_out + part_2_out
// Latency: 14 cycles
...
);

// Delay register, propagate cbrt raw input to output, as next stage's input
// Latency: module latency 63 cycles
delay_register delay_cbrt_raw_passthrough
(
...
);

endmodule
```
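The latency figures in the module comments can be checked with a small critical-path model. This is only a sketch in Python using the cycle counts quoted in the post; the function names are hypothetical, and `restructured_latency` describes an untested alternative ordering (computing `in_cbrt_raw / 3` in parallel with the square, then dividing), not something taken from the attached code.

```python
# Operator latencies in clock cycles, as quoted in the post.
MUL, DIV, ADD = 9, 31, 14

def iteration_latency(mul=MUL, div=DIV, add=ADD):
    """Critical path of one iteration as structured in the module."""
    part_1 = mul              # 2/3 * guess
    part_2 = mul + mul + div  # guess^2, then * 3, then raw / (3 * guess^2)
    return max(part_1, part_2) + add

def part_1_delay(mul=MUL, div=DIV):
    """Cycles part_1_out must wait for part_2_out before the final add."""
    return (mul + mul + div) - mul

def restructured_latency(mul=MUL, div=DIV, add=ADD):
    """Hypothetical variant: (raw * 1/3) runs in parallel with guess^2."""
    part_1 = mul                    # 2/3 * guess
    part_2 = max(mul, mul) + div    # (raw / 3) and guess^2 in parallel, then divide
    return max(part_1, part_2) + add

def total_latency(iterations):
    """Total cycles when iterations are chained back to back."""
    return iterations * iteration_latency()

if __name__ == "__main__":
    print(iteration_latency(), part_1_delay(), total_latency(3))
    print(restructured_latency())
```

The model reproduces the 63-cycle module latency and the 40-cycle delay noted in the comments. The restructured variant changes where rounding occurs, so its accuracy would need to be confirmed in simulation before use.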

1 Solution

Accepted Solutions
Teacher
189 Views
Registered: ‎07-09-2009

## Re: How to accelerate the floating point implementation for cubic root on VU9P

Is Newton an efficient way of doing a cube root?
The book "Hacker's Delight" seems to indicate it's not:
https://doc.lagout.org/security/Hackers%20Delight.pdf

This looks interesting:
https://docplayer.net/25431850-Fpga-implementation-of-a-binary32-floating-point-cube-root.html
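As a rough illustration of a non-Newton approach, a digit-by-digit (shift-and-subtract) cube root needs only shifts, adds, and compares, which tends to map well onto FPGA fabric. The sketch below is a Python model for non-negative integers; it is not taken from the linked documents, and applying it to a binary32 value would additionally need mantissa and exponent handling that is not shown here.

```python
def icbrt(x, bits=32):
    """Floor of the cube root of a non-negative integer x < 2**bits.

    One result bit is resolved per loop pass, from most to least significant.
    """
    y = 0
    # s steps through multiples of 3, starting at the highest one below `bits`.
    for s in range(3 * ((bits - 1) // 3), -1, -3):
        y = 2 * y
        # (y + 1)**3 - y**3 == 3*y*(y + 1) + 1, scaled to the current digit position.
        b = (3 * y * (y + 1) + 1) << s
        if x >= b:
            x -= b
            y += 1
    return y

if __name__ == "__main__":
    for value in (0, 1, 26, 27, 1000, 2**32 - 1):
        print(value, icbrt(value))
```

For a 32-bit input the loop runs 11 times, and each pass is a short chain of integer operations, so a pipelined or unrolled hardware version has a fixed, predictable latency.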