
Xilinx Employee

RETIMING DESCRIPTION

Retiming is a sequential optimization technique to move registers across combinatorial logic to improve the design performance without affecting the input/output behavior of the circuit. The circuit shown in Figure 1 has a critical path with a 6-input adder. The path highlighted in red is the path that limits the performance of the whole circuit.

Figure 1 : Example of a register-to-register path with 6-input adder logic

The performance of the circuit shown here can be improved by retiming the registers on the adder output into the combinatorial logic of the circuit.

The overall latency of the circuit is 4. Figure 2 shows one way to move the registers so that less logic sits between them. Moving the output registers into the cone of logic is called backward retiming. When this is done, the critical path is reduced to a 2-input adder.

Figure 2 : Example of a register-to-register path with a 2-input adder by applying backward retiming

One other thing to note about the above examples is that the number of registers has changed.

Figure 1 has 9 different register buses. Figure 2 has 12 different register buses. The reason is that when a register is backward retimed from the output of a gate to its inputs, every input of that gate must now have its own register.
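The growth in register count can be reproduced with a small behavioral model. The sketch below is written in Python rather than HDL and does not reproduce the exact circuits of Figures 1 and 2. It assumes a 6-input adder followed by three output registers, and a hypothetical retimed version with a register bank after each level of 2-input adders. All function and variable names are illustrative.

```python
def adder_with_output_pipeline(samples, depth=3):
    """6-input adder followed by `depth` output registers."""
    pipe = [0] * depth
    out = []
    for x in samples:
        out.append(pipe[-1])             # value visible during this cycle
        pipe = [sum(x)] + pipe[:-1]      # clock edge: shift the pipeline
    return out


def retimed_adder_tree(samples):
    """Same function, with the registers moved back between 2-input adders."""
    r1 = [0, 0, 0]   # after level 1: three pair sums
    r2 = [0, 0]      # after level 2: one sum and one pass-through value
    r3 = 0           # after level 3: final sum
    out = []
    for x in samples:
        out.append(r3)
        # All registers update together from the current state.
        n3 = r2[0] + r2[1]
        n2 = [r1[0] + r1[1], r1[2]]
        n1 = [x[0] + x[1], x[2] + x[3], x[4] + x[5]]
        r1, r2, r3 = n1, n2, n3
    return out


samples = [tuple(range(i, i + 6)) for i in range(8)]
original = adder_with_output_pipeline(samples)
retimed = retimed_adder_tree(samples)

original_register_buses = 3            # three registers on the adder output
retimed_register_buses = 3 + 2 + 1     # one bank per adder level
```

Both models produce the same output stream with the same latency, but in this assumed structure the retimed version needs six register buses instead of three, because each adder level now needs a register on every value it passes on.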

There are two types of retiming: backward retiming and forward retiming. Backward retiming removes a register from the output of a gate and creates new registers at the inputs of the same gate. Forward retiming does the exact opposite: it removes the registers from the inputs of a gate and creates a new one at the output.

For backward retiming to work, the combinatorial logic must drive only the register and not fanout to other logic. For forward retiming to work, each input of the gate must be driven by a register with the same control logic.

Figure 3 shows the same circuit with either forward or backward retiming.

Figure 3 : AND gate either being forward retimed or backward retimed
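That the two forms of the AND gate behave identically at the output can be checked with a cycle-by-cycle model. This is a minimal Python sketch, assuming all registers start at 0 and share one clock. The function names and input streams are illustrative only.

```python
def and_with_output_register(a_stream, b_stream, init=0):
    """Register after the AND gate (the result of forward retiming)."""
    q = init
    out = []
    for a, b in zip(a_stream, b_stream):
        out.append(q)     # output seen during this cycle
        q = a & b         # value captured at the clock edge
    return out


def and_with_input_registers(a_stream, b_stream, init_a=0, init_b=0):
    """One register per input (the result of backward retiming)."""
    qa, qb = init_a, init_b
    out = []
    for a, b in zip(a_stream, b_stream):
        out.append(qa & qb)
        qa, qb = a, b
    return out


a = [1, 1, 0, 1, 1, 0]
b = [1, 0, 1, 1, 1, 1]
out_reg = and_with_output_register(a, b)
in_reg = and_with_input_registers(a, b)
```

The two output streams match, while the register count differs: one register in the output form, two in the input form.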

GLOBAL RETIMING VS LOCAL RETIMING

There are two ways to enable automatic retiming in Vivado Synthesis: global and local.

Global retiming works on the full design and moves registers across large combinatorial logic structures based on the timing of the design.

It analyzes all of the logic in the design and moves registers in the worst-case paths in order to make the overall design faster. For this to work, the design must have accurate timing constraints in the .xdc file. Global retiming is enabled with the -retiming switch in synth_design or in the Vivado GUI under synthesis settings. This feature can also be used with the BLOCK_SYNTH feature in synthesis to target specific modules in your design.

Local retiming is when a user specifically tells the tool which logic to perform the retiming on using the retiming_forward/retiming_backward RTL attributes.

Care should be taken when performing local retiming as it is not timing driven and the tool will do exactly what is asked of it.

ANALYZING MESSAGES FROM THE LOG FILE

Figure 4 shows an example where retiming can improve logic levels. The structure has a critical path of 3 logic levels coming from a 37-bit AND gate. The source register is called din1_dly_reg and the destination register is called tmp1_reg, with an extra register after tmp1_reg with 0 logic levels.

This is an ideal path to retime: we can go from one path with 3 logic levels followed by a path with 0 levels, to a path with 2 logic levels followed by a path with 1 or 2 levels.
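The benefit comes from balancing the levels across the two register-to-register paths. The Python sketch below is an idealized illustration, assuming the 3 logic levels can be divided freely between the two paths. As noted above, the second path can end up with 2 levels rather than 1 in practice. The names are illustrative.

```python
TOTAL_LEVELS = 3


def critical_levels(split):
    """The slower of the two paths limits the clock frequency."""
    return max(split)


before = (TOTAL_LEVELS, 0)   # 3 levels, then 0 levels
# Every way of leaving `first` levels ahead of the moved register.
candidates = [(first, TOTAL_LEVELS - first)
              for first in range(TOTAL_LEVELS, -1, -1)]
best = min(candidates, key=critical_levels)
```

Under these assumptions the worst-case depth drops from 3 levels to 2.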

Figure 4 : Circuit that can be backward retimed

The synthesis log file looks similar to the following:

From this log file you can see the reported logic levels before and after retiming, and the names of the new registers that were created. When synthesis creates new registers from retiming, it uses the suffix "bret" for registers that were backward retimed, and "fret" for registers that were forward retimed.

Figure 5 shows a circuit where incompatible register elements make retiming illegal. The structure again has a start register called din1_dly_reg going through a 37-bit AND gate causing 3 levels of logic, and then ending at a register called tmp1_reg. In addition, the AND gate has a fanout to another register highlighted in pink.

Figure 5 : Example of a circuit that can't be retimed

This example cannot be retimed because of the register highlighted in pink. This register has an asynchronous reset, whereas tmp1_reg does not. Because the two registers do not have the same control set, they cannot be backward retimed into the AND gate logic. The log file in this example will show the following:

The log file includes a message about incompatible flip-flops and the before and after logic levels do not change.

Retiming cannot happen in the following situations:

1. Timing Exceptions on a register (multicycle paths, false paths, max delays)

2. Keep type attributes on the register (DONT_TOUCH, MARK_DEBUG)

3. Registers with different control sets

4. Registers driving outputs or being driven by inputs (unless design is marked as out-of-context).
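The four conditions above can be read as a checklist. The following Python sketch is only an illustration of that checklist, not how Vivado implements it. It assumes a control set is the combination of clock, clock enable and reset, and every class, field and signal name in it is hypothetical apart from tmp1_reg.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Register:
    name: str
    control_set: tuple                  # (clock, clock_enable, reset)
    has_timing_exception: bool = False  # multicycle, false path, max delay
    has_keep_attribute: bool = False    # DONT_TOUCH, MARK_DEBUG
    touches_top_level_port: bool = False


def retiming_blockers(reg, peers, out_of_context=False):
    """Return the reasons from the list above that apply to `reg`."""
    reasons = []
    if reg.has_timing_exception:
        reasons.append("timing exception")
    if reg.has_keep_attribute:
        reasons.append("keep-type attribute")
    if any(p.control_set != reg.control_set for p in peers):
        reasons.append("different control sets")
    if reg.touches_top_level_port and not out_of_context:
        reasons.append("drives an output or is driven by an input")
    return reasons


# Modeled on Figure 5: a second register on the same AND gate output
# has an asynchronous reset that tmp1_reg does not have.
tmp1 = Register("tmp1_reg", ("clk", None, None))
pink = Register("pink_reg", ("clk", None, "arst"))   # hypothetical name
blockers = retiming_blockers(tmp1, peers=[pink])
```

Here `blockers` contains the single entry "different control sets", which corresponds to the incompatible flip-flop situation of Figure 5.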

Example where retiming is unable to improve the critical path in a feedback loop:

When a path has the same source and destination register, retiming optimization might not be able to improve logic levels.

For example:

The critical path for the register “dout_reg” is highlighted in red. It goes through a reduction AND operator and ends at the reset pin of the same register.

The reduction AND operator consumes 2 logic levels because of its width, which is 16 bits.

The screen capture below shows how synthesis describes the nature of the critical path.
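The level count follows from packing the reduction into a tree of LUTs. The Python sketch below assumes 6-input LUTs and ideal packing, with no carry chains or other dedicated resources. The function name is illustrative.

```python
def lut_levels(width, lut_inputs=6):
    """Levels of LUTs needed to reduce `width` signals to a single bit."""
    levels = 0
    while width > 1:
        width = -(-width // lut_inputs)   # ceiling division
        levels += 1
    return levels


levels_16 = lut_levels(16)     # 16 -> 3 -> 1
levels_37 = lut_levels(37)     # 37 -> 7 -> 2 -> 1
levels_256 = lut_levels(256)   # 256 -> 43 -> 8 -> 2 -> 1
```

This gives 2 levels for a 16-bit reduction and 3 levels for a 37-bit one, in line with the figures quoted in this article.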

It also mentions the cell names which are part of the critical path.

Thanks to Chaithanya Dudha who is the original author of this article.

Explorer

Does item 3 also cover clock domain crossings, so that they are never retimed under any circumstances? Or will such paths get optimized if the clocks do not have an asynchronous clock group setting?

Xilinx Employee

@tsjorgensen, yes item 3 = different control sets covers clock domain crossings.

A control set is the combination of the clock, clock enable and reset signal. A different clock means a different control set, so such registers will never be retimed.

Best regards

Dries

Visitor

Very very interesting article. Thanks a lot for it.

I would like to share my experience and find out whether any of you have comments on what I could do better.

I have some latency-critical algorithms that I need to implement for Xilinx Virtex Ultrascale+ FPGAs.
My initial idea was to describe the algorithm as a large combinatorial path connected to N register stages at the output, where N is given by a generic parameter.
Then, by implementing the design for different values of N using retiming, I would be able to discover the minimum number of register stages needed to reach timing closure for each algorithm.

My framework to get it implemented in the FPGA is the following:

To reduce I/O timing effects in the analysis, the FPGA top level had a one-bit input, a one-bit output and two clocks, one for the DUT logic and one for the wrapper. A shift register was used to generate all the DUT inputs. All DUT outputs were ORed into a single bit.
As I am not interested in retiming the wrapper, I used the BLOCK_SYNTH.RETIMING attribute on the DUT instance that I wanted to retime.

Using Vivado Synthesis, I did not see any significant improvement in terms of register balancing for the critical path.
However, using Synplify, I saw the number of logic levels and the path delays reduced by half when the first register was added, but no further improvement for N > 1.

I also thought of using Mentor Precision, but even the latest version still does not support retiming for the Ultrascale+ series. (I do not understand why.)

I am not sure if retiming in my case is being limited by:
1) Some mistake in my methodology or use of the tools
2) The tools themselves
3) Retiming bottlenecks related to the algorithm I am investigating
4) More than one of the above

Do you have any comments on the methodology I am using, or on why I do not see any improvement when the number of register stages is higher than one?

I just made the project publicly available here:

https://gitlab.cern.ch/msilvaol/sorting

The DUT: https://gitlab.cern.ch/msilvaol/sorting/blob/master/src/rtl/muon_sorter.vhd
The wrapper: https://gitlab.cern.ch/msilvaol/sorting/blob/master/src/rtl/wrapper.vhd
Scripts to generate Synplify runs: https://gitlab.cern.ch/msilvaol/sorting/blob/master/src/tcl/generate_syn_runs.tcl
Automatically generated report with the results from several implementations: https://gitlab.cern.ch/msilvaol/sorting/blob/master/rpt/report.csv

Thanks

Marcos

Visitor

Hi @mvsoliveira,

Are you using any DSPs in your design?

I tried retiming on an algorithm based on DSPs and it didn't improve timing at all, whereas I got large improvements on an algorithm based on adders (LUTs).

Best regards.

KG

Participant

This feature does not work.  Please do not put it in the documentation if it doesn't work; wait ten or twenty years until you actually get it functional.  I had a combinational design (all LUTs, no feedback) with 129 levels of logic.  There were registers at the inputs and 20 levels of registers at the outputs, to which I applied the BACKWARD_RETIMING directive.  The entire bank of registers was moved back one level, so that I had a critical path of 128 levels of logic in front and one level in back.  No, that did not meet timing.

Xilinx Employee

I'm sorry to hear that. In all cases I came across, we always found a logical explanation why the FFs could not be moved further.

Would you be able to share a testcase so that we can ask development to investigate why it's not retiming more in your case?

I can send you an EZ-move package through which you can share it securely if necessary.

Best regards

Dries

Participant

Dries,

I appreciate the offer.  I already hand-pipelined my design, and I feel like I already spent too much time testing out this feature and I have to move on.  I think if it really worked, everyone would be using it all the time.  I'm not going to try it, but here's a simple example that comes up in my work:  a wide XOR:

```
module wide_xor(input i_clk, input [255:0] i_data, output o_xor);
  wire xor_comb = ^i_data;
  (* retiming_backward = 1 *) reg [0:4] shifter;
  always @(posedge i_clk)
    shifter <= {xor_comb, shifter[0:3]}; // right shift
  assign o_xor = shifter[4];
endmodule
```

This XOR should be ceil(log6(256)) = 4 levels of logic, so I would expect to see input registers, output registers, and no more than 1 level of logic between flip-flops.

Participant

I don't know why I bothered, but since I wrote that code above I decided to run it through the synthesizer.  Results were poor.  The registers are moved around in a way I don't understand.  The critical path went from 4 levels of logic to 3 (instead of 1), and the latency is all bunched up in SRLs.  I didn't want to just add latency; I wanted pipelining between levels of logic.  This is about the simplest example one could come up with, so if it doesn't work for this, I would say it doesn't work at all.

Xilinx Employee

Local retiming using RETIMING_FORWARD/BACKWARD currently supports retiming across only 1 logic level.

We are enhancing the local retiming optimization in the next release (2020.2) to support more than 1 logic level.

The ideal use case is to use global retiming (using the -retiming switch) or retiming using BLOCK_SYNTH for such structures, as these are timing driven and will reduce the logic levels as much as needed.

Any reason why those two options were not used/didn't work in your real design?

Best regards

Dries

Xilinx Employee

PS: I tested with your testcase and, depending on the target frequency, Vivado synthesis retimes to the full extent and keeps only 1 logic level.
