
Retiming in Vivado Synthesis

Xilinx Employee


Retiming is a sequential optimization technique to move registers across combinatorial logic to improve the design performance without affecting the input/output behavior of the circuit. The circuit shown in Figure 1 has a critical path with a 6-input adder. The path highlighted in red is the path that limits the performance of the whole circuit.

Figure 1: Example of a register-to-register path with 6-input adder logic

The performance of the circuit shown here can be improved by retiming the registers on the adder output into the combinatorial logic of the circuit.

The overall latency of the circuit is 4. Figure 2 shows one way to move the registers in order to minimize the logic between them. Moving the output registers into the cone of logic is called backward retiming. When this is done, the critical path is reduced to a 2-input adder.

Figure 2: Example of a register-to-register path with a 2-input adder by applying backward retiming

One other thing to note about the above examples is that the number of registers has changed.

Figure 1 had 9 different register buses. Figure 2 has 12 different register buses. The reason is that when a register is moved backward from the output of a gate to its inputs, every input of the gate must now have its own register.

There are two different types of retiming: backward retiming and forward retiming. Backward retiming removes registers from the output of a gate and creates new registers at the inputs of the same gate. Forward retiming does the exact opposite: it removes registers from the inputs of a gate and creates a new register at the output.

For backward retiming to work, the combinatorial logic must drive only the register and not fanout to other logic. For forward retiming to work, each input of the gate must be driven by a register with the same control logic.

Figure 3 shows the same circuit with either forward or backward retiming.

Figure 3: AND gate either being forward retimed or backward retimed
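The equivalence of the two forms of the AND gate can be checked with a small cycle-by-cycle model. The following is a minimal Python sketch, not Vivado code: the function names are made up for illustration, and all registers are assumed to start at 0.

```python
from typing import List


def gate_then_register(a: List[int], b: List[int]) -> List[int]:
    """AND gate followed by a single output register."""
    q = 0                      # output register, assumed to start at 0
    seen = []
    for x, y in zip(a, b):
        seen.append(q)         # value visible at the output during this cycle
        q = x & y              # value captured at the next clock edge
    return seen


def registers_then_gate(a: List[int], b: List[int]) -> List[int]:
    """One register on each gate input, followed by the AND gate."""
    qa, qb = 0, 0              # input registers, assumed to start at 0
    seen = []
    for x, y in zip(a, b):
        seen.append(qa & qb)   # gate now sits after the registers
        qa, qb = x, y          # both inputs are captured at the clock edge
    return seen


a = [1, 1, 0, 1, 0, 1]
b = [1, 0, 1, 1, 1, 1]
print(gate_then_register(a, b))   # [0, 1, 0, 0, 1, 0]
print(registers_then_gate(a, b))  # [0, 1, 0, 0, 1, 0]
```

Going from the first function to the second corresponds to backward retiming, and the reverse direction to forward retiming. The output sequence is the same in both cases, while the register count goes from one to two.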


There are two ways to enable automatic retiming in Vivado Synthesis: global and local.

Global retiming works on the full design and moves registers across large combinatorial logic structures based on the timing of the design.

It analyzes all of the logic in the design and moves registers in the worst-case paths in order to make the overall design faster. For this to work, the design must have accurate timing constraints in the .xdc file. Global retiming is enabled with the -retiming switch in synth_design or in the Vivado GUI under synthesis settings. In addition, this feature can be used with the BLOCK_SYNTH feature in synthesis to target specific modules in your design.

Local retiming is when a user explicitly tells the tool which logic to retime by using the retiming_forward/retiming_backward RTL attributes.

Care should be taken when performing local retiming as it is not timing driven and the tool will do exactly what is asked of it.

For more information on the use of retiming, please refer to the Vivado Design Suite User Guide: Synthesis (UG901).


Figure 4 shows an example where retiming can improve logic levels. The structure has a critical path of 3 logic levels coming from a 37-bit AND gate. The source register is called din1_dly_reg and the destination register is called tmp1_reg, with an extra register after tmp1_reg with 0 logic levels.

This is an ideal path to retime, as we can switch from a path with 3 logic levels followed by a path with 0 levels to a path with 2 logic levels followed by a path with 1 or 2 levels.
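The 3 logic levels quoted for the 37-bit AND gate can be reproduced with a rough estimate. The following is a minimal Python sketch that assumes the reduction is built only from 6-input LUTs arranged as a tree; the function name lut_levels is hypothetical, and the real mapping chosen by synthesis may differ.

```python
def lut_levels(num_inputs: int, lut_size: int = 6) -> int:
    """Estimate the LUT levels needed to reduce num_inputs signals to one.

    Assumes a plain tree of lut_size-input LUTs (an assumption for
    illustration; carry chains or other primitives are ignored).
    """
    if num_inputs < 1:
        raise ValueError("num_inputs must be at least 1")
    levels = 0
    signals = num_inputs
    while signals > 1:
        signals = -(-signals // lut_size)  # ceiling division: LUTs needed at this level
        levels += 1
    return levels


print(lut_levels(37))  # 3
print(lut_levels(36))  # 2
```

Under this assumption, 36 inputs still fit in 2 levels (6 LUTs feeding 1 LUT), and the 37th input is what pushes the path to a third level.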


Figure 4: Circuit that can be backward retimed

The synthesis log file looks similar to the following:









From this log file you can see the reported logic levels before and after retiming, and the names of the new registers that were created. When synthesis creates new registers from retiming, it uses the suffix "bret" for registers that were backward retimed, and "fret" for registers that were forward retimed.

Figure 5 shows a circuit where incompatible register elements make retiming illegal. The structure again has a start register called din1_dly_reg going through a 37-bit AND gate with 3 levels of logic, and then ending at a register called tmp1_reg. In addition, the AND gate has a fanout to another register, highlighted in pink.

Figure 5: Example of a circuit that can't be retimed

This example cannot be retimed because of the register highlighted in pink. This register has an asynchronous reset, whereas tmp1_reg does not. Because the two registers do not have the same control set, they cannot be backward retimed into the AND gate logic. The log file in this example will show the following:







The log file includes a message about incompatible flip-flops, and the logic levels are the same before and after retiming.

Retiming cannot happen in the following situations:

1. Timing exceptions on a register (multicycle paths, false paths, max delays)

2. Keep-type attributes on the register (DONT_TOUCH, MARK_DEBUG)

3. Registers with different control sets

4. Registers driving outputs or being driven by inputs (unless the design is marked as out-of-context)
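The four conditions above can be summarized as a simple check. The following is a minimal Python sketch, not Vivado's implementation: the Register class, its field names, and the register name other_reg are hypothetical and exist only for illustration. A control set is modeled as the combination of clock, clock enable and reset signal.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Register:
    """Hypothetical description of a register; not a Vivado data structure."""
    name: str
    clock: str
    clock_enable: Optional[str] = None
    reset: Optional[str] = None
    has_timing_exception: bool = False    # multicycle path, false path, max delay
    has_keep_attribute: bool = False      # DONT_TOUCH, MARK_DEBUG
    touches_top_level_port: bool = False  # drives an output or is driven by an input

    @property
    def control_set(self) -> Tuple[str, Optional[str], Optional[str]]:
        return (self.clock, self.clock_enable, self.reset)


def retiming_blockers(registers: List[Register],
                      out_of_context: bool = False) -> List[str]:
    """List the reasons why a group of registers could not be retimed together."""
    reasons = []
    for reg in registers:
        if reg.has_timing_exception:
            reasons.append(f"{reg.name}: timing exception")
        if reg.has_keep_attribute:
            reasons.append(f"{reg.name}: keep-type attribute")
        if reg.touches_top_level_port and not out_of_context:
            reasons.append(f"{reg.name}: connected to a top-level port")
    if len({reg.control_set for reg in registers}) > 1:
        reasons.append("registers have different control sets")
    return reasons


tmp1_reg = Register("tmp1_reg", clock="clk")
other_reg = Register("other_reg", clock="clk", reset="rst")  # stands in for the pink register
print(retiming_blockers([tmp1_reg, other_reg]))
# ['registers have different control sets']
```

The usage at the end mirrors the situation in Figure 5, where one register has a reset and tmp1_reg does not.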


Example where retiming is unable to improve the critical path in a feedback loop:

When a path has the same source and destination register, retiming optimization might not be able to improve logic levels.

For example:

The critical path for the register “dout_reg” is highlighted in red. It goes through a reduction AND operator and ends at the reset pin of the same register.

The reduction AND operator consumes 2 logic levels because of its width, which is 16 bits in this example.
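One way to see why a feedback loop resists retiming is the textbook retiming model, in which each logic node gets an integer lag and the register count on an edge from u to v becomes w + lag[v] - lag[u]. The following is a minimal Python sketch of that model, not a description of how Vivado implements retiming; the node names lut_a and lut_b are hypothetical stand-ins for the two LUT levels of the reduction AND.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (source node, destination node, registers on the edge)


def retime(edges: List[Edge], lag: Dict[str, int]) -> List[Edge]:
    """Apply lags to every edge: new count = w + lag[dst] - lag[src]."""
    retimed = []
    for src, dst, regs in edges:
        new_regs = regs + lag.get(dst, 0) - lag.get(src, 0)
        if new_regs < 0:
            raise ValueError(f"illegal retiming on edge {src}->{dst}")
        retimed.append((src, dst, new_regs))
    return retimed


def registers_on_cycle(edges: List[Edge], cycle: List[str]) -> int:
    """Sum the registers along a cycle given as an ordered list of nodes."""
    count = {(src, dst): regs for src, dst, regs in edges}
    total = 0
    for i, src in enumerate(cycle):
        dst = cycle[(i + 1) % len(cycle)]
        total += count[(src, dst)]
    return total


# Two LUT levels in a loop, with the single register on the way back.
loop = [("lut_a", "lut_b", 0), ("lut_b", "lut_a", 1)]
moved = retime(loop, {"lut_b": 1})

print(moved)                                         # [('lut_a', 'lut_b', 1), ('lut_b', 'lut_a', 0)]
print(registers_on_cycle(loop, ["lut_a", "lut_b"]))  # 1
print(registers_on_cycle(moved, ["lut_a", "lut_b"])) # 1
```

The lags cancel around any cycle, so the loop keeps exactly one register wherever it is placed, and a signal leaving that register still has to pass through both LUT levels before it is captured again.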









The screen capture below shows how synthesis describes the nature of the critical path.

It also lists the names of the cells that are part of the critical path.


Thanks to Chaithanya Dudha who is the original author of this article.


Does item 3 also cover clock domain crossings, so that they never get retimed under any circumstances? Or will such paths get optimized if the clocks do not have any asynchronous clock group setting?


Xilinx Employee

@tsjorgensen, yes item 3 = different control sets covers clock domain crossings.

A control set is the combination of the clock, clock enable and reset signal. A different clock means a different control set, so such registers will never be retimed.


Best regards


Visitor mvsoliveira

Very, very interesting article. Thanks a lot for it.

I would like to share my experience and find out whether any of you have comments on something that I could do better.

I have some latency-critical algorithms that I need to implement for Xilinx Virtex Ultrascale+ FPGAs.
My initial idea was to describe the algorithm as a large combinatorial path connected to N register stages at the output, where N is given by a generic parameter.
Then, by implementing the design for different values of N using retiming, I would be able to discover the minimum number of register stages I would need to reach timing closure for each algorithm.

My framework to get it implemented in the FPGA is the following: 

To reduce I/O timing effects in the analysis, the FPGA top level had a one-bit input, a one-bit output and two clocks, one for the DUT logic and one for the wrapper. A shift register was used to generate all the DUT inputs. All DUT outputs were ORed into a single bit.
As I am not interested in retiming the wrapper, I used the BLOCK_SYNTH.RETIMING attribute on the DUT instance that I wanted to retime.


Using Vivado Synthesis I did not see any significant improvement in terms of register balancing for the critical path.
However, using Synplify, I managed to see the number of logic levels and path delays being reduced by half when the first register was added, but no improvements were seen for N > 1.

I also thought of using Mentor Precision, but even the latest version still does not support retiming for the UltraScale+ series. (I do not understand why.)

I am not sure if retiming in my case is being limited by:
1) Some mistake in my methodology or use of the tools
2) The tools themselves
3) Retiming bottlenecks related to the algorithm I am investigating
4) More than one of the above conditions

Do you have any comments on the methodology I am using, or on the reason why I do not see any improvement when the number of register stages is higher than one?

I just made the project publicly available here:



The DUT: https://gitlab.cern.ch/msilvaol/sorting/blob/master/src/rtl/muon_sorter.vhd
The wrapper: https://gitlab.cern.ch/msilvaol/sorting/blob/master/src/rtl/wrapper.vhd
Scripts to generate Synplify runs: https://gitlab.cern.ch/msilvaol/sorting/blob/master/src/tcl/generate_syn_runs.tcl
Automatic-generated report with the results from several implementations: https://gitlab.cern.ch/msilvaol/sorting/blob/master/rpt/report.csv