javitxu
Visitor
Visitor
854 Views
Registered: ‎09-17-2018

Timing issues when implementing register chain

Hello everyone,

I am experiencing timing-related issues which, as far as I can tell, must be related to the implementation strategy or the optimization scope.

Let's start at the beginning: I have a working design which operates at a 50 MHz clock frequency, with no special tool optimizations, just the standard Vivado strategies. I need it to run at 100 MHz, which should not pose any problem as there is very little logic between registers. But then the routing delays show up and make my design fail timing. No problem: the FPGA is far from crowded and I have no latency constraints, so I should be able to create a register chain (no logic in between!) which breaks the route into multiple stages so that the design meets timing. Since I use an IP integrator based design flow, I created a new IP core whose code is as follows:

library ieee;
use ieee.std_logic_1164.all;

entity pipeliner is
  generic (size: integer := 1; delay_cycles: integer := 5);
  port(
    input_signal : in  std_logic_vector(size-1 downto 0);  -- Data input
    clk          : in  std_logic;
    reset_n      : in  std_logic;                          -- Active-low asynchronous reset
    output_signal: out std_logic_vector(size-1 downto 0)   -- Flopped data
  );
end entity pipeliner;

architecture pipeliner_arch of pipeliner is

  -- One vector per stage; indices 0 to delay_cycles give delay_cycles+1 registers
  type register_array is array(0 to delay_cycles) of std_logic_vector(size-1 downto 0);
  signal registers: register_array;
begin  -- architecture begin
  process(clk, reset_n)
  begin
    if(reset_n='0') then
      -- Clear every stage, including the last one, registers(delay_cycles)
      for i in 0 to delay_cycles loop
        registers(i) <= (others=>'0');
      end loop;
    elsif(rising_edge(clk)) then
      -- Shift the data one stage further on each clock edge
      registers(0) <= input_signal;
      for i in 0 to delay_cycles-1 loop
        registers(i+1) <= registers(i);
      end loop;
    end if;
  end process;
  output_signal <= registers(delay_cycles);
end architecture;

That code seems to work in simulation, and the post-synthesis schematic shows a register chain, so I concluded that Vivado inferred just what I wanted.

However, the routing delay is not reduced at all. Instead of distributing the registers along the route so that timing is met in each stage, the placer puts all the registers near the origin, so the "jump" from the last register to the destination is virtually the same as the one I had before. The same happens whether I select "out of context" or "global" output products generation in synthesis, so it does not seem to be a synthesis scope problem. Selecting "post placement optimization" does not solve the problem either: for some reason I don't understand, the tool keeps placing all the registers together in one corner. Therefore I end up with one stage having ~4.5 ns of slack, while the next one fails timing by ~1.2 ns. Is there any command to tell the tool to analyze the possibility of "borrowing" time from one stage for the next? Am I completely overlooking something?
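
As a rough sanity check on those two slack figures, the stage delays can be estimated from the 100 MHz clock period, T = 10 ns. This is only an approximation: it folds setup time, clock skew and clock uncertainty into the "delay" of each stage, and s_1, s_2, d_1, d_2 are labels introduced here for the two slacks and the two stage delays.

```latex
\begin{aligned}
d_1 &\approx T - s_1 = 10 - 4.5 = 5.5\ \text{ns} \\
d_2 &\approx T - s_2 = 10 - (-1.2) = 11.2\ \text{ns} \\
d_1 + d_2 &\approx 16.7\ \text{ns} < 2T = 20\ \text{ns}
\end{aligned}
```

Under that approximation the two stages together fit in two clock periods with about 3.3 ns to spare, so an even split (roughly 8.35 ns per stage, about 1.65 ns of positive slack each) would meet timing if the last register sat closer to the FIFO.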

To illustrate what I'm saying: this is the connection between the last register and the next block (a FIFO, which explains the connection to all the BRAMs), for one bit:

captura1.png

These, however, are all the connections between the components of the "pipeliner" block (only internal connections, no input or output connections):

captura2.png

And these are all the cells in the pipeliner core:

captura3.png

(Note that the images are only intended to give a general idea of where the components are and how concentrated they are in one place.)

I finally circumvented the problem by eliminating the "pipeliner" core and selecting much more aggressive optimization options in Vivado. This solved the problem because the timing failures were few and the slack was not very negative, but I would like to know whether this "register chain" strategy is valid and, if not, why.

Thanks in advance! 

Accepted Solutions
bruce_karaffa
Scholar
Scholar
805 Views
Registered: ‎06-21-2017

Just a thought, but could this problem be tied to the asynchronous reset?  Perhaps the placer is putting all of the pipeline registers where they are because of the reset. Do you even need a reset?  If you need it, have you tried making the reset synchronous with clk?
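
A minimal sketch of what a synchronous-reset version could look like, assuming the same generics and ports as the pipeliner in the question. The entity name pipeliner_sync is hypothetical, and this is not necessarily the exact change the original poster made:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical variant of the pipeliner from the question: the reset is
-- sampled on the rising clock edge instead of acting asynchronously.
entity pipeliner_sync is
  generic (
    size         : positive := 1;
    delay_cycles : positive := 5
  );
  port (
    clk           : in  std_logic;
    reset_n       : in  std_logic;  -- active low, now synchronous to clk
    input_signal  : in  std_logic_vector(size-1 downto 0);
    output_signal : out std_logic_vector(size-1 downto 0)
  );
end entity pipeliner_sync;

architecture rtl of pipeliner_sync is
  type register_array is array (0 to delay_cycles) of std_logic_vector(size-1 downto 0);
  signal registers : register_array;
begin
  -- Only clk is in the sensitivity list: nothing happens between clock edges.
  process (clk)
  begin
    if rising_edge(clk) then
      if reset_n = '0' then
        -- Clear every stage of the chain
        registers <= (others => (others => '0'));
      else
        registers(0) <= input_signal;
        for i in 0 to delay_cycles-1 loop
          registers(i+1) <= registers(i);
        end loop;
      end if;
    end if;
  end process;

  output_signal <= registers(delay_cycles);
end architecture rtl;
```

If no reset is needed at all, the inner `if reset_n = '0'` branch can be dropped so that the process contains only the shift.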

markcurry
Scholar
Scholar
814 Views
Registered: ‎09-16-2009

100 MHz should be easy for what you're trying to do.  What FPGA are you targeting?  Can you attach some (edited down for length) timing reports that show the problem?

Regards,

Mark

bitjockey
Adventurer
Adventurer
618 Views
Registered: ‎03-21-2011

To add to what @bruce_karaffa said, it's counter-intuitive, but even an async reset (areset) should be aligned to CLK.  Because there is a timing constraint from its falling edge to the next CLK edge, you want a full clock period for that.  You can also make it fully synchronous if you like.

Other possibilities.  Did the synthesizer see your delay chain and convert it into an SRL32 pipeline using a single LUT (or several LUTs if > 32) rather than actual, discrete FFs?  You may need some keep/preserve pragmas to prevent this "optimization".  Also, for a 100 MHz design you shouldn't need more than a 1-2 deep pipe to get "across the chip", and a 1-2 deep FF chain is less likely to be mapped to an SRL32, I think.
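
One way to express that kind of pragma in VHDL, as a sketch: Vivado synthesis has a SHREG_EXTRACT attribute that can be set to "no" on a signal to stop it from being mapped to an SRL. The entity name pipeliner_ff is hypothetical, the reset is left out for brevity, and the attribute only keeps the stages as discrete flip-flops; it does not by itself make the placer spread them out.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical reset-less pipeline whose stages must stay discrete flip-flops.
entity pipeliner_ff is
  generic (
    size         : positive := 1;
    delay_cycles : positive := 5
  );
  port (
    clk           : in  std_logic;
    input_signal  : in  std_logic_vector(size-1 downto 0);
    output_signal : out std_logic_vector(size-1 downto 0)
  );
end entity pipeliner_ff;

architecture rtl of pipeliner_ff is
  type register_array is array (0 to delay_cycles) of std_logic_vector(size-1 downto 0);
  signal registers : register_array;

  -- Vivado synthesis attribute: do not extract shift registers (SRLs)
  -- from this signal, keep one flip-flop per stage and bit.
  attribute shreg_extract : string;
  attribute shreg_extract of registers : signal is "no";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      registers(0) <= input_signal;
      for i in 0 to delay_cycles-1 loop
        registers(i+1) <= registers(i);
      end loop;
    end if;
  end process;

  output_signal <= registers(delay_cycles);
end architecture rtl;
```

The post-synthesis schematic should then show only FD-type cells for this block; if an SRL still appears, the attribute is probably not reaching the right signal.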

Lastly, on coding style, you shouldn't need a for loop if you use the 'RANGE or 'LENGTH attributes of the vector. Assignments such as

registers <= (others => (others => '0'));
...
registers(registers'LENGTH-1 downto 1) <= registers(registers'LENGTH-2 downto 0);

are a very familiar design pattern for shifting generic length things. For the first one you may need (registers'RANGE => (others => '0')); sometimes the compiler is uptight about "non-static" ranges even if it is logically unambiguous.
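
Applied to the array from the question, which is declared with an ascending range (0 to delay_cycles), the slices have to use "to" rather than "downto", because a slice must run in the same direction as the array it is taken from. A sketch under that assumption, with a hypothetical entity name and a synchronous reset:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical loop-free version of the shift, using 'LOW and 'HIGH
-- so that nothing depends on the literal bounds of the array.
entity pipeliner_slice is
  generic (
    size         : positive := 1;
    delay_cycles : positive := 5
  );
  port (
    clk           : in  std_logic;
    reset_n       : in  std_logic;
    input_signal  : in  std_logic_vector(size-1 downto 0);
    output_signal : out std_logic_vector(size-1 downto 0)
  );
end entity pipeliner_slice;

architecture rtl of pipeliner_slice is
  -- Ascending range, as in the original post
  type register_array is array (0 to delay_cycles) of std_logic_vector(size-1 downto 0);
  signal registers : register_array;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset_n = '0' then
        -- Nested aggregate clears every stage without a loop
        registers <= (others => (others => '0'));
      else
        -- New data enters at the lowest index ...
        registers(registers'low) <= input_signal;
        -- ... and every other stage takes the value of its predecessor.
        registers(registers'low + 1 to registers'high) <=
          registers(registers'low to registers'high - 1);
      end if;
    end if;
  end process;

  output_signal <= registers(registers'high);
end architecture rtl;
```

With a descending declaration (delay_cycles downto 0) the same idea works with the "downto" slices shown above.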

javitxu
Visitor
Visitor
517 Views
Registered: ‎09-17-2018

Well, using a synchronous reset seems to have solved the issue in an indirect way: even though all the registers are still in one corner of the FPGA, the placer has now been able to put the last register (for each bit of the data bus) close enough to the next cell, so no timing errors appear. This is true even without the aggressive optimization strategies I had to use in the first place, so thank you!

It may be that, as the placer is now able to solve all timing issues, it does not even try to "spread" the registers across the FPGA.

javitxu
Visitor
Visitor
511 Views
Registered: ‎09-17-2018

Thanks for this! The sync reset which @bruce_karaffa suggested seems to have solved the issue in an indirect way, but the SRL32 hint was a good one: opening the schematic shows me that the synthesizer has indeed inferred an SRL16 followed by 3 "real" registers (which I find somewhat cumbersome, but I'm sure there is a good reason). Obviously, once all but three "stages" are implemented by a single cell, that cell cannot be "spread", but, to keep the mystery alive, the placer still keeps all the LUTs and registers in one corner of the FPGA. As I said earlier, now that timing is met, it may be that the placer no longer tries to "spread" the registers across the FPGA.

Also, thanks for the coding style tip!
