cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
pzahedi
Visitor
Visitor
648 Views
Registered: ‎06-13-2017

Spartan 3E/ISE timing closure issue - Why I don't see timing issue while I was expecting to see errors?

Jump to solution

 

First, before I explain the details, my specific question is: does the async set/reset pins of registers considered as a source for timing path analysis?  

I am analyzing an old design. Looking for a possible reason for a failure that happened rarely, once in a few month or even a year. Then I found following piece of code which I think it should have generated setup time error but the project reports shows no error. 

OPERATION_PENDING_Proc:

process(ref_clk, reg_wr)

begin

    if reg_wr = '1' and op_pending = '0' then

      op_pending <= '1';

    elsif rising_edge(ref_clk) then

      if clear_op_pending = '1' then

        op_pending <= '0';

      end if;

    end if;

end process;

The 'reg_wr' signal comes from different clock domain (sys_clk). Both 'sys_clk' and 'ref_clk' are generated by a single DCM and clock frequencies are 87.5MHz and 50MHz respectively. 

The clock edges are very close and I think it should have generated some setup time errors.

I checked the ucf file as well there is no relevant exemption for the timing on this part or between these clock domains.

The project compiled using ISE 13.4 and the device is a Spartan 3E (XC3S1600E-4FG320C).

I am wondering first, why I don't see any timing closure errors here and second, do you think it is a reliable way to pass a signal between clock domains?

Thanks,

0 Kudos
1 Solution

Accepted Solutions
avrumw
Expert
Expert
528 Views
Registered: ‎01-23-2009

OK, so there are a couple of things at play in ISE/UCF.

First, lets start with a different case (not the one you described). If you have two clock pins entering the FPGA, and you have independent PERIOD constraints on both of them, these clocks are UNRELATED by default (this is the opposite of Vivado) - all "paths" between these two clocks are not timed  - they are the equivalent of false paths. There is a UCF format that allows you to define these two clocks as related clocks (you derive the TIMESPEC of one of the clocks relative to the TIMESPEC of the other), but unless you do that, the clocks are unrelated. Alternatively, you can put a FROM/TO constraint on paths between the clocks if you want to constrain them.

Second, two clocks from the same DCM are RELATED. So your 50MHz and your 87.5MHz clocks are related - all paths between them are timed. Now, looking at the ratios between these two clocks the timing isn't actually that bad between them; from the 87.5MHz to the 50MHz the worst setup path is 2.86ns - the tools should be able to meet timing on a path if there is no (or very little) logic on the path.

BUT...

Two things... This is definitely considered "bad coding" - having combinatorial logic driving the asynchronous preset/clear pin of an input can always cause a glitch. If the signal had come only from the flip-flop on the other domain, this would have been vaguely "better", and it is an odd coding... It says "if the control signal comes in and the signal is 0, preset it to a 1"; how does this differ than "if the control signal comes in set it to a 1" (regardless of whether it was a 0 before) - the gate seems redundant.

Finally, here is the answer to your question, though. By default, ISE does NOT do timing checks on asynchronous preset/clear pins. This has been a pet peeve of mine for years - it is just WRONG. The timing of these arcs is controlled by the parameter "REG_SR_R", which can be enabled in your UCF (the format is "ENABLE = REG_SR_R;"), but for some devices, the default is that REG_SR_R is disabled by default. Again, this is just wrong, and eventually Xilinx was convinced that this was the case. But Xilinx's approach to these things are "If something passed timing in one version of the tool, we can't change it so that it fails in a later version" (even if it is wrong). So, the solution they took was to change the default for certain families. For Virtex devices, REG_SR_R is disabled for any device up to and including Virtex-5 and is enabled for Virtex-6 and the 7 series. For Spartan, it is disabled for all devices before Spartan-6 - therefore, for your Spartan-3E, the REG_SR_R is disabled.

You can see this in the Constraints Guide (UG625) under the "Disable" and "Enable" command.

So, even though the design probably could meet timing on this path, the path is not timed, so there is no constraint on the routing - therefore it is entirely conceivable that this can cause system failures.

And, this isn't the first time - I have seen at least 2 different customer designs that have had intermittent failures that have remained undiagnosed for years that have been traced back to this fact.

So - to all users that are still using ISE with Virtex-5/Spartan-3/3A/3E.... or earlier and are using asynchronous presets/clears YOU MUST PUT

ENABLE = REG_SR_R;

IN YOUR UCF FILES!

Avrum

View solution in original post

Tags (1)
9 Replies
dpaul24
Scholar
Scholar
635 Views
Registered: ‎08-07-2014

@pzahedi ,

The 'reg_wr' signal comes from different clock domain (sys_clk).

In that case why is the reg_wr not synchronized with the ref_clk and then used? Or is it that it must be used asynchronously?

Async signals should be avoided as much as possible.

Having said that I also have doubts on the code snippet you have posted. What about the op_pending signal? Is it sync or async to ref_clk ?

It's valued is being changed sync and async, that is not correct procedure.

------------FPGA enthusiast------------
Consider giving "Kudos" if you like my answer. Please mark my post "Accept as solution" if my answer has solved your problem
Asking for solutions to problems via PM will be ignored.

pzahedi
Visitor
Visitor
625 Views
Registered: ‎06-13-2017

I agree that the reg_wr should have been synchronized first but I am looking for the possible source of problem in this code to justify the reason of the infrequent failure of system. The question is why it didn't rise a timing error at the first place?

the op_pending is a signal synchronous to ref_clk.

Thanks for the reply.

0 Kudos
dpaul24
Scholar
Scholar
615 Views
Registered: ‎08-07-2014

@pzahedi ,

Well the coding needs to be improved, whatever....

The question is why it didn't rise a timing error at the first place?

I guess that edges of the 87.5MHz and 50MHz clocks are near enough such that most of the time it works and rarely it fails. The ISE tool was also not smart enough to catch it I think.

Have you tried the same code with some Vivado version? Xilinx claims to have improved on their algorithms with Vivado, so it might show you the timing error you are looking for.

------------FPGA enthusiast------------
Consider giving "Kudos" if you like my answer. Please mark my post "Accept as solution" if my answer has solved your problem
Asking for solutions to problems via PM will be ignored.

pzahedi
Visitor
Visitor
603 Views
Registered: ‎06-13-2017

@dpaul24 

Good suggestion. The whole project is a little complicated but I'll try to test this part of code in Vivado.

Do you know if any Xilinx document actually mentioned that this kind of async set/reset with different clock domains may have been excluded from timing closure calculations?

0 Kudos
dpaul24
Scholar
Scholar
593 Views
Registered: ‎08-07-2014

@pzahedi ,

Do you know if any Xilinx document actually mentioned that this kind of async set/reset with different clock domains may have been excluded from timing closure calculations?

I would wait for some Xilinx staff member to comment on this!

------------FPGA enthusiast------------
Consider giving "Kudos" if you like my answer. Please mark my post "Accept as solution" if my answer has solved your problem
Asking for solutions to problems via PM will be ignored.

avrumw
Expert
Expert
529 Views
Registered: ‎01-23-2009

OK, so there are a couple of things at play in ISE/UCF.

First, lets start with a different case (not the one you described). If you have two clock pins entering the FPGA, and you have independent PERIOD constraints on both of them, these clocks are UNRELATED by default (this is the opposite of Vivado) - all "paths" between these two clocks are not timed  - they are the equivalent of false paths. There is a UCF format that allows you to define these two clocks as related clocks (you derive the TIMESPEC of one of the clocks relative to the TIMESPEC of the other), but unless you do that, the clocks are unrelated. Alternatively, you can put a FROM/TO constraint on paths between the clocks if you want to constrain them.

Second, two clocks from the same DCM are RELATED. So your 50MHz and your 87.5MHz clocks are related - all paths between them are timed. Now, looking at the ratios between these two clocks the timing isn't actually that bad between them; from the 87.5MHz to the 50MHz the worst setup path is 2.86ns - the tools should be able to meet timing on a path if there is no (or very little) logic on the path.

BUT...

Two things... This is definitely considered "bad coding" - having combinatorial logic driving the asynchronous preset/clear pin of an input can always cause a glitch. If the signal had come only from the flip-flop on the other domain, this would have been vaguely "better", and it is an odd coding... It says "if the control signal comes in and the signal is 0, preset it to a 1"; how does this differ than "if the control signal comes in set it to a 1" (regardless of whether it was a 0 before) - the gate seems redundant.

Finally, here is the answer to your question, though. By default, ISE does NOT do timing checks on asynchronous preset/clear pins. This has been a pet peeve of mine for years - it is just WRONG. The timing of these arcs is controlled by the parameter "REG_SR_R", which can be enabled in your UCF (the format is "ENABLE = REG_SR_R;"), but for some devices, the default is that REG_SR_R is disabled by default. Again, this is just wrong, and eventually Xilinx was convinced that this was the case. But Xilinx's approach to these things are "If something passed timing in one version of the tool, we can't change it so that it fails in a later version" (even if it is wrong). So, the solution they took was to change the default for certain families. For Virtex devices, REG_SR_R is disabled for any device up to and including Virtex-5 and is enabled for Virtex-6 and the 7 series. For Spartan, it is disabled for all devices before Spartan-6 - therefore, for your Spartan-3E, the REG_SR_R is disabled.

You can see this in the Constraints Guide (UG625) under the "Disable" and "Enable" command.

So, even though the design probably could meet timing on this path, the path is not timed, so there is no constraint on the routing - therefore it is entirely conceivable that this can cause system failures.

And, this isn't the first time - I have seen at least 2 different customer designs that have had intermittent failures that have remained undiagnosed for years that have been traced back to this fact.

So - to all users that are still using ISE with Virtex-5/Spartan-3/3A/3E.... or earlier and are using asynchronous presets/clears YOU MUST PUT

ENABLE = REG_SR_R;

IN YOUR UCF FILES!

Avrum

View solution in original post

Tags (1)
pzahedi
Visitor
Visitor
455 Views
Registered: ‎06-13-2017

@avrumw 

Thank you very much for the complete explanation of this issue. this definitely explains the situation we are facing. I'm going to enable the REG_SR_R and rebuild the project. I'll post the results later.

Thanks again.

 

0 Kudos
dpaul24
Scholar
Scholar
451 Views
Registered: ‎08-07-2014

As usual, an excellent explanation @avrumw 

So - to all users that are still using ISE with Virtex-5/Spartan-3/3A/3E.... or earlier and are using asynchronous presets/clears YOU MUST PUT

ENABLE = REG_SR_R;

IN YOUR UCF FILES!

Such answers should go in to the Xilinx "Answer Record" archives!

------------FPGA enthusiast------------
Consider giving "Kudos" if you like my answer. Please mark my post "Accept as solution" if my answer has solved your problem
Asking for solutions to problems via PM will be ignored.

avrumw
Expert
Expert
440 Views
Registered: ‎01-23-2009

@dpaul24 ,

So there is an answer record AR#14644. But the wording in it is so tepid "This can be an important time check for coming out of reset (even asynchronous reset). For example, you might enable this switch if you want all counters and FSMs to begin on the same cycle."

I would change this to read "This can be an important time check to ensure that you don't sell unreliable products to your customers, have years worth of reports of intermittent failures, spend hundreds of hours investigating this and pulling your hair out trying to find this incorrectly disabled timing check". But, hey, that's just me...

(Sorry - as you can see, this really is a pet peeve of mine...)

But all sarcasm aside, this really does not do justice to the scope of this problem.

What we see here is an unusual example - using an asynchronous preset/clear as part of your functional logic is pretty rare (because it is pretty strongly discouraged) - if you do this with REG_SR_R disabled you are definitely open to all kinds of problems. And this doesn't need to be a clock domain crossing path! Even if reg_wr had come from the same clock domain, this is still a potential failure; on the same clock domain the path should be constrained to one clock period (on top of needing to be glitchless) - but with the REG_SR_R disabled, it wouldn't be timed, so there is no guarantee that the routing would actually be under one clock period.

But even in the more common use of asynchronous presets/clears, which is for global resets, the system is unsafe without REG_SR_R enabled. First, we need to realize that "asynchronous preset/clear" do not mean they can be asynchronous to the clock of the flip-flops - take a look at this post that talks about the dangers of releasing asynchronous resets (preset/clear) asynchronously to the clock.

The normal solution to this is to use a reset bridge to have an "asynchronously asserted, synchronously deasserted" reset signal that can be used to drive the preset/clear pins of flip-flops in a given clock domain.

However, with REG_SR_R disabled, the reset bridge is useless - the tool does not time the paths between the output of the reset bridge and the preset/clear pins of the flip-flops. This means that the reset signal can arrive at different flip-flops with uncontrolled timing. Looking at a given clock edge, the reset may arrive before, "very close to", or after the edge of the clock at different flip-flops. This means that 

  • Yes, different state machines can come out of reset on different clock edges (this is the mildest possible result of this)
  • Different bits of the same state machine can come out of reset on different edges
    • This will yield system failures since you can enter a wrong or even illegal state on this clock edge
  • Any flip-flops in the design (state machine or otherwise) can go metastable
    • This too can crash your system

Furthermore this behavior can change pretty drastically with process temperature and voltage (PVT). As an example, in a fast PVT system it is possible that (by chance) the propagation path of this reset to a given state machine is under one clock period for all flip-flops. In the same design, it is possible that at slow PVT, the propagation path to that same state machine is between 1 and 2 clock periods (somewhere near the middle), so the state machine comes out of reset cleanly on the second clock after the deassertion of reset. But, there is some narrow window of PVT where the reset will arrive at the different flip-flops with a delay of one clock period plus or minus a little. This can cause all the bad results I mentioned early.

Furthermore, this can only happens once per reset cycle - when you apply (and deassert) reset, your system may not come out of reset properly. So if you reset once per day, you only have one chance a day to observe the failure, and it will only happen on some boards at very particular PVT conditions. 

All this to say, this bug can get through all your qualification tests without being caught and make it into your customer's hands, who, over the next years will see the system "occasionally" crash on reset. I am not exaggerating when I say that this can be very hard to find and diagnose.

Avrum