Adventurer
7,433 Views
Registered: ‎07-24-2016

FSM weird behavior

Greetings,

 

I recently encountered some strange behavior in a design, which I want to share with you for some feedback and help.

 

An asynchronous (unclocked) process produces the 'sig_done' signal. Its input signals are the 'fifo_empty' signal from a FIFO clocked by a 125 MHz clock, and a 'user_sig' which is produced by user logic and is in principle tied to gnd. Code snippet: (does the presence of clk_125 in the sensitivity list really matter or affect things?)

 

done_proc: process (clk_125, fifo_empty, user_sig)
begin
    if fifo_empty = '1' and user_sig = '0' then
        sig_done <= '1';
    else
        sig_done <= '0';
    end if;
end process;

 

 

An FSM clocked by a 200 MHz clock (not related to the aforementioned 125 MHz one) samples this 'sig_done' signal in one of its states, in order to switch to another one. The 'debug_state' signal is used for ILA debugging. The FSM and the done_proc are at different HDL hierarchical levels. Code snippet:

 

FSMproc: process(clk_200)
begin
    if(rising_edge(clk_200))then
        case state is
        when idle =>
            debug_state <= "00001";
            --[...] more cases...
        when stateCheck =>
            debug_state     <= "10001";
            if(sig_done = '1')then
                state       <= stateCheck;
            else
                state       <= stateDone;
            end if;
             --[...] more cases...
        end case;
    end if;
end process;

The design runs fine for many cycles, but at some point it freezes. The FSM does not switch states, even though the FIFO is empty and sig_done is high. It just seems stuck there. ILAs probing the 'debug_state' signal and the 'sig_done' signal confirm this: 'debug_state' is stuck at "10001" while 'sig_done' is high.

 

This design ran fine for months. I just added some unrelated logic and this came up.

 

Any inputs? I keep scratching my head with this one...

 

Cheers

 

 


Accepted Solutions
Guide
12,872 Views
Registered: ‎01-23-2009

Re: FSM weird behavior

This sounds like the classic example of what happens when you have an illegal clock domain crossing.

 

The sig_done is generated (combinatorially) by signals that are related to one clock.

 

The FSM is sampling this signal on a different clock. This is a classic clock crossing failure.

 

On the clk_200 edge where sig_done is sampled, what happens if it is in the process of changing or has just changed? There is a propagation path through combinatorial logic from sig_done to the multiple flip-flops that comprise "state". Since sig_done is not synchronous to state:

   - the propagation may arrive at some bits of "state" but not others

   - some of the bits of "state" may go metastable

 

The net result is that after this clock edge, you may remain in stateCheck, transition to stateDone, go to any state whose encoding is a combination of the stateCheck and stateDone bits, or have any of these bits go metastable.

 

The last two of these outcomes (going to a wrong state or going metastable) can crash your system. A specific case is when the "wrong" state that you end up in is an unused state. We don't know how many states (and hence state bits) there are in "state", but let's say your state machine has 9 states. With binary or Gray coding this needs 4 state bits, which leaves 16 - 9 = 7 unused states. If the illegal transition lands in one of these 7 unused states and there is no way for it to get back into one of the 9 used states, the state machine hangs and no further meaningful state transitions occur.

 

So (and this is a cardinal rule of synchronous design) you may never use an asynchronous input directly in your design (or an input synchronous to an unrelated clock). The sig_done signal must be synchronized to clk_200 before you use it. It is impossible for us to tell you exactly what the proper synchronizer will look like since we don't know anything about fifo_empty; maybe two back-to-back flip-flops on sig_done before it is fed into the state machine are sufficient, and maybe they aren't.

 

Avrum


9 Replies
Adventurer
7,400 Views
Registered: ‎07-24-2016

Re: FSM weird behavior

@avrumw

 

First of all, thanks for the informative answer. I have a question though:

 

Doesn't the "when others => state <= idle;" statement at the end of the FSM guard you against 'state' metastability, or even against 'state' ending up in an unknown state? Shouldn't the FSM jump back to idle in these cases when this line is added at the end? The FSM has this branch at the end of the case, but the problem is still there (the FSM has 21 valid states in total).

 

FSMproc: process(clk_200)
begin
    if(rising_edge(clk_200))then
        case state is
        when idle =>
            debug_state <= "00001";
            --[...] more cases...
        when stateCheck =>
            debug_state     <= "10001";
            if(sig_done = '1')then
                state       <= stateCheck;
            else
                state       <= stateDone;
            end if;
        --[...] more cases...

        when others =>
            state <= idle;
        end case;
    end if;
end process;

 

Thanks again!

Guide
7,384 Views
Registered: ‎01-23-2009

Re: FSM weird behavior

So, first of all, there is nothing that says that it jumps to one of the illegal states instead of one of your other 21 legal states. Once in that legal state, it may stay there because it is waiting for a condition that never occurs.

 

Furthermore, the "when others" only affects the states you coded. The synthesis tool can detect the state machine and re-encode it, and when it does so, the states it creates may not correspond to the ones you created. As an example, say you have a 3-state state machine with "when others => state <= idle;". In RTL (with a 2-bit binary encoding) that maps the unused 4th state to a transition to the idle state. If the synthesis tool maps this to a 3-bit one-hot state machine, then there are now 3 state bits with 8 possible states (3 legal, 5 illegal). None of the 5 illegal states directly relates to your original 4th state, and none of them is required to map back to the idle state.
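
As a hedged illustration of the point above: Vivado synthesis documents an fsm_safe_state attribute that asks the tool to add recovery logic for unreachable encodings. The sketch below assumes Vivado is the synthesis tool; the entity, port, and state names are hypothetical and not taken from the original design. Note that this only addresses recovery from illegal encodings. It does not make an unsynchronized input safe.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical 3-state FSM, for illustration only.
entity fsm_safe_sketch is
    port (
        clk  : in  std_logic;
        go   : in  std_logic;   -- assumed to be synchronous to clk
        busy : out std_logic
    );
end entity;

architecture rtl of fsm_safe_sketch is
    type state_t is (idle, run, finish);
    signal state : state_t := idle;

    -- Assumption: Vivado synthesis. "default_state" requests that illegal
    -- encodings recover to the state named in the "when others" branch.
    attribute fsm_safe_state : string;
    attribute fsm_safe_state of state : signal is "default_state";
begin
    process(clk)
    begin
        if rising_edge(clk) then
            case state is
            when idle =>
                if go = '1' then
                    state <= run;
                end if;
            when run =>
                state <= finish;
            when finish =>
                state <= idle;
            when others =>
                -- Without the attribute above, the tool is free to re-encode
                -- the FSM and this branch need not cover the unused encodings.
                state <= idle;
            end case;
        end if;
    end process;

    busy <= '0' when state = idle else '1';
end architecture;
```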

 

Also, because of the asynchronous input, your "state" signal may be in one state whereas your debug_state reflects a different one (which is likely what you were seeing with the ILA). That's the problem with asynchronous inputs: they break all assumptions about your system, since it no longer behaves in a sane manner.

 

Finally (and I know this is controversial), I don't believe any state machine should have 21 valid states. I almost never code state machines larger than 8 states. It is my opinion (and it is an opinion) that any state machine larger than that should have its functionality broken down - it is almost always possible to code it as 2 independent state machines each with far fewer states, or a state machine and a counter, or a state machine and a couple of status bits... But, again, that's my opinion...

 

Avrum

Adventurer
7,377 Views
Registered: ‎07-24-2016

Re: FSM weird behavior

@avrumw

 

I see. I will follow your guidelines for upcoming designs...! Thanks again.

 

One more thing: when creating an FSM the way I showed in the previous posts, that is, by nesting the "case state is" statement inside an "if(rising_edge(clk))then", don't you get an FSM whose inputs (and outputs?) are sampled by the clk? If this is true, then why is "sig_done" considered asynchronous, and why does it need pipelining before entering the FSM logic? I thought that by coding the FSM this way you infer several FFs for the signals that go in and out of the FSM.

 

So in principle, is there no low-level difference between the aforementioned FSM coding style and the one below? And is one way more advantageous than the other?

 

FSMproc_clk: process(clk_200)
begin
    if(rising_edge(clk_200))then
        cur_state <= nxt_state;
    end if;
end process;

FSMproc_main: process(cur_state)
begin
    case cur_state is
    when idle =>
        nxt_state <= state0;
    when state0 =>
        --[...]
    when others => 
        nxt_state <= idle;
    end case;
end process;

Thanks again for your great information

 

 

 

 

 

Guide
7,361 Views
Registered: ‎01-23-2009

Re: FSM weird behavior

Answering the easier question first...

 

The choice of whether you code an FSM as a one-process state machine (everything inside the "if rising_edge(clk)") or as a two-process state machine (one combinatorial process for calculating the next state and one clocked process for the flip-flops holding the state bits) makes no difference; both end up as the same thing.

 

The first question is more complicated...

 

What you are writing is Register Transfer Level (RTL) code. In RTL, we explicitly describe the registers; in VHDL they are signals that are assigned values in clocked processes. The rest of the code is all the "transfers" between the registers: the expressions on the right-hand side of the assignments in clocked processes, all the expressions in combinatorial processes, and a bunch of other things. These all end up being implemented as LUTs (and a few other non-clocked FPGA resources).

 

Your state machine therefore defines a number of flip-flops on the 200MHz clock and the combinatorial circuitry to implement the "transfers" - in this case, quite literally; the "state" signals are the flip-flops and everything else is the combinatorial circuitry to implement the transfers from one value of "state" to the other values of "state".

 

So the state machine is operating on clk_200; it will make a transition on every rising edge of clk_200. The transition it makes is determined by the current state and the value of the inputs. However, to make a state transition, the combinatorial circuitry needs time; it takes time for signals to propagate through the combinatorial circuitry before reaching the D inputs of the flip-flops, where they are sampled to determine the new state in the "state" bits.

 

The principle behind synchronous design is that we give all these combinatorial paths one clock period of time for this propagation. In that time, all the signals can propagate through the combinatorial circuits and have the correct value ready on the D inputs at the time of the next clock edge. For the paths from "state" to "state" the amount of time we are giving them is exactly 5ns; as long as the propagation through these paths takes less than 5ns, the state machine will operate properly at 200MHz. This is what Static Timing Analysis (STA) does: it verifies that all paths take less than 5ns.
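
The 5ns budget follows directly from the clock period. As a rough sketch (the symbols t_cq, t_logic and t_su are shorthand introduced here for clock-to-Q delay, combinatorial path delay and setup time; clock skew and jitter are ignored):

```latex
T_{clk} = \frac{1}{200\,\text{MHz}} = 5\,\text{ns}, \qquad
t_{cq} + t_{logic} + t_{su} \le T_{clk}
```

STA can only check this inequality for paths whose launching and capturing flip-flops have a known timing relationship, which is exactly what is missing for a signal launched from clk_125.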

 

But, your fifo_empty flag is not synchronous to clk_200, it changes on clk_125, so its changes may occur at any time with respect to clk_200. If the change of fifo_empty comes too close to the rising edge of clk_200, then there won't be enough time for the changes caused by the change in fifo_empty to propagate to the D inputs of the flip-flops before the next clk_200. When this happens, the setup/hold times of the "state" flip-flops may be violated (and hence they can go metastable) or the propagation makes it to some flip-flops by the rising edge of clk_200, but not others, and hence your state machine goes to an incorrect state.

 

In order to avoid this, you must first synchronize fifo_empty to clk_200. While the two back-to-back flip-flops look like pipeline stages, that is not their purpose; they are there to sample the asynchronous signal in such a way that there is only ONE receiver at each stage. The first flip-flop may go metastable, but that metastability should resolve before the second flip-flop samples the signal again. As a result:

 a) the output of the second flip-flop is very unlikely to be metastable,

 b) it only changes on the rising edge of clk_200, and

 c) no state other than the value of the second flip-flop is affected by the potential metastability of the first flip-flop (that's why it's essential that there be only one receiver for the potentially metastable signal; that's what makes the difference between the second flip-flop in the metastability chain and, say, all the state bits in your state machine).

 

Because of this, it can be used in your state machine. Now, like the paths from "state"->"state" the paths from the synchronized fifo_empty -> "state" also have 5ns to propagate through the combinatorial logic. Now if STA says that all paths meet the 5ns requirement, your state machine will function properly.
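
A minimal sketch of the arrangement described above, assuming the crossing signal is sig_done and using a cut-down three-state version of the FSM. The entity name, the done_out port, and the signals sig_done_meta and sig_done_sync are hypothetical; the branch polarity in stateCheck is copied unchanged from the original snippet.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustration only: names and the reduced state set are assumptions.
entity fsm_sync_sketch is
    port (
        clk_200  : in  std_logic;
        sig_done : in  std_logic;  -- generated in the clk_125 domain
        done_out : out std_logic
    );
end entity;

architecture rtl of fsm_sync_sketch is
    type state_t is (idle, stateCheck, stateDone);
    signal state         : state_t   := idle;
    signal sig_done_meta : std_logic := '0';  -- first stage, may go metastable
    signal sig_done_sync : std_logic := '0';  -- second stage, used by the FSM
begin
    -- Two back-to-back flip-flops on clk_200.
    -- sig_done has exactly one receiver (sig_done_meta), and
    -- sig_done_meta has exactly one receiver (sig_done_sync).
    sync_proc: process(clk_200)
    begin
        if rising_edge(clk_200) then
            sig_done_meta <= sig_done;
            sig_done_sync <= sig_done_meta;
        end if;
    end process;

    FSMproc: process(clk_200)
    begin
        if rising_edge(clk_200) then
            case state is
            when idle =>
                state <= stateCheck;
            when stateCheck =>
                -- The FSM only ever looks at the synchronized copy.
                if sig_done_sync = '1' then
                    state <= stateCheck;
                else
                    state <= stateDone;
                end if;
            when stateDone =>
                state <= idle;
            end case;
        end if;
    end process;

    done_out <= '1' when state = stateDone else '0';
end architecture;
```

The synchronizer adds two clk_200 cycles of latency to sig_done, so the FSM reacts slightly later than it would with a direct connection.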

 

Avrum

Advisor
6,790 Views
Registered: ‎10-10-2014

Re: FSM weird behavior

@avrumw, I came across this interesting post. You mentioned:

 

"So (and this is a cardinal rule of synchronous design) you may never use an asynchronous input in your design (or an input synchronous to an unrelated clock). The sig_done signal must be synchronized to clk_200 before you use it. It is impossible for us to tell you exactly what the proper synchronizer will look like since we don't know anything about fifo_empty; maybe 2 back to back flip-flops on sig_done before being fed into the state machine is sufficient, and maybe it isn't."

 

Is my conclusion then correct: for any signal coming from external pins that is asynchronous to the FSM clock (i.e. an input on which you need to trigger and thus change state), a synchronizer must be used? And in such a case, e.g. a trigger signal entering the FPGA, is a 2-FF back-to-back synchronizer sufficient (assuming its pulse width spans more than 3 edges of the 200MHz clock)?

 

I have quite a few books on VHDL, but none of them mentions synchronizing external asynchronous signals in their FSM examples...

 

Is it also correct to state that any async signal entering the FPGA should go through a synchronizer before being used in any clocked process, whether an FSM or just a 'simple' synchronous process?

Teacher
6,750 Views
Registered: ‎03-31-2012

Re: FSM weird behavior

@ronnywebers

 

>> for any signal coming from external pins that is asynchronous to the FSM clock (i.e. an input on which you need to trigger and thus change state), a synchronizer must be used? And in such a case, e.g. a trigger signal entering the FPGA, is a 2-FF back-to-back synchronizer sufficient (assuming its pulse width spans more than 3 edges of the 200MHz clock)?

 

Yes, with the caveat that, depending on conditions, two FF stages may not be enough, and that you need to add the ASYNC_REG property to all of those registers.

 

>> I have quite a few books on VHDL, but none of them mentions synchronizing external asynchronous signals in their FSM examples...

 

Because those books teach you how to design logic, not how to design robust digital ICs that actually work.

 

>> Is it also correct to state that any async signal entering the FPGA should go through a synchronizer before being used in any clocked process, whether an FSM or just a 'simple' synchronous process?

 

Yes.

Advisor
6,698 Views
Registered: ‎10-10-2014

Re: FSM weird behavior

Thanks @muzaffer, I wish someone would write a better book then :-)

 

Can you shed some more light on the caveat you mention:

 

'Yes, with the caveat that, depending on conditions, two FF stages may not be enough, and that you need to add the ASYNC_REG property to all of those registers.'

 

Q1: Do you mean that 3 stages might be needed for better MTBF performance?

Q2: I had a discussion with a colleague who claimed that async_reg is only needed on synchronizer FFs for input signals (entering an FPGA pin), but not on 'internal' CDCs, meaning a CDC between 2 entities running on different internal clocks. He claims that the rising edges of internal flip-flops are so fast that metastability does not occur, and that there is only ever a '0' or '1' state. Is that a correct statement? Can we omit the 'async_reg' attribute for 'internal' CDCs?

Guide
6,678 Views
Registered: ‎01-23-2009

Re: FSM weird behavior

>> Q2: I had a discussion with a colleague who claimed that async_reg is only needed on synchronizer FFs for input signals (entering an FPGA pin), but not on 'internal' CDCs, meaning a CDC between 2 entities running on different internal clocks. He claims that the rising edges of internal flip-flops are so fast that metastability does not occur, and that there is only ever a '0' or '1' state. Is that a correct statement? Can we omit the 'async_reg' attribute for 'internal' CDCs?

 

This is untrue.

 

Metastability is caused by violating the setup/hold time requirement of a flip-flop, which is guaranteed to happen at some point when a signal generated by one clock is sampled by an unrelated clock. In order to reduce the probability of a metastable event crashing your system you need proper CDC techniques; after all, they are Clock Domain Crossing circuits (not just asynchronous input circuits). In an FPGA, the ASYNC_REG property is critical for ensuring that your CDC behaves as well as it can in the presence of potential metastability.

 

While it is true that slow changing input signals can be "worse" in terms of metastability, internal flip-flops are far from immune...
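
For reference, a hedged sketch of a reusable single-bit synchronizer with the ASYNC_REG attribute applied to every stage and a configurable stage count (so a third stage can be added if MTBF requires it). The entity name, the STAGES generic, and the port names are hypothetical; the attribute syntax assumes Vivado synthesis.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustration only: a generic single-bit CDC synchronizer.
entity cdc_sync_sketch is
    generic (
        STAGES : integer range 2 to 8 := 2   -- number of flip-flops in the chain
    );
    port (
        clk_dst  : in  std_logic;  -- destination clock, e.g. clk_200
        async_in : in  std_logic;  -- signal from another clock domain or a pin
        sync_out : out std_logic   -- safe to use in the clk_dst domain
    );
end entity;

architecture rtl of cdc_sync_sketch is
    signal sync_chain : std_logic_vector(STAGES - 1 downto 0) := (others => '0');

    -- Assumption: Vivado. ASYNC_REG marks the chain as a synchronizer so the
    -- tools keep the flip-flops together and do not optimize them away.
    attribute ASYNC_REG : string;
    attribute ASYNC_REG of sync_chain : signal is "TRUE";
begin
    process(clk_dst)
    begin
        if rising_edge(clk_dst) then
            -- Shift the new sample in at bit 0; each bit has a single receiver.
            sync_chain <= sync_chain(STAGES - 2 downto 0) & async_in;
        end if;
    end process;

    sync_out <= sync_chain(STAGES - 1);
end architecture;
```

The same entity would be used for an external pin and for an internal crossing between two unrelated clocks; the attribute is applied in both cases.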

 

Avrum
