wojtek
Adventurer
16,646 Views
Registered: ‎12-16-2008

How to write state machine code in VHDL to get optimal results from the XST synthesis tool?

Hello 

 

I'm trying to find out how to write a (Moore) FSM so that XST produces a speed-optimal implementation.

 

I prepared three versions of the state machine to compare the synthesis results and the best achievable timing when implemented in an FPGA:

 

1) FSM (28 states, 12 registered outputs) generated using StateCAD (with the "retain output values" option enabled)

2) the same FSM written manually in one process using a "case" statement; an output is assigned in a state only if it changes its value in that state

3) the same FSM as in 2), but all 12 outputs are assigned in every FSM state

 

All these 3 FSMs were simulated and their behavior was exactly the same.

 

I got the following results from XST (one-hot encoding) and ISE:

                            FSM_1    FSM_2    FSM_3
number of FFs                  38       40       36
number of LUTs                133      106       52
min. clock period [ns]        5.4      5.1      3.5

 

So the FSM with every output assigned in every state gave the best results, although I expected the second version of the FSM to win. I thought that the clock enable input of the FFs could be used in addition to the D and SyncReset inputs, but it looks like XST does not handle it this way (the "Use clock enable, sync set and reset" XST options were all enabled).

What is your experience with FSM coding, especially when the maximum frequency is critical?

 

Regards

Wojtek

 

 

0 Kudos
10 Replies
bassman59
Historian
16,637 Views
Registered: ‎02-25-2008

Are you really doing a single clocked process for your state machine FSM_2?
----------------------------Yes, I do this for a living.
0 Kudos
schetz
Observer
16,624 Views
Registered: ‎11-29-2007

I'm a big fan of a single synchronous process for state machine implementation. I used the two-process style when I was new to VHDL, but there's no way I'd ever go back.
0 Kudos
wojtek
Adventurer
16,607 Views
Registered: ‎12-16-2008

Absolutely yes: FSM_2 and FSM_3 are each written as a single clocked process, while FSM_1 is generated by StateCAD in the two-process style.
0 Kudos
woutersj
Explorer
16,591 Views
Registered: ‎07-27-2009

Hi,

 

If you need the highest speed from your FSM, it is probably worth thinking about the architecture of the FPGA you are targeting.

 

Try to limit the number of states in your design, as the state conditions add to the logic on the multiplexers for the 'outputs'. A classical 4-input LUT can implement any logic function of 4 inputs; anything wider needs additional LUT levels.

Using complex logic such as adders and comparators to calculate outputs and state transitions can slow your design down. Sometimes you can precalculate the result of a comparison, such as a counter reaching zero, one or more cycles in advance and store it in a register, which the main FSM code can then use to reduce the number of logic levels.
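
A minimal sketch of the precalculation idea, under stated assumptions: the entity name precalc_zero, the port names and the 16-bit width are all hypothetical and not taken from this thread. The wide comparison is done next to the counter, one cycle ahead, so the FSM only has to look at a single registered bit.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical down-counter with a precalculated "is zero" flag.
entity precalc_zero is
   port (
      CLK      : in  std_logic;
      LOAD     : in  std_logic;               -- load a new count value
      COUNT_IN : in  unsigned(15 downto 0);   -- assumed width
      AT_ZERO  : out std_logic                -- '1' while the counter holds zero
   );
end entity;

architecture rtl of precalc_zero is
   signal cnt         : unsigned(15 downto 0) := (others => '0');
   signal cnt_is_zero : std_logic := '1';
begin
   process (CLK)
   begin
      if rising_edge(CLK) then
         if LOAD = '1' then
            cnt <= COUNT_IN;
            -- flag describes the value being loaded
            if COUNT_IN = 0 then
               cnt_is_zero <= '1';
            else
               cnt_is_zero <= '0';
            end if;
         elsif cnt_is_zero = '0' then
            cnt <= cnt - 1;
            -- cnt = 1 now means the counter will hold 0 after this edge
            if cnt = 1 then
               cnt_is_zero <= '1';
            else
               cnt_is_zero <= '0';
            end if;
         end if;
      end if;
   end process;

   AT_ZERO <= cnt_is_zero;
end architecture;
```

The FSM would then test AT_ZERO instead of comparing the full counter value inside a state transition.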

Another typical example is a data bus to a memory structure.

 

Instead of writing

 

case state is
  when decide =>
    if (logging and data_matches) then
      D <= data;
      W <= '1';
    end if;

...

 

you can write this as

 

case state is
  when decide =>
    D <= data;   -- unconditional: no mux control needed on D
    if (logging and data_matches) then
      W <= '1';
    end if;

...

 

This transformation puts less control logic on the muxes feeding the D port and will in general result in a smaller and faster design. It is safe as long as D is only consumed when W is asserted. Note that putting the data in a register could also speed things up, as it avoids a potentially long combinational path from an external module to the memory. In the same spirit, speed gains can be achieved by registering the memory outputs at the expense of one additional cycle of latency. The same goes for multipliers: registering the inputs and outputs gives the fastest results on architectures that have embedded multipliers/DSP slices.

 

Try to use shift registers to track reads from memories with a fixed latency instead of explicitly coding states. Shifting the read enable through a shift register is very compact on Xilinx FPGAs, and it offers a robust alternative to a more complex state machine by separating the memory access control from the processing of the memory contents.
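
A minimal sketch of the read-enable shift register, assuming a memory with a fixed read latency. The entity name rd_valid_delay, the port names and the LATENCY generic are hypothetical. The shift register is written without a reset, on the assumption that this makes it easier for the tool to map it onto a compact shift-register primitive.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper: delays RD_EN by LATENCY clocks to mark valid read data.
entity rd_valid_delay is
   generic (
      LATENCY : integer range 2 to 64 := 2   -- assumed fixed read latency in clocks
   );
   port (
      CLK      : in  std_logic;
      RD_EN    : in  std_logic;   -- read request issued to the memory
      RD_VALID : out std_logic    -- '1' when the matching read data is available
   );
end entity;

architecture rtl of rd_valid_delay is
   signal pipe : std_logic_vector(LATENCY - 1 downto 0) := (others => '0');
begin
   process (CLK)
   begin
      if rising_edge(CLK) then
         -- shift the read enable along, one stage per clock
         pipe <= pipe(LATENCY - 2 downto 0) & RD_EN;
      end if;
   end process;

   RD_VALID <= pipe(LATENCY - 1);
end architecture;
```

The logic that processes the memory contents can then simply qualify the data with RD_VALID, with no wait states in the FSM.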

 

FIFO structures can also be used to get more compact FSMs by splitting a single big/error-prone/slow/... FSM into a number of less complex FSMs that have a clean contract interface over FIFOs.

 

Johan

0 Kudos
drjohnsmith
Teacher
16,565 Views
Registered: ‎07-09-2009

I'm a little 'concerned' by your comment

 

3) the same FSM as in 2) but all 12 outputs are assigned in every FSM state

 

 

I'd assume all outputs are assigned for every state, even if it's by a default statement, else you're implying extra registers?

 

My brain could be offline here, it's a little late, but it's a thought!

 

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>
0 Kudos
wojtek
Adventurer
16,554 Views
Registered: ‎12-16-2008

Hi woutersj,

 

I read all your tips and I fully agree with all of them. I try to follow these rules when writing FSM code, and before I posted my question I once more tried to optimize the FSM in the way you describe. Unfortunately there are cases where you have to keep the number of states/conditions, you cannot register the signals and you cannot change the FPGA device. That's why I focused on the FSM coding style, to see whether the FSM speed could benefit from different coding with the same functionality.

 

regards

Wojtek

0 Kudos
wojtek
Adventurer
16,545 Views
Registered: ‎12-16-2008

Hi drjohnsmith,

 

You were very careful despite the late hour.

You pointed it out correctly:

- in FSM_2 an output assignment appears in a state only if that output is supposed to change its value in that state

while 

- in FSM_3  the output assignment appears in every state.

 

I can tell you that both FSMs work in the same way in simulation and after synthesis in XST. I expected FSM_2 to be implemented more efficiently by using both the CE (clock enable) input and the D input of the registers. Unfortunately XST rarely uses the CE input (and when CE is used, D is always hard-wired to either 0 or 1).

Therefore it looks like the FSM_3 coding style is the most efficient for XST synthesis.

I just wanted to know other designers' experience in this matter: I asked my colleagues here and they use the FSM_2 coding style because the FSM code is more intelligible and compact. But it looks like such an FSM uses more resources and is slower than it could be.

 

regards

Wojtek

 

PS. Below I have attached simple example code for the FSM_2 and FSM_3 coding styles:

 

FSM_2

 

   fsm_proc : process (CLK)
   begin
      if rising_edge(CLK) then
         if RESET = '1' then               -- synchronous reset
            FSM_states <= STATE0;
            fsm_out    <= '0';
         else
            case FSM_states is
               when STATE0 =>
                  FSM_states <= STATE1;    -- fsm_out not assigned: it holds its value

               when STATE1 =>
                  FSM_states <= STATE2;
                  fsm_out    <= '1';

               when STATE2 =>
                  FSM_states <= STATE3;

               when STATE3 =>
                  FSM_states <= STATE4;

               when STATE4 =>
                  FSM_states <= STATE0;
                  fsm_out    <= '0';
            end case;
         end if;
      end if;
   end process;

 

 

 

FSM_3

 

   fsm_proc : process (CLK)
   begin
      if rising_edge(CLK) then
         if RESET = '1' then               -- synchronous reset
            FSM_states <= STATE0;
            fsm_out    <= '0';
         else
            case FSM_states is
               when STATE0 =>
                  FSM_states <= STATE1;
                  fsm_out    <= '0';       -- fsm_out assigned in every state

               when STATE1 =>
                  FSM_states <= STATE2;
                  fsm_out    <= '1';

               when STATE2 =>
                  FSM_states <= STATE3;
                  fsm_out    <= '1';

               when STATE3 =>
                  FSM_states <= STATE4;
                  fsm_out    <= '1';

               when STATE4 =>
                  FSM_states <= STATE0;
                  fsm_out    <= '0';
            end case;
         end if;
      end if;
   end process;
 

 

 

 

0 Kudos
woutersj
Explorer
16,495 Views
Registered: ‎07-27-2009

Wojtek,

 

I am a bit puzzled by your comments. Are you targeting an FPGA or a CPLD? For an FPGA there is a sweet spot of operating speed that stays maintainable as your design changes. Sometimes your module will reach a good speed, but as the device fills up it can suddenly fail timing. These are the times when you have to look at the basic processing, and most of the time some reshuffling of logic and use of embedded resources will fix the issues.

I don't think I have ever run out of slices or registers, except in cases where some prototype code used a 'huge' register bank and XST couldn't map it into a shift register or distributed memory. Memory tended to be the limiting factor, but these days the devices seem to have plenty. Note that we try to get an idea of the resource requirements in advance by doing some test synthesis and mapping before we actually populate the boards. Having a bigger and a smaller pin-compatible device also helps to avoid lengthy engineering runs. For limited series it can make more sense to fit a bigger device and spend a few dollars more per board than to spend a week trying to cram in that last bit of logic.

 

 

0 Kudos
williambhunter
Visitor
16,262 Views
Registered: ‎10-25-2009

Makes sense to me. I would not have expected option 1 to be in the running.

Option 2 has states that hold the value of fsm_out, and hence the LUTs must take both the output of the fsm_out FF and the current state as terms for the input to the fsm_out FF.

Option 3 computes the input to the fsm_out FF using only the state, without the fsm_out FF's output, and so the LUTs would be smaller and the logic faster.

 

If the synthesizer was really smart, it might be able to figure out (in option 2) that it could reduce the logic until there was no FF output term to compute the FF input term, but that is asking a lot of the tool. 
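
To make the difference concrete, here is a minimal sketch that spells out the D-input logic the two styles ask for, based on the five-state example posted earlier in the thread. The entity name fsm_out_logic, the type name state_t and the signal names d_hold, d_full, q_hold and q_full are hypothetical. This is an illustration of the argument, not what XST actually emits.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical side-by-side view of the next-value logic for fsm_out.
entity fsm_out_logic is
   port (
      CLK, RESET         : in  std_logic;
      out_hold, out_full : out std_logic
   );
end entity;

architecture rtl of fsm_out_logic is
   type state_t is (STATE0, STATE1, STATE2, STATE3, STATE4);
   signal state          : state_t := STATE0;
   signal q_hold, q_full : std_logic := '0';
   signal d_hold, d_full : std_logic;
begin
   -- Option 2 style: the unassigned states feed the FF output back to its input
   d_hold <= '1' when state = STATE1 else
             '0' when state = STATE4 else
             q_hold;

   -- Option 3 style: a function of the state only, no feedback term
   d_full <= '1' when state = STATE1 or state = STATE2 or state = STATE3 else
             '0';

   process (CLK)
   begin
      if rising_edge(CLK) then
         if RESET = '1' then
            state  <= STATE0;
            q_hold <= '0';
            q_full <= '0';
         else
            q_hold <= d_hold;
            q_full <= d_full;
            case state is
               when STATE0 => state <= STATE1;
               when STATE1 => state <= STATE2;
               when STATE2 => state <= STATE3;
               when STATE3 => state <= STATE4;
               when STATE4 => state <= STATE0;
            end case;
         end if;
      end if;
   end process;

   out_hold <= q_hold;
   out_full <= q_full;
end architecture;
```

Both outputs follow the same waveform after reset. The only difference is that d_hold has q_hold as an extra input term, which with 12 outputs and 28 states can add up to wider functions and more LUT levels.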

0 Kudos
patkionkar
Observer
4,043 Views
Registered: ‎03-02-2009

In FSM_2 it seems you do not have a default assignment for the case statement, so when you do not write an output in a state, the output has to keep its registered value in that state. This style creates hold logic around the flip-flop, whereas if you write the output in every state you get purely combinational logic from the state, which goes into the LUTs. This is consistent with the results you have tabulated.
 You can try one more thing. Write a default assignment before the case statement and write the output in a state only when it differs from the default. That result should match the result of FSM_3.
 Also, FSM_2 may not be good design practice. As you do not write the output in a state, the synthesizer may assume
1. you do not care about the output in this state, and hence output = don't care, in order to optimize the design
2. the same as the previous output
 
And this may vary from synthesizer to synthesizer. Hence you can get different results with different synthesizers or, for that matter, with different versions of the same synthesizer (typically if you compare ISE 6.3 with ISE 7.1 and onwards) ...
 
So either follow FSM_3 or use a default assignment as described above.
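
A minimal sketch of the default-assignment style, applied to the five-state example posted earlier in the thread. It assumes that "default statement" means an assignment placed before the case statement inside the clocked process. The entity name fsm_default and the type name state_t are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity fsm_default is
   port (
      CLK, RESET : in  std_logic;
      fsm_out    : out std_logic
   );
end entity;

architecture rtl of fsm_default is
   type state_t is (STATE0, STATE1, STATE2, STATE3, STATE4);
   signal FSM_states : state_t := STATE0;
begin
   fsm_proc : process (CLK)
   begin
      if rising_edge(CLK) then
         if RESET = '1' then
            FSM_states <= STATE0;
            fsm_out    <= '0';
         else
            fsm_out <= '0';               -- default, overridden below where needed
            case FSM_states is
               when STATE0 =>
                  FSM_states <= STATE1;
               when STATE1 =>
                  FSM_states <= STATE2;
                  fsm_out    <= '1';
               when STATE2 =>
                  FSM_states <= STATE3;
                  fsm_out    <= '1';
               when STATE3 =>
                  FSM_states <= STATE4;
                  fsm_out    <= '1';
               when STATE4 =>
                  FSM_states <= STATE0;   -- default '0' applies
            end case;
         end if;
      end if;
   end process;
end architecture;
```

Because the last assignment in a process wins, every state still defines fsm_out on every clock, as in FSM_3, while the code stays nearly as compact as FSM_2.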
 
--Onkar 
0 Kudos