cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
11,259 Views
Registered: ‎11-02-2011

State machine with large number of states and outputs

Hi all, 

 

I am writing a VHDL entity which has a large number of states to count through the bytes in an input stream (presently about 400, but moving up to potentially thousands). I write this via a code generator that generates reasonably nice code.

 

The states are usually chained -- i.e. there are only very few "branches", and the rest is just a linear progression through the tree. I also want to use the "current state" information in other parts of the code. To me, this sounds like what one-hot would be good at, and I am trusting the synthesiser to do its job there. 

 

All is well and good and everything seems to work well apart from a serious flaw in isim -- it only seems to handle 256 states! I know this, because it wraps around -- when I try to move to state 256 (0 indexed), I see it go back to state 0. 

 

Any ideas? VHDL seems to imply that we should be able to have up to 2^32 items in an enumerated type?

 

Very weird. 

 

Dave. 

 

0 Kudos
16 Replies
eilert
Teacher
Teacher
11,258 Views
Registered: ‎08-14-2007

Hi,

not sure about what you observed in ISIM.

Maybe you could verify this with some other simulator?

 

But having a FSM with so many states just for counting bits seems a little weird to me.

Especially when you do it with OneHot encoding. This will waste a good deal of FFs.

 

How about putting the counting part in a specialized counter (which can be adapted by generics to your specific needs as you mentioned that the number of bits will change.)

Then your FSM remains quirte small and you just have a state where you are enabling the counter and waiting for the ripple output, signalling the end of the input stream. This won't prevent you from using the state code or the counter value anywhere else in your design. It can even help to make things easier. If you need special trigger points during the input count, these can be generated by the counter too.

 

Have a nice synthesis

  Eilert

 

0 Kudos
bassman59
Historian
Historian
11,245 Views
Registered: ‎02-25-2008


@xilinx20@snowdon.id.au wrote:

Hi all, 

 

I am writing a VHDL entity which has a large number of states to count through the bytes in an input stream (presently about 400, but moving up to potentially thousands). I write this via a code generator that generates reasonably nice code.

 

The states are usually chained -- i.e. there are only very few "branches", and the rest is just a linear progression through the tree. I also want to use the "current state" information in other parts of the code. To me, this sounds like what one-hot would be good at, and I am trusting the synthesiser to do its job there. 

 

All is well and good and everything seems to work well apart from a serious flaw in isim -- it only seems to handle 256 states! I know this, because it wraps around -- when I try to move to state 256 (0 indexed), I see it go back to state 0. 

 

Any ideas? VHDL seems to imply that we should be able to have up to 2^32 items in an enumerated type?

 

Very weird. 

 

Dave. 

 


What you're describing might be simpler to deal with using a counter to index through the states that are a linear progression. Basically, you initialize the counter in state X, then spin in state Y waiting for the counter to expire.

 

Pro Tip: I've noticed that if you embed a counter in a state machine, then XST implements it as an adder. So it's best to implement the counter outside of the state machine, and have the state machine assert a counter-load or a counter enable. This will ensure that the counter gets synthesized as an up- or down-counter, which will be faster than an adder.

----------------------------Yes, I do this for a living.
0 Kudos
11,242 Views
Registered: ‎11-02-2011

G'Day, 

 

Thanks for the reply!

 

I haven't actually used another simulator -- are there any other free ones? (I suppose asking whether ModelSim is worth the $ is a little off topic... ). 

 

The beauty of one-hot is that I don't need to stick the counter value through another level of logic in order to generate an "enable" signal. For about half of the states that are in the state machine, some register or another will need to be enabled to load a value from a common data source. My understanding is that to do that would require another level of logic on the counter (which will actually require more than one LUT, since the counter would need to be wider than 6 bits). While the one-hot state signal can be fed directly into the register enable signal (after going through a LUT to combine with a few other control bits). Is my thinking off here? 

 

Since I was generating the code, I changed the generator to, in place of my case-when statement, do something like:

 

-- in a clocked process...

 

if state(mystate1) then

    state(mystate1) <= '0'; 

    state(mystate2) <= '1';

end if;

 

if state(mystate2) then

    state(mystate2) <= '0'; 

    if somecondition='1' then

        state(mystate3) <= '1'; 

    else

        state(mystate4) <= '1'; 

    end if; 

end if; 

 

etc. etc. That simulates fine (although I haven't yet touched on synthesis! ) . 

 

Presumably by doing this, I'm throwing away any chance of XST doing any state-machine specific optimisations. So it'd be better for it to be cogniscent of the state machine, and 256 seems like a fairly arbitrary limit!

 

Dave. 

0 Kudos
eteam00
Professor
Professor
11,237 Views
Registered: ‎07-21-2009

My understanding is that to do that would require another level of logic on the counter (which will actually require more than one LUT, since the counter would need to be wider than 6 bits). While the one-hot state signal can be fed directly into the register enable signal (after going through a LUT to combine with a few other control bits). Is my thinking off here?

 

Yes, your thinking is...  off...

 

The cost of one-hot encoding for a gazillion-state FSM is way WAY more than the cost of another LUT per counter bit.  Does this make sense?

 

Not to mention all the time and effort you are investing in the one-hot gazillion-state design.

 

I assume this is a recreational project, and conserving development time or FPGA resources is not a high priority.  It sounds like you are learning by experimenting, and I applaud you for doing this.  There is precious little time for experimentation in the throes of a high-pressure project.

 

-- Bob Elkind

SIGNATURE:
README for newbies is here: http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

Summary:
1. Read the manual or user guide. Have you read the manual? Can you find the manual?
2. Search the forums (and search the web) for similar topics.
3. Do not post the same question on multiple forums.
4. Do not post a new topic or question on someone else's thread, start a new thread!
5. Students: Copying code is not the same as learning to design.
6 "It does not work" is not a question which can be answered. Provide useful details (with webpage, datasheet links, please).
7. You are not charged extra fees for comments in your code.
8. I am not paid for forum posts. If I write a good post, then I have been good for nothing.
0 Kudos
11,233 Views
Registered: ‎11-02-2011

Many thanks for the reply... It's great to be able to chat with the experts! :-). 

 

It is actually a professional project, but with a rather different optimisation to the one I *think* you're considering. The cost of the FPGA is not an issue, and the projects are small (relative to the size of the FPGA). Therefore power, cost, area and resource optimisation are less of a priority than our real optimisation goal -- performance. The real question is: how do we get the most out of each clock cycle? With that optimisation in mind, do you still think the extra depth in the combinatorial path is more optimal than the large number of registers used?

 

Dave. 

0 Kudos
eteam00
Professor
Professor
11,230 Views
Registered: ‎07-21-2009

The real question is: how do we get the most out of each clock cycle? With that optimisation in mind, do you still think the extra depth in the combinatorial path is more optimal than the large number of registers used?

 

It always depends on ...

 

There are some very nifty and devious tricks for speeding up counters.  And we are the grizzled designers who are most proud of our hacks optimisations.  In a professional project, there is usually a fixed clock frequency target.  Meeting the target is critical, while exceeding the target means you have all your other necessary work completed.

 

Do you have all the speed you need, or not?  There's an abundance of speed-freak digital talent in these forums, but you have to speak up and you have to be specific.

 

-- Bob Elkind

SIGNATURE:
README for newbies is here: http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

Summary:
1. Read the manual or user guide. Have you read the manual? Can you find the manual?
2. Search the forums (and search the web) for similar topics.
3. Do not post the same question on multiple forums.
4. Do not post a new topic or question on someone else's thread, start a new thread!
5. Students: Copying code is not the same as learning to design.
6 "It does not work" is not a question which can be answered. Provide useful details (with webpage, datasheet links, please).
7. You are not charged extra fees for comments in your code.
8. I am not paid for forum posts. If I write a good post, then I have been good for nothing.
0 Kudos
rcingham
Teacher
Teacher
11,223 Views
Registered: ‎09-09-2010

XST is generally good at picking a good-enough state coding given the number of states and the clock period constraint.

BTW, bassman59's advice regarding separate counter processes is spot-on.

------------------------------------------
"If it don't work in simulation, it won't work on the board."
0 Kudos
bassman59
Historian
Historian
11,214 Views
Registered: ‎02-25-2008


@xilinx20@snowdon.id.au wrote:

Many thanks for the reply... It's great to be able to chat with the experts! :-). 

 

It is actually a professional project, but with a rather different optimisation to the one I *think* you're considering. The cost of the FPGA is not an issue, and the projects are small (relative to the size of the FPGA). Therefore power, cost, area and resource optimisation are less of a priority than our real optimisation goal -- performance. The real question is: how do we get the most out of each clock cycle? With that optimisation in mind, do you still think the extra depth in the combinatorial path is more optimal than the large number of registers used?

 

Dave. 


Actually, I think I'm in your boat -- the size/cost of the FPGA is a detail compared to the rest of the system cost. But having said that --

 

I think having a smaller number of states makes things faster, regardless of one-hot or other encoding. Now a trick you can do to speed things up further is to not use a combinatorial count = 0 test in your state machine to see if you should advance to the next state. That's a big comparator if you've got a long time to wait with a fast clock speed. Instead, pipeline it, and register the comparison result. Make a flag:

 

counter : process (clk) is

begin

    if rising_edge(clk) then

        if reset = '1' then

            count <= 0;

            countIsZero <= '1';

        else

            if (load = '1') then

                 count <= init;

            elsif count /= 0 then

                 count <= count - 1;

            end if;

 

            if (count = 0) then

                countIsZero <= '1';

            else

                 countIsZero <= '0';

            end if;

        end if; --- is Reset?

    end if; -- clock

end process counter;

 

Now you'll probably have to adjust things, because the countIsZero flag gets asserted on the clock tick after the counter actually hits zero. But that's easily handled by initializing the counter to one count less than you need. Also, you have to wait for the countIsZero flag to clear after the counter is initialized, which takes two clock ticks after the load is accomplished. The count value goes non-zero on the tick after the load, and then the non-zero count is recognized and clears the flag another clock tick later. You might want to clear the flag with the assertion of load, or something.

 

Anyways, the point is that using the countIsZero registered flag as an input to your state machine is going to be more efficient than using the combinatorial count = 0 in the relevant states.

----------------------------Yes, I do this for a living.
0 Kudos
11,207 Views
Registered: ‎11-02-2011

So, I did think about this... But the problem is that I use enable signals generated by a large number of the states. Maybe half of the ~1000 states cause an independent register to be loaded, or are used as a trigger. Given that, I would end up with ~500 registers anyway, but also a heck of a lot of comparators. I guess the comparators can be pipelined, but we'd also use an extra 500 LUTs to do the comparison (given it'd be a 10 bit counter for 1024 states). I guess my thinking was that it's better to just cop the 1000 single-bit registers to encode the state than it is to use a small number of registers in a counter, and then a large number of registers to decode the state into the enable signals. 

 

All this discussion is a whole lot of food for thought. Great to hear some real depth of knowledge being passed around. 

 

Cheers, 

 

Dave. 

 

 

0 Kudos
13,229 Views
Registered: ‎11-02-2011

We are in a situation where we can never be fast enough. While the clock frequency is fixed, we need to squeeze as much as possible into as few clock cycles as possible -- reducing the pipeline depth is key.

Moreover, this component is a piece of infrastructure that will be re-used, so reducing the pipeline depth now will lead to shorter pipelines in many projects to come!
0 Kudos
eteam00
Professor
Professor
13,227 Views
Registered: ‎07-21-2009

Moreover, this component is a piece of infrastructure that will be re-used, so reducing the pipeline depth now will lead to shorter pipelines in many projects to come!

 

In other words, it is critical that this design is maintainable (and readable).  The notion of maintaining a state machine with 1000+ states is... worrisome.  It reminds me of a string of Christmas lights a mile long, where a single bad bulb causes all of the bulbs to go dark.

 

Especially in state machine, there are many many ways to reduce pipeline depth.

 

-- Bob Elkind

SIGNATURE:
README for newbies is here: http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

Summary:
1. Read the manual or user guide. Have you read the manual? Can you find the manual?
2. Search the forums (and search the web) for similar topics.
3. Do not post the same question on multiple forums.
4. Do not post a new topic or question on someone else's thread, start a new thread!
5. Students: Copying code is not the same as learning to design.
6 "It does not work" is not a question which can be answered. Provide useful details (with webpage, datasheet links, please).
7. You are not charged extra fees for comments in your code.
8. I am not paid for forum posts. If I write a good post, then I have been good for nothing.
0 Kudos
13,225 Views
Registered: ‎11-02-2011

Fair point!

What (I think!) absolves me of that problem is that the VHDL is generated (using a python program, based on some input data). It means that I don't need to maintain the state machine code -- future projects will use the same generator program, so it's important to get the code which it generates correct.
0 Kudos
rcingham
Teacher
Teacher
13,218 Views
Registered: ‎09-09-2010

Let us hope that the next version of Xilinx Synthesis tool infers the same sort of state machine logic from your generated code that the current one does, or you may have to rewrite your Python programme...

------------------------------------------
"If it don't work in simulation, it won't work on the board."
0 Kudos
arquerc
Observer
Observer
13,162 Views
Registered: ‎04-29-2010

Hi bassman59,

 

I know it has been some time but I was reading through this post and I wanted to ask you something. You proposed some code to avoid using combinatorial logic to know when count = 0, however, I don't see how you could ever do that. I mean, doesn't your code do pretty much the same thing? basically you test with every clock rising edge weather count = 0 and if so you rise countIsZero, won't this be implemented as a comparator? (either with gates or LUTs)

 

The input value of the counter can change on every rising edge of clk and the output has to be ready when the next rising edge comes, so I don't see how this will make things faster, if you could enlighten me I would be very thankfull.

 

The only way I see for doing this is by breaking up the comparison in various intermediate values..

 

Thanks!

0 Kudos
bassman59
Historian
Historian
13,155 Views
Registered: ‎02-25-2008


@arquerc wrote:

Hi bassman59,

 

I know it has been some time but I was reading through this post and I wanted to ask you something. You proposed some code to avoid using combinatorial logic to know when count = 0, however, I don't see how you could ever do that. I mean, doesn't your code do pretty much the same thing? basically you test with every clock rising edge weather count = 0 and if so you rise countIsZero, won't this be implemented as a comparator? (either with gates or LUTs)

 

The input value of the counter can change on every rising edge of clk and the output has to be ready when the next rising edge comes, so I don't see how this will make things faster, if you could enlighten me I would be very thankfull.

 

The only way I see for doing this is by breaking up the comparison in various intermediate values..


 

The difference is that when you test for count = 0 in the state machine, your comparison logic is more complicated than in a process (or the simple expression) dedicated to that comparison.

 

Remember what happens in a state machine. The machine has a bunch of outputs, whose outputs depend on what state you're in and whatever logic is in each state.

 

Assume you have a machine with three states, and an output bit is controlled in each state (assume a synchronous process, clock and such not shown for clarity):

 

    Decoder : case state is

        when S_FOO :

            outbit <= inbit and gate;

        when S_BAR :

            if count = 0 then

                outbit <= qbit;

            end if;

        when S_BLETCH : 

           outbit <= fetzer and fitzer;

    end case Decoder;

 

Consider the hardware required to build the "D" input to the register outbit. After trying to do so for a few minutes and describe it, I gave up. But suffice it to say, it's pretty complicated. And if the counter is, say, 32 bits wide, count = 0 is a pretty wide combinatorial thing, and it's combined combinatorially with all of the other assignments from the other states. That's why it can be so slow.

 

Now separate out the comparator and have it set or clear a bit depending on the count:

 

    if (count = 1) then

        countIsZero <= '1';

    else

        countIsZero <= '0';

    end if;

 

So instead of having a 32-bit compare wrapped in with all of the other logic that is combined to drive the D input of outbit, you have a simple 1-bit compare (countIsZero = '1'). So you've significantly reduced the amount of logic in the entire tree of logic used to decide the nextstate of outbit. Of course it comes at the cost of a pipeline delay, which is why if you want the state machine to move ahead when count = 0, you have to test for count = 1 and assert countIsZero then (this assumes that indeed on the next clock, count will decrement to zero).

 

got it?

----------------------------Yes, I do this for a living.
dawatson
Observer
Observer
8,518 Views
Registered: ‎05-15-2012

Hey guys

 

I know this thread is old, but I just wanted to say how informative it was! I recently had an issue with a state machine with 18 states (trying to make serial output of different values easy) and found that only 16 states where being implemented. Some states were optimsed away or not implemented - but worked in simulation!

 

@bassman59 I just used your trick in comparing values outside state machines. I never thought about the implications of doing such a comparison within a state machine until I found this thread. 

 

Every day's a school day!

 

Cheers

 

David

0 Kudos