10-13-2011 03:36 AM
I have a simple FIR-Lowpass of 74th order generated with the FIR compiler 5.0.
The model has just one input and one output gateway.
I generated a hw-cosim block for the ML506 board with communication over the network.
Ping-Times: rtt min/avg/max/mdev = 0.120/0.174/0.260/0.045 ms
I then measured the run times for three different numbers of samples, using tic/toc around the sim command in a Matlab script:
Samples   SW-Simulation (full)   SW-Simulation (cached)   HW-Cosimulation
2^14      101 s                  12 s                     95 s
2^16      116 s                  38 s                     147 s
2^18      241 s                  161 s                    377 s
The SW-Simulation(full) is slow because the cache is cleared.
The times include model loading time, simulation initialisation and the actual simulation.
So there are offsets included.
The Offset for the ACE-Setup and FPGA Configuration is 77s.
Plotting these results shows that the hw-cosimulation is not just slower: the gap between hw-cosimulation and sw-simulation also grows with the number of samples. So there will be no point at which the hw-cosimulation becomes beneficial for the user.
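The widening gap can be checked with a little arithmetic on the measured times. The following Python sketch (Python is used only for illustration, and the names are made up here) computes the marginal cost per additional sample between the 2^14 and 2^18 runs, so that constant offsets such as model loading and FPGA configuration cancel out:

```python
# Measured wall-clock times in seconds, copied from the table above.
samples = [2**14, 2**16, 2**18]
times = {
    "sw_full":   [101, 116, 241],
    "sw_cached": [12, 38, 161],
    "hw_cosim":  [95, 147, 377],
}

def marginal_ms_per_sample(t, n):
    """Slope between the first and last measurement, in ms per sample.

    Constant offsets (model loading, FPGA configuration) cancel out.
    """
    return (t[-1] - t[0]) / (n[-1] - n[0]) * 1000.0

slopes = {name: marginal_ms_per_sample(t, samples) for name, t in times.items()}
for name, slope in slopes.items():
    print(f"{name}: {slope:.2f} ms per additional sample")
```

With these numbers the hw-cosim slope comes out at roughly twice the software slopes, which is why the curves diverge instead of crossing.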
I'm well aware that these numbers depend heavily on the model that is used.
That's why I wonder which kinds of models (or Xilinx blockset blocks) cause long sw-simulation times and would therefore actually benefit from hw-cosimulation, despite the communication bottleneck, which seems to be a limiting factor and needs to be investigated separately.
10-13-2011 01:46 PM
This indicates that the amount of traffic between the ML506 board and the SW might be the bottleneck, slowing the co-sim in proportion to the # of samples.
One quick test to validate this theory is to measure the amount of data sent on the network (e.g. with Wireshark) as a function of # of samples.
In my experience (which is also common sense), co-sim gives the largest speedup over sim when a sufficiently large design runs on the board and has little communication overhead (like transactions).
10-13-2011 10:11 PM
Thanks for confirming my assumptions.
Reducing the communication overhead is a good idea, but how can it be done, and what kinds of applications would be suitable for that?
At the moment I'm simply streaming the samples and have a design that mainly consists of a high number of DSP48 Blocks. It seems like these blocks are easily simulated by Matlab. The FIR model is kept simple, because it is intended to be used in a presentation. Yet my hope was that the high number of DSP48 elements would cause more load for the simulator than it actually does.
10-13-2011 10:24 PM
One option is a synthesizable testbench that includes a self-generating source of traffic (e.g. a pseudo-random sample generator). The challenge is always to come up with an "expected value" checker, so that the output is not uploaded but checked on the board as well. If there is a mismatch between expected and actual output, an error counter is incremented.
This way the whole system is running on board and the communication overhead is minimal (only polling for status).
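As a rough software model of that scheme (a sketch only, not HDL and not a System Generator design): a 16-bit LFSR acts as the on-board sample source, a small FIR stands in for the design under test, and a checker increments an error counter. All names, the LFSR polynomial, the seed and the coefficients are assumptions chosen for illustration. The "expected value" here is simply a second copy of the filter, which is the crudest possible checker and would not catch a bug common to both copies.

```python
def lfsr16(seed):
    """16-bit Fibonacci LFSR (taps 16, 14, 13, 11); yields one state per clock."""
    state = seed
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state

def fir(coeffs):
    """Direct-form FIR as a stateful closure; stands in for the filter on the board."""
    taps = [0] * len(coeffs)
    def step(x):
        taps.insert(0, x)   # shift the new sample into the delay line
        taps.pop()          # drop the oldest sample
        return sum(c * t for c, t in zip(coeffs, taps))
    return step

def run_on_board(n_samples, fault_at=None):
    """Model of the self-checking testbench; only the error counter leaves the 'board'."""
    coeffs = [1, 2, 3, 2, 1]            # placeholder coefficients (assumption)
    source = lfsr16(0xACE1)             # arbitrary non-zero seed (assumption)
    dut, reference = fir(coeffs), fir(coeffs)
    errors = 0
    for i in range(n_samples):
        x = next(source)
        y = dut(x)
        if i == fault_at:
            y ^= 1                      # inject a single-bit fault for demonstration
        if y != reference(x):
            errors += 1
    return errors

print(run_on_board(10_000))                # prints 0
print(run_on_board(10_000, fault_at=123))  # prints 1
```

The host side would then only poll the returned counter, instead of streaming every input and output sample over the network.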
I don't know if this scheme is applicable to your design.
10-13-2011 10:35 PM
Yes, that sounds like a good idea.
I need to work out some details but it should be possible to create a simple design with the mentioned properties.
*brain shifting into higher gear now*
Thanks for the suggestion.
10-17-2011 03:26 AM - edited 10-17-2011 03:28 AM
I just made another measurement of simulation times.
This time the FIR-LP was of 219th order, almost three times the size of the filter in the previous test.
Now things drastically changed:
SW-Simulation(full) SW-Simulation(cached) HW-Cosimulation
2^14 Samples: 256s 129s
2^16 Samples 543s 545s 143s
2^18 Samples 2237s 2197s 342s
Now the Simulink based simulations are way slower than the hw-cosimulation.
And again, no convergence or crossing point is to be expected as the number of samples rises.
(So, communication overhead is no bottleneck anymore in this case.)
But the most interesting point is this:
Comparing the times for the hw-cosimulation of the two models shows that the size of the model seems to have no influence at all. Only the number of samples is important.
However, this might be because the two models are very much alike, except for their size.
And the samples are exchanged as a continuous stream.
Other models might behave differently.
10-17-2011 08:12 AM
I assume you don't monitor all the internal signals of the 74th-order and 219th-order FIR, only the interfaces. If that's the case, then the co-sim numbers make sense: the amount of uploaded data remains about the same between the two experiments. The co-sim design on the board runs at a much higher speed than the sw part (MHz vs kHz), so the increase in design size has no noticeable effect on the co-sim runtime.
10-17-2011 11:05 PM
You are right. The model is intended for showing simulation speed enhancement.
Also there's no access to the internal stages of the FIR anyway when using the FIR Compiler 5.0 (or any other).
About the speed, remember that in the first experiment it was the sw-simulation that was a lot faster.
Of course it is natural for a simulation to need more time with increasing complexity, since it is computed in a sequential manner. One interesting point here is that increasing the FIR order by a factor of 3 causes an increase of the simulation time by a factor of about 12.
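To put the scaling side by side, here is a small Python sketch that divides the 219th-order times by the 74th-order times. It uses only the 2^16 and 2^18 rows, because the 2^14 row of the second table is incomplete; the dictionary layout and names are just for illustration, and offsets such as model loading are not subtracted.

```python
# (74th-order time, 219th-order time) in seconds, from the two tables in this thread.
measurements = {
    ("sw_full", 2**16): (116, 543),
    ("sw_full", 2**18): (241, 2237),
    ("sw_cached", 2**16): (38, 545),
    ("sw_cached", 2**18): (161, 2197),
    ("hw_cosim", 2**16): (147, 143),
    ("hw_cosim", 2**18): (377, 342),
}

order_ratio = 219 / 74
ratios = {key: t219 / t74 for key, (t74, t219) in measurements.items()}

print(f"filter order ratio: {order_ratio:.2f}")
for (name, n), r in ratios.items():
    print(f"{name:9s} {n:6d} samples: x{r:.1f}")
```

On these numbers the cached software runs slow down by a factor of roughly 14, the full runs by about 5 to 9, and the hw-cosim ratio stays close to 1.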
Quite apart from the MHz clocking, in hardware it doesn't matter how long a pipeline is: after the initial latency, a result is generated on every clock cycle. It's the parallelism that beats the CPU.
Also, in the hw-cosim setup the DUT probably runs significantly slower than the original clock frequency, due to the synchronisation between the network core and Simulink.
(95s-77s) / 2^14 Samples = ca. 1 ms/Sample
(342s-77s) / 2^18 Samples = ca. 1 ms/Sample
So the hw-cosim (after subtracting the configuration time of 77 s) needs a constant 1 ms per sample or so.
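The same arithmetic as a short Python sketch, for anyone who wants to plug in their own timings (the function and constant names are made up for this example; the 77 s offset is the configuration time stated earlier in the thread):

```python
CONFIG_OFFSET_S = 77  # ACE setup + FPGA configuration, as measured earlier

def ms_per_sample(total_s, n_samples, offset_s=CONFIG_OFFSET_S):
    """Per-sample cost in milliseconds after removing the fixed offset."""
    return (total_s - offset_s) / n_samples * 1000.0

small = ms_per_sample(95, 2**14)    # 74th-order filter, 2^14 samples
large = ms_per_sample(342, 2**18)   # 219th-order filter, 2^18 samples

print(f"{small:.2f} ms/sample, {large:.2f} ms/sample")
print(f"effective rate: about {1000.0 / large:.0f} samples per second")
```

Both values land close to 1 ms per sample, i.e. on the order of a thousand samples per second.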
Slight deviations caused by the ping times could be observed too. In the end there's not much left of the MHz the board is clocked with, although that clock rate is of course still needed to run the Ethernet core properly.
But I think it's nice to see how much information can be derived from the analysis of such a simple design. "Simple" here refers to the effort spent on creating the design with Matlab/Simulink and sysgen, whereas the true hardware structure is big and quite complex.