10-18-2020 03:57 PM
I am just wondering what is the Target Data Throughput for?
I don't quite understand what it is for. For example, in my case the clock is running at 256 MHz, and each sample for me is an actual symbol. So, should I be setting it to 256 MSPS?
I am wondering what an example would be of a case where the data throughput is less than the clock throughput. The only case I can think of is having multiple samples per symbol, so that every few samples would represent one symbol. Then I can see how the clock rate and the MSPS would differ. But how would that result in a meaningful FFT? Technically we can still compute the FFT, but if the MSPS is not the same as the clock rate, I am not sure what the FFT means in that case. I would think you would ultimately want one symbol every clock, which makes the clock rate and the data rate the same.
Can someone explain the target clock and target data rate?
10-18-2020 05:15 PM - edited 10-18-2020 05:27 PM
There are many configuration options for the FFT core, which are described in the well-written document PG109. In the Wizard for this core, I recall that you can specify target parameters such as throughput, which help guide you and the Wizard through the other configuration options for the core.
However, like you, I expected all configuration options would result in a core that operated in real time. That is, if we are doing an N-point FFT and it takes N cycles of CLK1 to collect the data samples, then it should take N cycles of CLK1 to produce all the spectral samples of the FFT (after some latency). I was surprised to learn that some configurations of the core are not real-time (i.e., the different configurations are used to trade off speed against resource usage), but some configurations are real-time (see Throttle Scheme and Pipelined Streaming I/O in PG109).
In our application, we wanted to do other things with the data samples before sending them to the FFT core. These "other things" sometimes added to the N cycles of CLK1 that it took to collect the data samples, so we needed faster-than-real-time operation of the core. Fortunately, we were able to cross the data samples into the CLK2 domain, where CLK2 is twice the frequency of CLK1. Then we used CLK2 for doing the "other things" with the data samples and for clocking the FFT core.
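A back-of-the-envelope way to see this trade-off (a sketch of my own with made-up cycle counts, not numbers from PG109):

```python
# Sketch: a streaming FFT "keeps up" with its input if it can produce one
# N-point frame in no more cycles than it takes to collect the next frame.
def keeps_up(core_cycles_per_frame, n_points, clk_ratio=1.0):
    """clk_ratio = core clock / sample clock (e.g. 2.0 for the CLK2 trick).

    Collecting N samples takes N sample-clock ticks, i.e. N * clk_ratio
    core-clock cycles. The core keeps up if it finishes within that budget.
    """
    budget = n_points * clk_ratio
    return core_cycles_per_frame <= budget

# Real-time at the same clock: N cycles to process N samples.
print(keeps_up(1024, 1024))                       # True
# "Other things" add 512 extra cycles per frame: falls behind at CLK1.
print(keeps_up(1024 + 512, 1024))                 # False
# Running the FFT at CLK2 = 2*CLK1 restores the margin.
print(keeps_up(1024 + 512, 1024, clk_ratio=2.0))  # True
```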
Finally, see the following link for performance and resource utilization of the FFT core. For some UltraScale devices, I see FMAX values over 512 MHz for the core.
10-19-2020 10:18 AM
Thanks for the document. I would be interested in the Pipelined Streaming I/O mode with no Cyclic Prefix Insertion, because my upstream master provides continuous data with no gaps. According to figure 3.42 in the following document, s_axis_data_tready is always asserted high. But I am curious whether this holds in the general case.
Also, under Throttle Scheme, I couldn't draw a clear line between realtime and non-realtime. In my application I don't care if the core unloads the data with significant delay. My main concern is to make sure that s_axis_data_tready always stays high, since my upstream device continuously provides incoming data with no gaps.
Any idea whether the core can support such a configuration? It seems Pipelined Streaming I/O with no Cyclic Prefix Insertion is the correct mode for me. How about realtime vs. non-realtime? The document does not clearly state a condition under which the core can operate continuously so that I can ignore s_axis_data_tready.
I would assume there has to be a configuration in which the core can accept uninterrupted incoming data without ever needing to wait. Maybe it has to use a lot more resources, but there should be such a configuration. As I mentioned, I don't care about the latency of the core, or what form the output data takes.
Any comment or advice?
10-19-2020 01:56 PM
As the author of an FFT engine, I can comment on why you might want to configure the core to accept one sample every other clock, or even one sample every third clock: every butterfly but the last two in my radix-2 FFT requires 3 multiplies. (The last two butterflies can be accomplished with adds/subtracts alone.) If the FFT engine knows that no more than one data sample will arrive every other clock, then two of those multiplies may be multiplexed onto one multiplier. If one in three, then you can reduce the multiplier usage down to one per stage.
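For anyone wondering where the count of 3 multiplies per butterfly comes from: a complex multiply by the twiddle factor naively needs 4 real multiplies, but a well-known identity gets it down to 3 at the cost of a few extra adds. A minimal Python sketch (function names are mine, purely illustrative of the arithmetic, not of any particular core):

```python
def cmul3(a, b, c, d):
    """Return (re, im) of (a + jb) * (c + jd) using 3 real multiplies.

    Naively: re = a*c - b*d, im = a*d + b*c (4 multiplies).
    With shared products, only 3 multiplies are needed.
    """
    k1 = c * (a + b)   # multiply 1
    k2 = a * (d - c)   # multiply 2
    k3 = b * (c + d)   # multiply 3
    return k1 - k3, k1 + k2   # (a*c - b*d, a*d + b*c)

def butterfly(x0, x1, w):
    """Decimation-in-time radix-2 butterfly: y0 = x0 + w*x1, y1 = x0 - w*x1."""
    tr, ti = cmul3(x1.real, x1.imag, w.real, w.imag)
    t = complex(tr, ti)
    return x0 + t, x0 - t
```

If at most one sample arrives every other clock, those three multiplies can be time-shared across two cycles on fewer physical multipliers, which is exactly the area trade-off described above.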
I'm not sure how Xilinx's FFT handles this, but I have to believe they've done some similar performance optimizations as well.
10-19-2020 04:51 PM - edited 10-20-2020 12:40 AM
I would be interested in Pipelined Streaming I/O with no Cyclic Prefix Insertion mode.
This is what we use.
Also, under Throttle Scheme, I couldn't draw a clear line between realtime and non-realtime?! In my application I don't care if core unloads the data with so much delay. My main concern is to make sure that s_axis_data_tready is always going to stay high.
The discussion called "Controlling the FFT Core" on pages 50-51 has some answers for you. In short, I understand pipelined streaming realtime to mean that when the core is ready to receive data, you must send data to the core, and when the core is ready to send data, you must receive data from the core. Also, while loading/unloading data to/from the core, we cannot insert AXI wait states. However, near the top of page 51 it says, "Note that the core can still insert waitstates when in Realtime mode". So now I'm not sure that the realtime in PG109 is the uninterrupted receipt and output of data that you want.
During non-realtime operation, the core may not always be ready to receive data, and you do not have to send data to the core immediately when the core is ready.
We quickly went to the CLK1-CLK2 method that I described previously, and started using non-realtime processing. So, I did not test whether the realtime core was inserting AXI waitstates during either data input or output. Perhaps you'll do some testing and let us know what you find.
Another thing that might help you is decimation. In our non-realtime use of the core, the ADC was sampling our signal at well above the Nyquist rate. Also, the ADC output samples were sent through a digital filter, which further reduced their bandwidth. As a result, samples coming out of the digital filter were at twice the Nyquist rate (because of the reduced bandwidth of the signal). So we were able to throw out every other sample (i.e., decimate the data) before loading it into the FFT core. Throwing out every other sample means that it takes 2*N cycles of CLK1 to collect the N samples needed for input to the core. Since the core can produce the FFT in about N cycles of CLK1 (plus maybe a few wait states), the core now has plenty of time to process all the data.
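The filter-then-decimate step above can be sketched in a few lines. This is purely illustrative (the function names and taps are placeholders, not a real anti-alias design); the point is only that after filtering to half the bandwidth, keeping every other sample loses nothing, and N decimated samples now take 2*N input clocks to accumulate:

```python
def fir_filter(samples, taps):
    """Simple FIR: y[n] = sum(taps[k] * x[n-k]), with zero-padded history."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            if n - k >= 0:
                acc += t * samples[n - k]
        out.append(acc)
    return out

def decimate_by_2(samples, taps):
    """Low-pass filter to (at most) half the original bandwidth, then keep
    every other sample. Collecting N decimated samples now takes 2*N input
    clocks, which gives the FFT core slack to finish each frame."""
    return fir_filter(samples, taps)[::2]
```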