05-04-2020 01:08 AM
I have a 1 GHz ADC signal that enters the design at 250 MHz with 4 samples per channel per clock and 2 channels. I want to measure the power at a few frequencies in the 45 MHz band. This must run in real time, but latency is not an issue. The preferred solution is to multiply every sample by a sine and a cosine value whose phase advances according to the specific frequency I want to analyze. However, that means I need 8 sine and 8 cosine values at 250 MHz for each frequency I want to analyze. The DDS can only provide a single value per clock (although it might be able to give both the sine and cosine at the same time), and does it really work at 250 MHz? If I place the table in block or distributed RAM, I cannot read out more than one value per clock cycle. A more promising solution might be to generate a sine table up to pi / 2 in Verilog code, as this should let me reference it from several blocks at the same time. However, will this work with a table of 64k x 14 entries? I'm using a Kintex-7 (KC705). Resource usage is not a big issue since I presently only use 10% of the available LUTs and FFs.
I realize I could skip 3 out of 4 samples and use the average of the channels, but this feels like a bad solution with poorer performance. I also would need to duplicate the block RAM or DDS if I want to analyze several frequencies at the same time.
So, is there a way to make multiple references in the same clock cycle to a sine and cosine table at high speed, like at 250 MHz?
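To make the requirement concrete, here is a minimal Python model (not HDL) of how one phase accumulator could feed 4 lanes per clock: the base phase advances by 4 steps each clock and lane k adds k steps. The 32-bit phase width and the names `tuning_word` and `lane_phases` are assumptions for illustration, not anything taken from the DDS IP.

```python
PHASE_BITS = 32                      # assumed accumulator width
MASK = (1 << PHASE_BITS) - 1
LANES = 4                            # samples per channel per 250 MHz clock

def tuning_word(f_hz, fs_hz):
    """Phase increment per sample for frequency f_hz at sample rate fs_hz."""
    return round(f_hz / fs_hz * (1 << PHASE_BITS)) & MASK

def lane_phases(step, clocks):
    """Return, for each clock, the LANES phases needed in that clock."""
    base = 0
    out = []
    for _ in range(clocks):
        # Lane k sees the base phase plus k sample steps.
        out.append([(base + k * step) & MASK for k in range(LANES)])
        # The shared base advances by LANES sample steps per clock.
        base = (base + LANES * step) & MASK
    return out

def serial_phases(step, n):
    """Reference: the phase sequence a 1 GHz serial NCO would produce."""
    return [(i * step) & MASK for i in range(n)]

step = tuning_word(45e6, 1e9)
parallel = [p for clk in lane_phases(step, 8) for p in clk]
```

Flattening the per-clock lists gives the same sequence as the serial reference, so the four lookups per clock only differ by fixed offsets from a single shared accumulator.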
05-04-2020 01:20 AM
If, as you say, "latency is not a problem", then what you may need is proper pipelining to achieve an initiation interval of one cycle. Will a table of 64k x 14 work? I would try it. If the values are consecutive, I think you can reach one value per clock.
05-04-2020 01:22 AM
You don't mention (or I missed) your data type. Floating-point products take multiple cycles, so you will need several multipliers in parallel to sustain one operation per cycle.
05-04-2020 02:10 AM
Latency is not a problem in the sense that it doesn't matter if the result is available 100 cycles later, but the solution must handle real-time and so it must deliver 8 sine and 8 cosine values at 250 MHz.
The idea is that I can cache sample data in a FIFO until I have a decision on whether the frequency is present in the data or not. Having a FIFO with hundreds or thousands of values is not a problem. That way I can skip most of the data that is not interesting and only stream relevant data over PCIe, which allows for much longer sampling periods.
It's possible I could use smaller tables. It might be enough with 8 bits, and maybe 32k or 16k might work too.
Sample values are 14-bit signed integers. The power estimate will also be some type of integer. In fact, it will come down to a yes-no decision when values are accumulated in an 8k or 16k sequence.
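As a rough sketch of the integer-only measurement described above, the following Python model correlates a block of samples with quantized cosine and sine coefficients, accumulates both sums, and compares I^2 + Q^2 against a threshold. The 8-bit coefficient width, the 8192-sample block, the test amplitude, and the names `iq_power` and `detect` are all assumptions chosen for illustration.

```python
import math

COEF_BITS = 8        # assumed coefficient width
BLOCK = 8192         # assumed accumulation length (an "8k sequence")

def iq_power(samples, f_hz, fs_hz, coef_bits=COEF_BITS):
    """Integer power estimate at f_hz: (sum x*cos)^2 + (sum x*sin)^2."""
    scale = (1 << (coef_bits - 1)) - 1
    acc_i = 0
    acc_q = 0
    for n, x in enumerate(samples):
        ph = 2 * math.pi * f_hz * n / fs_hz
        acc_i += x * round(scale * math.cos(ph))   # integer multiply-accumulate
        acc_q += x * round(scale * math.sin(ph))
    return acc_i * acc_i + acc_q * acc_q

def detect(samples, f_hz, fs_hz, threshold):
    """Yes-no decision: is there significant power at f_hz?"""
    return iq_power(samples, f_hz, fs_hz) > threshold

def tone(f_hz, fs_hz, n, amp=1000):
    """Test stimulus: an integer cosine that fits in 14-bit signed samples."""
    return [round(amp * math.cos(2 * math.pi * f_hz * i / fs_hz)) for i in range(n)]

present = iq_power(tone(45e6, 1e9, BLOCK), 45e6, 1e9)
absent = iq_power(tone(60e6, 1e9, BLOCK), 45e6, 1e9)
```

In this model the matching tone produces a power several orders of magnitude above the non-matching one, which is what makes a simple threshold workable; a real threshold would have to be tuned against noise and signal levels.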
05-04-2020 02:30 AM
While not trivial, I don't see why it would be impossible to feed and multiply integer data on a Kintex-7 at 250 MHz.
05-04-2020 02:56 AM
The DDS gives you the option for a phase output and can also make simple sine/cosine look up tables. You could have multiple sine/cosine LUTs connected to a single phase accumulator.
05-04-2020 03:42 AM
I assume that the two channels are independent because they do not share the same sine / cosine.
Therefore you have two independent phase accumulators.
But the 4 samples of the same channel are multiplied by 4 consecutive sine values in the table.
Is that right?
05-04-2020 05:16 AM
Yes, the channels are somewhat separate, mostly because they could have phase differences caused by different arrival times of the signals. They come from two antennas placed at different locations.
My original idea was to create a general-purpose sine and cosine table with a fixed number of values up to pi / 2, and with that setup, the values for consecutive samples would not be consecutive in the sine and cosine table.
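For reference, a quarter-wave table like this can be modelled in Python as follows. This is only a sketch: the 10-bit quarter-table address, the 14-bit amplitude, the half-sample phase offset (which makes the mirroring a simple index reversal), and the names `sin_lookup` and `cos_lookup` are assumptions.

```python
import math

ADDR_BITS = 10                       # assumed: quarter table has 2**10 entries
N = 1 << ADDR_BITS
AMP = (1 << 13) - 1                  # 14-bit signed full scale

# Sine from 0 to pi/2, sampled at half-step offsets so that the table
# mirrors cleanly (entry i and entry N-1-i are symmetric about pi/4).
QUARTER = [round(AMP * math.sin((i + 0.5) * (math.pi / 2) / N)) for i in range(N)]

def sin_lookup(phase):
    """phase is an (ADDR_BITS + 2)-bit integer covering one full turn."""
    quadrant = (phase >> ADDR_BITS) & 3
    idx = phase & (N - 1)
    if quadrant & 1:                 # quadrants 1 and 3 read the table backwards
        idx = (N - 1) - idx
    val = QUARTER[idx]
    return -val if quadrant & 2 else val   # quadrants 2 and 3 are negative

def cos_lookup(phase):
    """cos(x) = sin(x + pi/2): advance the phase by one quadrant."""
    return sin_lookup(phase + N)
```

Because each lookup only needs a small address transform and an optional negation, several blocks can share the same table contents, but each simultaneous read still needs its own read port or its own copy of the table.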
An alternative approach might be to create a more dedicated sine and cosine table that approximates the searched frequency closely enough with a much shorter table, tied to the sampling frequency by some factor M / N. For instance, 16 values at a 750 MHz sampling frequency would capture 46.875 MHz, and 22 values at a 1 GHz sampling frequency would capture 45.45 MHz.
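The arithmetic behind those two examples can be checked with a few lines of Python. This is a sketch that assumes the table holds exactly M cycles in N entries, so the captured frequency is fs * M / N; the function names are made up for illustration.

```python
from fractions import Fraction
import math

def captured_frequency(fs_hz, m, n):
    """Frequency represented by a table holding m full cycles in n entries."""
    return Fraction(fs_hz) * m / n

def short_table(m, n, bits=14):
    """Sine and cosine tables of n entries covering m full cycles."""
    scale = (1 << (bits - 1)) - 1
    sin_t = [round(scale * math.sin(2 * math.pi * m * k / n)) for k in range(n)]
    cos_t = [round(scale * math.cos(2 * math.pi * m * k / n)) for k in range(n)]
    return sin_t, cos_t

f_750 = captured_frequency(750_000_000, 1, 16)    # 46.875 MHz
f_1000 = captured_frequency(1_000_000_000, 1, 22) # about 45.45 MHz
sin16, cos16 = short_table(1, 16)
```

With a table this short the phase simply steps through the entries one sample at a time and wraps, so no phase accumulator or large ROM is needed for that frequency.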
05-04-2020 05:41 AM
So divide the "general-purpose sine and cosine table with a fixed number of values up to pi / 2"
into 4 tables, each with a quarter of the values, interleaved so that 4 consecutive sine values can be read in parallel.
1 ROM with 8 values: (1,2,3,4,5,6,7,8)
4 ROMs with 2 values each: (1,5) (2,6) (3,7) (4,8)
Same area, but the throughput is multiplied by 4.
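The interleaving can be sketched in Python as below, using the same 8-value example. The names `split_into_banks` and `read4` are hypothetical; the model only illustrates the addressing and assumes the table length is a multiple of 4.

```python
BANKS = 4

def split_into_banks(table):
    """Bank b holds entries b, b+4, b+8, ... of the original table."""
    return [table[b::BANKS] for b in range(BANKS)]

def read4(banks, start):
    """Read table[start .. start+3] (with wrap) using one read per bank."""
    depth = len(banks[0])
    out = []
    for k in range(BANKS):
        addr = start + k
        bank = addr % BANKS            # 4 consecutive addresses hit 4 distinct banks
        row = (addr // BANKS) % depth  # row inside that bank, wrapping at the end
        out.append(banks[bank][row])
    return out

table = [1, 2, 3, 4, 5, 6, 7, 8]
banks = split_into_banks(table)        # [[1, 5], [2, 6], [3, 7], [4, 8]]
first = read4(banks, 0)                # [1, 2, 3, 4]
shifted = read4(banks, 2)              # [3, 4, 5, 6]
```

Note that the one-read-per-bank property relies on the four addresses being consecutive; if the phase step makes them land in the same bank, this split alone does not help.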
05-04-2020 06:27 AM
05-04-2020 07:32 AM - edited 05-04-2020 07:45 AM
It seems like the CORDIC IP can produce both sine and cosine values in just one clock cycle, and that it might work at 250 MHz with a Kintex-7 device. It doesn't have huge resource usage requirements, and so adding 8 of these should work. It will definitely be worth a try.
Correction: It should be enough with 4 as both channels can operate on the same sine and cosine phase. It's just the accumulators that need to be separated.
05-04-2020 08:07 AM
A CORDIC will require at least one clock cycle per bit of precision at the output.
05-04-2020 08:25 AM
A 64k x 14 table would require 32 block RAMs. Each block RAM can be true dual port, so you can use one table to look up two values per clock. To get 8 values per clock you would need 4 of these, for a total of 128 block RAMs. While this is a large number, the Kintex-7 325T (the device on the KC705) has 445 of them, so if you aren't using them heavily for other things, you should be able to fit this.
Of course you will have to be careful using RAMs like this - they will be scattered throughout the die so you need to pipeline carefully to use them (including using the output registers of the RAM). But since latency isn't an issue, this shouldn't be a problem.
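The block RAM budget above can be reproduced with a short calculation. This sketch assumes each 36 Kb block RAM is configured as 2K x 18, which is one aspect ratio that yields the figure of 32; the function name is made up, and other aspect ratios may pack differently.

```python
import math

def brams_needed(depth, width, bram_depth=2048, bram_width=18):
    """Block RAMs for a depth x width table, tiling with one assumed aspect ratio."""
    return math.ceil(depth / bram_depth) * math.ceil(width / bram_width)

per_table = brams_needed(64 * 1024, 14)   # one 64k x 14 table
values_per_clock = 8
ports_per_table = 2                        # true dual port: two reads per clock
copies = values_per_clock // ports_per_table
total = per_table * copies
available = 445                            # 36 Kb block RAMs in the Kintex-7 325T
utilization = total / available
```

Under these assumptions the four copies take 128 of 445 block RAMs, a bit under 30% of the device.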
05-04-2020 11:14 AM - edited 05-04-2020 01:44 PM
From the CORDIC PDF it seems like it can operate in parallel mode and use one pipeline stage per bit, while still being able to output one sine and one cosine per clock. I can just feed it with an increasing phase value and then use the outputs as coefficients without having to know which phase they correspond to.
Tested it with KC705, and it only uses a bit over 1,000 LUTs and FFs with a 256k x 14 configuration. It works at 187.5 MHz and outputs a new value every clock.
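For anyone unfamiliar with the algorithm, here is a floating-point Python sketch of rotation-mode CORDIC. It is a behavioural illustration only, not the Xilinx IP: each loop iteration corresponds to what one pipeline stage would do in a fully parallel implementation, and the 16 iterations and the name `cordic_sin_cos` are assumptions.

```python
import math

def cordic_sin_cos(angle, iterations=16):
    """Return (sin, cos) of angle, for angle in [-pi/2, pi/2] radians."""
    # Pre-computed rotation angles atan(2^-i) and the total CORDIC gain.
    atans = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    # Start on the x axis, pre-scaled so the result has unit length.
    x, y, z = 1.0 / gain, 0.0, angle
    for i in range(iterations):
        d = 1 if z >= 0 else -1          # rotate towards zero residual angle
        # In hardware the multiplications by 2^-i are plain shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atans[i]
    return y, x

s, c = cordic_sin_cos(math.pi / 6)       # close to (0.5, 0.866)
```

When the stages are unrolled into a pipeline, a new phase can enter every clock and a new sine/cosine pair leaves every clock, with a latency of roughly one cycle per iteration.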