07-15-2016 02:20 PM
In an application I need to pack random-length bit fields that will eventually go to a 64-bit output stream. For definiteness, the lengths are roughly between 1 and 32, heavily skewed to the low side. My plan was to stuff a local ap_uint<64> variable till it was full, then transfer it to the output. (Is this even a good idea?)
Is it better to continuously shift bits into this local variable, something like
buf <<= nbits; buf |= bits;
If so, should/can I make it explicitly known to Vivado that buf is used as a shift register? The ap-style shift registers seem geared towards shifting in fixed size fields, so those don't fit the problem.
Is it better to use range selection (assuming cur is the current top-most bit position) to stuff the bit field into? e.g.
buf(cur, cur - nbits + 1) = bits;
I could experiment, but so much seems dependent on the environment, that one could easily come to the wrong conclusion.
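To make the first idea concrete, here is a plain-C++ sketch of the shift-and-or packer (uint64_t standing in for ap_uint<64>, a std::vector standing in for the output stream; the BitPacker/put names are just illustrative, not HLS API):

```cpp
#include <cstdint>
#include <vector>

// Sketch only: uint64_t stands in for ap_uint<64>, and `out` stands in
// for the 64-bit output stream.
struct BitPacker {
    uint64_t buf = 0;            // partially filled word, bits accumulate at the bottom
    int filled = 0;              // number of valid bits currently in buf
    std::vector<uint64_t> out;   // completed 64-bit words

    // Insert the low `nbits` bits of `bits` (1 <= nbits <= 32).
    void put(uint32_t bits, int nbits) {
        int room = 64 - filled;
        if (nbits <= room) {
            buf = (buf << nbits) | bits;   // the shift-and-or from above
            filled += nbits;
            if (filled == 64) { out.push_back(buf); buf = 0; filled = 0; }
        } else {
            // Field straddles the word boundary: emit the top part now,
            // keep the low part as the start of the next word.
            int hi = room, lo = nbits - room;
            out.push_back((buf << hi) | (bits >> lo));
            buf = bits & ((1u << lo) - 1);
            filled = lo;
        }
    }
};
```

The boundary-straddle branch is the part I expect to be expensive in hardware, since both shifts are by a variable amount.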
Just an FYI about why this is important.
With 32 channels being done in parallel and 2-4 bit inserts in each channel, the shift registers are chewing up LUTs quickly. The final project has 5 million channels, so the number of channels a ZYNQ can process has a big impact on cost. A first look says I can do 256 channels (8 groups of 32 channels in parallel). Increasing this to 512 (8 groups of 64 channels in parallel) would effectively halve our cost.
07-18-2016 05:17 AM
I can't think of a really good way to do this, in HLS or HDL. If you want to be able to put a bit of input data into any of 64 output bits in a single cycle, then that's going to require a pretty huge de-multiplexer. If you want to put multiple bits in simultaneously, you'll need multiples of the same hardware. A shift register would be efficient if you can afford lots of clock cycles (ie let it shift one position per cycle until everything is lined up), but getting HLS to do that may be tricky (it'll probably want to unroll the loop, which will get you back to the same situation you had before).
I suspect that you'll be better off using something like 8-bit or 16-bit values, storing at most one element each (ie so if you have a 3-bit input, it occupies a whole 8-bit or 16-bit value). A 32-bit input would, of course, occupy four 8-bit slots. Then you just feed them into a FIFO, and at the other end of the FIFO you take 8-byte sets to produce a 64-bit output.
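A sketch of that scheme in plain C++ (a std::queue standing in for the FIFO; the produce/consume names are just illustrative): each element occupies one byte slot, and the consumer side packs eight slots into a 64-bit word.

```cpp
#include <cstdint>
#include <queue>

// Each element (1-8 bits here; wider inputs would be split into
// multiple byte slots first) occupies one whole byte in the FIFO.
std::queue<uint8_t> fifo;

void produce(uint8_t value) { fifo.push(value); }

// Consumer side: take 8 byte-slots and pack them into one 64-bit word.
// Returns false if a full set isn't available yet.
bool consume(uint64_t &word) {
    if (fifo.size() < 8) return false;
    word = 0;
    for (int i = 0; i < 8; ++i) {
        word = (word << 8) | fifo.front();  // fixed 8-bit shift only
        fifo.pop();
    }
    return true;
}
```

The point is that every shift is by a fixed 8 bits, so no variable-amount shifter is needed anywhere.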
07-18-2016 09:49 AM
The application is a data compressor, so placing the bits into 'conveniently' sized, but not densely packed elements defeats the purpose.
Note: this is purely sequential stuffing; there is no 'random' access needed to stuff the bits into the output word. Does that change the equation any?
Given that typically 1-6 bits are being inserted at a time, would going to a smaller intermediate size like 16 or 32 bits help any? The trade-off is that I would have to absorb more boundary splits and pay the cost of reassembling these into 64-bit values.
As weird as it sounds, could I use a multiply (which would always be by some power of 2) and an add (limiting the size to stay within the DSP48's capabilities)? It is a resource that I am not taxing at this point.
07-18-2016 05:53 PM
With pure sequential stuffing, you need to be able to shift bits up by the maximum width of an input. Assuming you're after single-cycle performance, this will be 32 bits. I can't see any way around that being fairly expensive.
I guess that one option would be to only allow shifts of up to, say, 8 bits in a cycle. That way most elements will be a single-cycle insertion, but for big ones you'll have to run over multiple clock cycles (a maximum of four cycles, for a 32-bit value). If necessary you could run this module at a higher clock speed to compensate.
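As a plain-C++ sketch of that chunked approach (in HLS each loop iteration would correspond to one cycle of an at-most-8-bit shifter; put_chunked is just an illustrative name):

```cpp
#include <cstdint>

// Insert the low `nbits` bits of `bits` into `buf` at most 8 bits per
// step, so the shifter only ever moves by 1..8 positions.
// Returns the number of steps (i.e. cycles) used: at most 4 for nbits <= 32.
int put_chunked(uint64_t &buf, uint32_t bits, int nbits) {
    int steps = 0;
    while (nbits > 0) {
        int n = nbits > 8 ? 8 : nbits;                  // chunk size this step
        uint32_t chunk = (bits >> (nbits - n)) & ((1u << n) - 1);
        buf = (buf << n) | chunk;                       // small, cheap shift
        nbits -= n;
        ++steps;
    }
    return steps;
}
```

Most inserts (1-6 bits, per the earlier post) would complete in one step; only the rare wide field pays the multi-cycle penalty.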
Your DSP48 idea is very interesting. You'll still need to do a bit-shift to get the multiplicand, but that's only a single bit and it's a constant value (ie you're shifting "1" up some amount, not shifting an arbitrary value). I can imagine that being a suitable way to go about it.
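A plain-C++ sketch of the multiply-based variant (put_mac is just an illustrative name): the variable shift becomes a power-of-two multiplier, so the only thing shifted is the constant "1", and the insert becomes a multiply-accumulate of the kind a DSP48 could absorb.

```cpp
#include <cstdint>

// Shift-and-insert expressed as multiply-accumulate:
//   (buf << nbits) | bits  ==  buf * (1 << nbits) + bits
// (valid because the low nbits of buf << nbits are all zero, so | == +).
// In hardware the multiply-add could map onto DSP48 slices; operand
// widths would need to respect the DSP48's limits.
uint64_t put_mac(uint64_t buf, uint32_t bits, int nbits) {
    uint64_t scale = 1ull << nbits;   // shifting a constant "1", not buf
    return buf * scale + bits;        // multiply-accumulate
}
```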