01-30-2020 05:37 PM - edited 01-30-2020 07:25 PM
So the situation I'm in is this: as part of a larger algorithm, I want to divide one integer by another, both of which are defined by the input. Since I know the range of these integers, I'm using a precalculated array of values that I can multiply the numerator by and then bit-shift to get the result. The problem is that this array needs to be accessed 300 times over the course of three clock cycles, and the accesses are in no way predictable (e.g. they could hit a different portion each time, or they could literally pull the same entry each time). If I understand how all this works correctly, that means partitioning it would be useless, which is the tough part, since most of the solutions I come across when searching are ones where that pragma is relevant.
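For readers unfamiliar with the multiply-and-shift trick described above, here is a minimal sketch in plain C++. The bounds (`MAX_DIVISOR`, 16-bit numerators), the shift amount and all function names are assumptions made for illustration, not the poster's actual values. With numerators below 2^16 and divisors up to 1024, a 32-bit shift with a rounded-up reciprocal gives the exact truncated quotient.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bounds: numerators below 2^16, divisors in 1..MAX_DIVISOR.
const uint32_t MAX_DIVISOR = 1024;
const unsigned SHIFT = 32;

// recip[d] holds ceil(2^SHIFT / d). Entry 0 is unused.
// Entries are 64-bit because recip[1] == 2^32 does not fit in 32 bits.
static uint64_t recip[MAX_DIVISOR + 1];

// Fills the table once; in an HLS design this would typically be a
// constant ROM rather than something computed at run time.
void init_recip() {
    recip[0] = 0;
    for (uint32_t d = 1; d <= MAX_DIVISOR; ++d) {
        recip[d] = ((uint64_t(1) << SHIFT) + d - 1) / d;
    }
}

// Computes the truncated quotient n / d with one multiply and one shift.
uint32_t div_by_lookup(uint16_t n, uint32_t d) {
    assert(d >= 1 && d <= MAX_DIVISOR);
    return uint32_t((uint64_t(n) * recip[d]) >> SHIFT);
}
```

The table lookup `recip[d]` is the array access the rest of the thread is about: every division costs one read at an index that depends on the input.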
So, what I tried instead of having one array is making a 2D array that holds multiple copies of the same array, partitioning that array in the 1st dimension, and then mapping the different accesses (which are in a loop) to the different copies. This *changed* how it works slightly, but after doing comparisons I'm not sure it actually helped much, which goes against my intuition of how to approach this.
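The replicated-table idea described above might look roughly like the following sketch. The sizes and names (`TABLE_SIZE`, `N_COPIES`, `N_LOOKUPS`, `fill_copies`, `lookup_all`) are hypothetical and kept small for readability, and the pragmas use Vivado HLS syntax; a standard C++ compiler simply ignores them. Whether the tool actually schedules the reads in parallel depends on the tool version and the surrounding code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sizes, chosen small for illustration.
const int TABLE_SIZE = 64;
const int N_COPIES   = 4;
const int N_LOOKUPS  = 8;

// Duplicates one table into every row of the 2D array.
void fill_copies(const uint32_t table[TABLE_SIZE],
                 uint32_t copies[N_COPIES][TABLE_SIZE]) {
    for (int c = 0; c < N_COPIES; ++c) {
        for (int j = 0; j < TABLE_SIZE; ++j) {
            copies[c][j] = table[j];
        }
    }
}

// Each lookup i reads from copy (i % N_COPIES), so no single copy
// serves more than N_LOOKUPS / N_COPIES reads (2 here).
void lookup_all(const uint32_t copies[N_COPIES][TABLE_SIZE],
                const uint8_t idx[N_LOOKUPS],
                uint32_t out[N_LOOKUPS]) {
#pragma HLS ARRAY_PARTITION variable=copies complete dim=1
#pragma HLS ARRAY_PARTITION variable=idx complete dim=1
#pragma HLS ARRAY_PARTITION variable=out complete dim=1
    for (int i = 0; i < N_LOOKUPS; ++i) {
#pragma HLS UNROLL
        assert(idx[i] < TABLE_SIZE);
        out[i] = copies[i % N_COPIES][idx[i]];
    }
}
```

Because the copy index `i % N_COPIES` is known at compile time once the loop is unrolled, the random part of the address only selects within one copy, which is what lets each copy live in its own memory.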
I'm sure I must be missing something here, but in the time I spent looking this is the best way I have found so far. So please, if you have any suggestions on how to better address this, let me know. As a quick note, interestingly enough, it fails in such a way as to need a *faster* interval than I asked for, since I also have a latency max pragma at 3. With the interval at 3, though, the latency goes to 4. So, in addition to the ports, if there are any quick tips on reducing the latency of an array access like this, I am all ears.
02-03-2020 01:56 AM
Your description does not indicate in detail where the lookup array is stored. The behavior will be quite different if the array is in DDR, in BRAM, or in UltraRAM. If you're using DDR, accesses need to be done via one of the AXI ports provided for this: the two or four AXI HP ports (the number depends on the model), the ACP or the GP port. In any case, a random access needs 7 cycles to be set up (the reason is tied to the Xilinx AXI master IP; I asked about it recently at https://forums.xilinx.com/t5/High-Level-Synthesis-HLS/What-is-the-reason-behind-7-wait-states-when-accessing-DDR/td-p/1065864 but didn't get a response). You may be able to pipeline the loop so that the throughput of those accesses is 1 per cycle, but I fail to see how you could get 300 accesses in 3 cycles. You might be able to obtain this with BRAM (or UltraRAM), but then partitioning is key, since each BRAM block is dual ported, and thus you need to spread the 300 accesses over at least 150 blocks (and, more likely, 300). That many blocks just don't exist in a low-end part such as the Zynq-7010. Plus, it will be quite hard to get to work, given that your accesses are random.
I would suggest rethinking the approach: 300 irregular accesses in 3 cycles are quite hard to do (if not impossible). If you're doing those just to optimize a division, there are probably better ways to do it. I can't suggest anything more detailed, as you are not describing your computation with much clarity.
02-03-2020 02:16 AM
The "correct" way to do it is with a fully-partitioned array, which will result in HLS building at least 100 N-to-1 multiplexers (possibly 300 if it can't figure out how to spread the reads out over the available cycles).
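As a rough illustration of the fully-partitioned variant, the sketch below keeps a tiny constant table and reads it several times per call. The table contents, `TABLE_SIZE`, `N_READS` and the function name are made up for the example. With `complete` partitioning each element would be expected to become its own register or constant, and each of the unrolled reads then selects one of them, i.e. one `TABLE_SIZE`-to-1 multiplexer per read.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sizes: a 16-entry table read 4 times per call.
const int TABLE_SIZE = 16;
const int N_READS    = 4;

void lookup_registers(const uint8_t idx[N_READS], uint32_t out[N_READS]) {
    // Placeholder contents (squares of the index), not real reciprocals.
    static const uint32_t table[TABLE_SIZE] = {
        0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225};
#pragma HLS ARRAY_PARTITION variable=table complete dim=1
#pragma HLS ARRAY_PARTITION variable=idx complete dim=1
#pragma HLS ARRAY_PARTITION variable=out complete dim=1
    for (int i = 0; i < N_READS; ++i) {
#pragma HLS UNROLL
        assert(idx[i] < TABLE_SIZE);
        // Data-dependent index into a partitioned array: one mux per read.
        out[i] = table[idx[i]];
    }
}
```

The cost of this approach grows with both the table size (mux width) and the number of simultaneous reads (mux count), which is why it gets expensive at 100 or more reads per cycle.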
However, I feel like just using a bunch of fast, pipelined dividers might be more efficient.
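A minimal sketch of the divider-based alternative, assuming 16-bit unsigned operands and hypothetical names (`N_DIVS`, `divide_all`): the division is written directly and the loop is pipelined, leaving it to HLS to instantiate a pipelined divider core. How many dividers are needed, and what latency they reach, would depend on the operand widths and on any additional unrolling.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical batch size for illustration.
const int N_DIVS = 8;

void divide_all(const uint16_t num[N_DIVS],
                const uint16_t den[N_DIVS],
                uint16_t quot[N_DIVS]) {
    for (int i = 0; i < N_DIVS; ++i) {
#pragma HLS PIPELINE II=1
        assert(den[i] != 0);
        // Plain division: no lookup table, so no memory-port bottleneck.
        quot[i] = static_cast<uint16_t>(num[i] / den[i]);
    }
}
```

With `II=1` a single pipelined divider accepts one new division per cycle; partially unrolling the loop would be one way to trade area for more divisions per cycle.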