09-30-2016 08:43 PM
I have an HLS design that uses a fair bit of memory, about 10% of a Virtex-7 485 across 12 separate memory blocks. I'm trying to hit a high but, I think, reasonable frequency target of 300 MHz, because parallelizing to handle multiple work items could require duplicating the memory resources.
Currently I'm having trouble hitting the desired frequency (the target is 3 ns for exploration, and I'm hitting 3.5-4 ns), and the memory seems to be the bottleneck, in particular the paths to the ENBWREN/ENARDEN/REGCEB ports on the RAMB36E1s.
These form the vast majority of the top 200 failing paths.
I have the following resource directive on the arrays:
#pragma HLS RESOURCE variable=<array name> core=RAM_S2P_BRAM latency=8
This should give me a few pipeline stages each on the input and output of the BRAM, and it seems to work fine on the data/address paths, but it does nothing for the control signals above, which end up with 4-5 levels of logic feeding them.
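For context, here is a minimal sketch of how that directive might sit in a kernel. The function and variable names (read_back, buf, N) are invented for illustration; only the pragma syntax is taken from the post above (Vivado HLS 2016.x style).

```cpp
#include <cassert>

const int N = 1024;  // illustrative depth, large enough to infer a BRAM

// Hypothetical kernel: stream data into a local buffer, then read one word.
int read_back(const int in[N], int idx) {
    static int buf[N];
    // Map buf onto a simple dual-port BRAM and request extra latency so HLS
    // can pipeline the data/address paths, as described in the post.
    #pragma HLS RESOURCE variable=buf core=RAM_S2P_BRAM latency=8
    for (int i = 0; i < N; ++i) {
        #pragma HLS PIPELINE II=1
        buf[i] = in[i];
    }
    return buf[idx];
}
```

Note that, per the complaint above, the latency option buys registers on the data and address paths only; the generated enable network is not affected.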
What I'm trying to determine is whether there are any HLS directives I can use that will improve the situation, or whether this is caused by something in the overall architecture of the HLS IP, which currently tightly couples the input and output data flows (which are AXI-Stream). I've tried partitioning the BRAMs, but that actually made performance worse (it added another stage of logic to the critical path).
10-01-2016 07:08 AM
I see someone else has run into this issue. I've talked to Xilinx about this very problem myself. Unfortunately there is no workaround; I ultimately ended up having to run my design at a slower clock speed. This is not a problem with your code; it is a fundamental issue with HLS, and while I believe Xilinx is working to address it, there isn't a solution for now.
10-04-2016 07:13 AM
One option I am exploring is having the memory as external ports (array arguments) on the HLS IP and implementing the memory blocks myself so I can control the pipelining directly. Have you tried this approach, and if so, had any luck with it?
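A rough sketch of that idea, with invented names (process, mem_a, mem_b): the arrays become top-level arguments with an ap_memory interface, so the physical BRAMs live outside the HLS IP and their pipelining can be controlled by hand in RTL.

```cpp
#include <cassert>

const int N = 1024;  // illustrative depth

// Hypothetical top function: both memories are exposed as ap_memory ports,
// so the block RAMs themselves are instantiated outside the generated IP.
void process(int mem_a[N], int mem_b[N]) {
    #pragma HLS INTERFACE ap_memory port=mem_a
    #pragma HLS INTERFACE ap_memory port=mem_b
    for (int i = 0; i < N; ++i) {
        #pragma HLS PIPELINE II=1
        mem_b[i] = mem_a[i] + 1;  // placeholder computation
    }
}
```

As the next reply explains, this alone does not solve the problem: the IP still drives the port's clock enable from its internal FSM, so the enable path remains.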
10-04-2016 02:30 PM
I did play with this idea, and unfortunately you can't avoid the issue. HLS uses the clock enable to control every one of the register stages at once, so the enable still has to reach the BRAM. I tried registering the clock enable before it reached the block RAM, but that's not equivalent: my design mostly worked, but there were corner cases where it stopped working. The state machine actually drops that clock enable to suspend the pipeline, and it is built around that happening immediately. I discussed this with my peers and we couldn't find a workaround. If you figure out a clever solution, let me know.
10-17-2016 07:55 AM
I can see how that would be an issue. I may be able to get around it in my application by guaranteeing that the enable is continuous during processing. Thanks for the insight.
10-17-2016 05:32 PM
A couple of pointers to try to improve the solution:
1- In 2016.3 there are more registers by default in the AXIS interface ports, basically registering the forward and reverse (back-pressure) paths. This might help you in this design, since you mentioned an AXIS interface and tight coupling.
2- Consider the fan-in/fan-out of the enable signals to the BRAMs: how many BRAMs (18K or 36K) are you talking about? I believe that this number, alongside the logic generated by VHLS, limits how far / how fast / how many BRAMs you can reach - all of those being trade-offs of each other. For example, you might reach 100 BRAMs at 3 ns, 400 at 5 ns, or 10 at 2 ns... made-up numbers, but I hope you get the picture.
Long story short, if you can *really* split the logic driving the M BRAMs into 2, 4, ... N groups, then I would think you can improve the situation. Rather than N=1 cloud of logic fanning out to M BRAMs, N=2 clouds of logic each fanning out to M/2 BRAMs should be easier. At the extreme, N=M clouds and M BRAMs in a 1:1 connection would probably be the fastest design, but M logic clouds might not be negligible for your application.
Finding the right trade-off for N is a task left to you until improvements are made in the tools (it's not a VHLS-only issue, as this touches how logic synthesis and placement perform, or could perform, logic duplication, for example).
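One way to approximate the N-clouds-over-M/2-BRAMs split from within HLS is to partition the array so each half becomes its own physical memory with its own, smaller enable network. This is only a sketch with invented names (sum_all, buf), and it should be said that the original poster reported partitioning actually hurt timing in their particular design.

```cpp
#include <cassert>

const int N = 2048;  // illustrative depth

// Hypothetical kernel: buffer the input, then reduce it.
int sum_all(const int in[N]) {
    int buf[N];
    // Split buf into 2 physical memories so each partition's control logic
    // fans out to fewer BRAM primitives (the N=2 case from the text above).
    #pragma HLS ARRAY_PARTITION variable=buf block factor=2
    #pragma HLS RESOURCE variable=buf core=RAM_S2P_BRAM
    int acc = 0;
    for (int i = 0; i < N; ++i) {
        #pragma HLS PIPELINE II=1
        buf[i] = in[i];
    }
    for (int i = 0; i < N; ++i) {
        #pragma HLS PIPELINE II=1
        acc += buf[i];
    }
    return acc;
}
```

Whether this helps or hurts depends on how much extra steering logic the partitioning introduces on the critical path, which is exactly the trade-off described above.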
I hope this helps, but please note those are my "personal" views only!
PS: I don't think you can guarantee that the BRAM enables are always high, since as soon as the VHLS-generated FSM is stalled the enables will drop - this is by construction - and this is also the situation when you use AXIS/FIFO interfaces: there is always logic generated for back pressure / stalling.
10-17-2016 06:33 PM
Thanks for the response! I saw the AXIS update for 2016.3, and that looks very helpful for those interfaces. However, for the ap_memory interface, when connecting to BRAMs the clock/chip enables have to go directly to the primitives. Even for designs that are not terribly complex, those interfaces can have many levels of logic in front of the primitives. In my design (which is very complex, with lots of control) I had 8-12 levels of logic on just those enables before reaching the BRAM. Those nets always show up as large failures in post-synthesis timing analysis, and to my knowledge there is no reasonable workaround. This is something I don't allow in any HDL I code, or my co-workers code, as it leads to large difficulties closing timing.

Each instance of my design had a 16-block-RAM memory, but there were many instances of it. I was able to close timing on the final design by slowing the clock down from 250 MHz to 200 MHz (though post-synthesis still shows many violations). Fortunately, at the end of the day, I was able to meet my performance requirements.
In general I think this issue is fundamental to the way HLS generates its state machines and control logic, and I don't think it will be resolved without a major re-architecture of the tool.