mkhazraee
Observer
1,111 Views
Registered: ‎05-04-2018

Verify Partial Reconfiguration flow is actually locking routing for static part of the design

Hi all,

I'm using Partial Reconfiguration for a large design on UltraScale+ (xcvu9p). I have two sets of PR modules: an initial accelerator (let's call it A), and 16 parallel accelerators (let's call them B), which are connected through two levels of heavily pipelined switches with a gateway unit in front of the parallel units. In other words, no signal goes directly from A to B, so the timing of the A and B units should be independent. I also define a range for the boundary of each PR Pblock to help the routing process.

The main implementation, with A1 and B1, meets timing. I then make a PR run with A2 and B2, which is a child of the main implementation, and it also meets timing with the same margin (in both cases the WNS is in a module entirely in the static portion, in the softcore part of the PCIe IP that is placed inside an EXCLUSIVE Pblock). Then I make a third run just to be able to load the full FPGA with A1 and B2, and it fails timing! (It's a single node failing with a WNS of -0.014, but still.) This is very counterintuitive: since A1 and B2 each already meet timing, I should be able to simply load them on the fly instead of needing this run. I expected Vivado to be smart enough to reuse the placed and routed PR modules, but it didn't.

What I'm worried about is that the routing is not really locked. I'm using the project flow, and in the log I do see it doing write_checkpoint, update_design to black-box all the PR blocks, and then locking netlist, placement, and routing in order. Also, in some earlier failing runs I saw a base design that passed, with a failing net in the static part in the PR run (A2 and B2). I thought that might be a fly-over net, but it was too far away (in another SLR, in a module with no connection to the accelerators) for that to be the case.

How can I make sure the routing is the same among them? As in be sure I can do A1+B2 on the fly without causing problems. It's a large design and cross checking visually is not possible, so is there a command for it?

Also, the Tcl script I run to make the last run is very simple: create_pr_configuration, create_run as a child of the main implementation, launch run. These are the same steps as creating the run through the Vivado GUI. Should I do any other step in the project flow to load the locked design, or something of that sort?

Thanks,
Moein

0 Kudos
1 Solution

Accepted Solutions
davidd
Xilinx Employee
596 Views
Registered: ‎11-17-2008

@mkhazraee,

1. EXCLUDE_PLACEMENT is always set on Reconfigurable Partitions.  This is a fundamental requirement of DFX, keeping logical placement separated so that GSR initialization only hits all of the dynamic logic after it has been loaded.  We won't let you turn off EXCLUDE_PLACEMENT.  This is not EXCLUDE_ROUTING however -- there is no such thing.  Static routing is free to go wherever it needs to, and that's just fine.  Static routes that cross dynamic regions will not glitch during reconfiguration and are not held to a constant value.  Load up the static-only locked design checkpoint and you'll see all the routes are dashed, indicating they are locked.
As for boundary nets, some may be completely rerouted if the static logical element is within the expanded routing region.  If the placement of this element is inside the larger footprint that will be used to build the bitstream and the entire route is within this expanded region, we can take that point as the partition pin and be more flexible while routing the entire net -- we're going to reconfigure that entire region anyway, so we don't want to restrict the solution space.  But when you say "end cell inside the PR" do you mean a logical element within the Reconfigurable Module hierarchy?  That can definitely change -- it's dynamic!  Use the hd_visual scripts to see your base Placement area compared to the expanded Routing area -- if you're not aligned to clock regions (which is fine), they'll be different (which is also fine).  For more information on this topic, check UG909 or watch my Quicktake Video.

2. Check out the combination of A2 + B1 using the methodology I described in my prior post.  If you're building a full configuration using routed checkpoints for top and A and B, they had better create a fully routed result.  This is where it is critical to create all your A and B modules using the same version of locked static top.  But if you can't make it work for the A2 + B2 child run, look at what's happening on the static side of that BRAM column.  Perhaps some static usage is preventing the second run from building a particular shape, and some guidance (like prohibiting static sites or adjusting the floorplan) may be necessary.

3. The SCOPED_TO_CELLS directive is simply the way you assign a particular checkpoint to a particular opening.  You could have multiple instantiations with the same aperture so this command (or read_checkpoint -cell) makes the assignment clear, removing any ambiguity. 

Regarding pr_verify, it's only looking to match what it sees, confirming that the entirety of the static is identical and safe to use for bitstream generation.  It expects to see differences in the RMs, completely routed or not.  To validate routing in general, run report_route_status on the post-route design to ensure everything has completed successfully.  The log file should also tell you, but this is an explicit check that you can perform as well.
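
A minimal Tcl sketch of the checks mentioned above, assuming the static-only locked checkpoint is saved as ./Checkpoint/static_route_design.dcp (a placeholder path). It prints the route status summary and counts how many routed nets carry fixed routing. On a design of this size the net queries may take a while, and boundary nets inside the expanded routing region may legitimately show up as not fixed.

```tcl
# Assumption: this is the locked, static-only checkpoint (RMs black-boxed)
# written at the end of the parent run. The path is a placeholder.
open_checkpoint ./Checkpoint/static_route_design.dcp

# Summary of fully routed, unrouted and partially routed nets.
report_route_status

# Compare the number of routed nets with the number whose routing is fixed.
set routed [get_nets -hierarchical -filter {ROUTE_STATUS == ROUTED}]
set fixed  [get_nets -hierarchical \
                -filter {ROUTE_STATUS == ROUTED && IS_ROUTE_FIXED == 1}]
puts "routed nets: [llength $routed], with fixed routing: [llength $fixed]"
```

The same report_route_status call can be repeated on each routed configuration checkpoint to confirm that nothing is left unrouted.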

thanks,

david.

9 Replies
hongh
Moderator
1,073 Views
Registered: ‎11-04-2010

1. If configuration {A1+B1} and configuration {A2+B2} both complete routing without timing violations on the same static logic, configuration {A1+B2} can be obtained directly by reusing the previous results. It looks like the project flow is currently not smart enough to reuse the existing results.

You can try the non-project flow to get the routed design for configuration {A1+B2}.

2. To confirm whether the routing is the same among the different configurations, you can try running the pr_verify command.

3. The commands you use in the project flow look correct.
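
As a rough sketch, pr_verify can be run on two routed configuration checkpoints that were built from the same locked static design. The checkpoint and report file names below are hypothetical placeholders.

```tcl
# Hypothetical file names: routed checkpoints of two full configurations
# that share the same locked static design.
set initial_dcp    ./config_A1_B1_routed.dcp
set additional_dcp ./config_A2_B2_routed.dcp

# -full_check reports all differences instead of stopping at the first one.
pr_verify -full_check \
    -initial $initial_dcp \
    -additional [list $additional_dcp] \
    -file ./pr_verify_A1B1_vs_A2B2.rpt
```

Note that this compares the static portion and the partition interfaces; it does not check whether the logic inside each Reconfigurable Module is completely routed.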

mkhazraee
Observer
979 Views
Registered: ‎05-04-2018

1) Well, pr_verify is not that useful. I cannot do pr_verify between two PR blocks from different designs, as it tells me they don't have any PR blocks inside (0 is not same as 0, which both should be non-zero). So I cannot check if I can use B2 in the design with A1+B1. I can do among full designs A1+B1 and A2+B2 (or their black boxed versions), but the tricky part is I have another set of runs where A1+B2 design has an unrouted net, and still if I do pr_verify among A1+B1 and A1+B2 it passes, so not sure what is pr_verify assuring me of!

2) Based on your 1 and 3 response, are you suggesting that project flow might have some bugs and doing non-project flow might do a better job?

 

Also something is missing about the routing of PR blocks. This is the failed net in the 3rd run, and its zoomed version:

[Attached images: failed_timing_net.jpg, failed_timing_net_zoomed.jpg]

As you can see it starts from a static cell (orange) to a boundary cell (light grey). Now looking at the same path in the parent implementation that met timing we have:

[Attached image: failed_timing_net_orig_impl.jpg]

(I couldn't find the path in the original design as it wasn't top 10 worst WNS, so I had to use these commands and you see extra nets, but not extra cells:
highlight_objects -color red [get_cells -of_objects [get_timing_paths -from ... -to ...]]
highlight_objects -color yellow [get_nets -of_objects [get_timing_paths -from ... -to ... ]])

The static side looks similar, but in the PR part one of them ends in a boundary cell, which makes sense, but the other one doesn't. I have two questions:

3) What's the granularity of partial reconfig for routing resources? As mentioned in other posts the routing resources above PR blocks are shared with static part (we can use EXCLUSIVE but it won't meet timing, as mentioned by Xilinx representative in another forum post). So each single routing can be updated during PR update? 

4) Assuming last point is the case, isn't the idea of boundary nets to stabilize the interface and routing to it, so we have to only deal with routing between boundary and internal cells of a PBlock? What nets are reconsidered for routing that causes such difference shown in the photos?

Thanks,
Moein
[P.S.: I first posted a longer and more detailed reply, but somehow it got removed, which is very annoying and disappointing!]

0 Kudos
mkhazraee
Observer
812 Views
Registered: ‎05-04-2018

Let me make the problem more clear with a daunting error. I make a base design with A1 and B1, then I do a next build with A1 and B2, and finally a next build with A2 and B2. Several times in the last build and sometimes in the second build I get this error:

Phase 4 Initial Routing Verification

Post Initial-Routing Verification
---------------------------------
CRITICAL WARNING: [Route 35-54] Net: core_inst/riscv_cores[11].core_wrapper/axis_dma_inst/dma_cmd_hdr_wr_addr[2] is not completely routed.
Resolution: Run report_route_status for more information.
INFO: [Route 35-3320] Number of nets that did not attempt to route: 12359

Unroutable connection Types: 
----------------------------
Checking all reachable nodes within 5 hops of driver and load 

Unroute Type 1 : Potential Placement Issue

	Type 1SLICEL.AMUX->SLICEM.D6]
	-----Num Open nets: 1
	-----Representative Net: Net[384048] core_inst/riscv_cores[11].core_wrapper/axis_dma_inst/dma_cmd_hdr_wr_addr[2]
	-----SLICE_X147Y626/AMUX -> SLICE_X146Y635/D6
	-----Driver Term: core_inst/riscv_cores[11].core_wrapper/axis_dma_inst/riscv_cores[11].pr_wrapper_i_20/O Load Term [1904898]: core_inst/riscv_cores[11].pr_wrapper/dma_wr_reg/simple_reg/pipe_reg[0].reg_inst/m_axis_tdata_reg[48]_i_1__1/I3
	Type 2SLICEL.AMUX->SLICEM.EX]
	-----Num Open nets: 1
	-----Representative Net: Net[384048] core_inst/riscv_cores[11].core_wrapper/axis_dma_inst/dma_cmd_hdr_wr_addr[2]
	-----SLICE_X147Y626/AMUX -> SLICE_X146Y635/EX
	-----Driver Term: core_inst/riscv_cores[11].core_wrapper/axis_dma_inst/riscv_cores[11].pr_wrapper_i_20/O Load Term [1904897]: core_inst/riscv_cores[11].pr_wrapper/dma_wr_reg/simple_reg/pipe_reg[0].reg_inst/temp_m_axis_tdata_reg_reg[48]/D
Phase 4 Initial Routing Verification | Checksum: 20a740973

This is from the third step, meaning I am only changing A1 to A2, while the error shows up in one of the B2 modules (riscv wrapper). This is one of the better runs, with a single error; sometimes I get 50 of them. I tried an incremental build from A1+B2 (from the routed DCP), and I also tried a run where I make A1 a greybox, so there is no routing inside it, and then build A2+B2 incrementally from greybox+B2, both to no avail.

Some modifications to A1 make the error go away; another change to either the As or the Bs and it shows up again. As I mentioned, the As and Bs are separated by 2 levels of switching and at least 5 pipeline registers. The net in error is between the riscv module and its wrapper, and the riscv is connected only to the wrapper and nothing else in the design: the wrapper is in the static part and manages DMA to the riscv memory, and the riscv is inside a PR Pblock.

Reading the forums about this error suggests a potential problem in the XDC constraints making the routing impossible, but I only set a range with HD.PARTPIN_RANGE and let Vivado select the exact placement, and again the PR run does not change anything around the failing net. So what is Vivado trying to achieve that breaks a net which shouldn't change at all? The As and Bs are not even in the same SLR.
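
One way to follow the "Run report_route_status" resolution from the log is sketched below. It assumes the child-run design containing the unrouted net is open in Vivado; the net name is copied from the log above, and the loop simply prints where each leaf pin's cell is placed and which Pblock it belongs to.

```tcl
# Net name taken from the Route 35-54 message in the log above.
set net [get_nets {core_inst/riscv_cores[11].core_wrapper/axis_dma_inst/dma_cmd_hdr_wr_addr[2]}]

# Detailed route status for this one net only.
report_route_status -of_objects $net

# Print the placement and Pblock of every leaf cell on the net, to see
# which ends are static and which are inside the reconfigurable Pblock.
foreach pin [get_pins -leaf -of_objects $net] {
    set cell [get_cells -of_objects $pin]
    puts "$pin  LOC=[get_property LOC $cell]  PBLOCK=[get_property PBLOCK $cell]"
}
```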

0 Kudos
davidd
Xilinx Employee
721 Views
Registered: ‎11-17-2008

@mkhazraee,

This is a fundamental limitation of running the DFX flow in project mode -- while you may have generated results for RMs A1, A2, B1 and B2, the flow will not reuse results from child runs in further child runs.  So as you have seen, if the first child run is A2+B2 and the second child run is A1+B2, the results for B2 could be different as they are created by implementing from the post-synthesis checkpoint each time.  You can see this not only in your results but in the Tcl that is called for each run.  The reason we do this is dependency management.  There is a relationship between parent and child, but not between different child runs.  For example, if the parent run is marked out of date and needs to rerun, all child runs are also reset to be rerun.  This is not true if one child run (say the one with the first B2 instance) is reset -- the other child run is not reset, and then later the two B2 results may not match.

As Hong noted earlier, you can build any configuration with different RMs by linking routed checkpoints.  This must be done outside the project framework, but you can load the routed and locked static checkpoint, then link in routed checkpoints for A1 and B2, then you can generate full and partial bitstreams, run timing analysis or simulation, or whatever else you'd like to do.  So you can still get the combination you need by starting in the project flow, and in the non-project flow you have full control of different combinations that are possible.

thanks,

david.

0 Kudos
mkhazraee
Observer
698 Views
Registered: ‎05-04-2018

Hi @davidd and thanks for your reply. 

Let me clarify things a little bit more. The parent run is A1+B1, not a child run. So from parent run of A1+B1, I first swap the B1 to be B2, and then swap A1 to A2. The snapshots in the previous post are comparing the parent run with the child run, not the child runs among themselves.

Also, I tried what you said but in project mode, by adding the first child's routed DCP file to the utils filelist of the second child and mentioning to use that for incremental implementation. (First child is A1+B2, second child is A2+B2). Indeed it says all 16 instances of B2 are the same in both runs, and it has 7% change from A1 to A2. However, it never finished routing in this mode, always having 2 or so unrouted nets. I further did another try where after A1+B2, I made a different run from that where A1 was greybox (greybox+B2), so the fixed placement doesn't impact the unrouted nets, and the timing closure slack was exactly the same as running A2+B2 from the parent run (A1+B1). 

I'm not sure if there is anything that is more limited in project mode than non-project mode, as I see it literally just puts the tcl commands for locking the placement and routing at the end of parent run, exactly same as the non-project mode tutorials. And as I said, I was able to do the DCP load to start from a child process. Please let me know if there is another limitation in project mode that I can check, or generally speaking I shouldn't use project mode for PR runs.

Moein

P.S.: Increasing boundary cells for each of B2 Pblocks helped reduce the occurrence of that unroute error significantly, but it still happens sometimes.

0 Kudos
davidd
Xilinx Employee
618 Views
Registered: ‎11-17-2008

@mkhazraee, I think we are not aligned in our understanding yet.

Are you looking to modify the results of top or any of these A or B runs, or are you just looking to create a specific combination of RMs to build a full design image?  My explanation was for the latter.  You only have to build any Reconfigurable Module once; as soon as you have a result you can mix and match to get any combination.  If you only have RMs A1 and A2 for RP A, and RMs B1 and B2 for RP B, you only need to run through place and route.  With a parent run of Top+A1+B1 and a child run of Top(locked)+A2+B2, you don't have to run place and route to get results for Top(locked)+A1+B2 or Top(locked)+A2+B1, you just load in the appropriate checkpoints and link them together.  This step though must not be done in project mode, it must be done manually.  Something like this:

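# $part and $top must already hold the device part and the top module name.
create_project -in_memory -part $part

# Locked, routed static design.
add_files ./Checkpoint/static_route_design.dcp

# Routed RM checkpoints, each scoped to the cell (RP instance) it fills.
add_files ./A1_rm_routed.dcp
set_property SCOPED_TO_CELLS {inst_A} [get_files ./A1_rm_routed.dcp]

add_files ./B2_rm_routed.dcp
set_property SCOPED_TO_CELLS {inst_B} [get_files ./B2_rm_routed.dcp]

link_design -mode default -reconfig_partitions {inst_A inst_B} -part $part -top $top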

At this point, because Top, A1 and B2 were all routed coming in, the entire design is already routed and you can build a Top+A1+B2 bitstream or run timing analysis or whatever.  No, project mode doesn't account for this path, but the project mode is focused on getting all the RMs routed to get all your partial bitstreams, and keeping dependencies managed.
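
A hedged sketch of what could follow link_design in the same Tcl session, to confirm the linked design really is fully routed before building bitstreams. The output file names are placeholders and not from this thread.

```tcl
# Run after link_design in the in-memory project.

# Explicit check that no net is left unrouted or partially routed.
report_route_status

# Timing of the assembled Top + A1 + B2 configuration.
report_timing_summary -file ./top_A1_B2_timing.rpt

# Keep a checkpoint of the assembled configuration (placeholder name).
write_checkpoint -force ./top_A1_B2_routed.dcp

# For a DFX design this writes the full bitstream and the partial ones.
write_bitstream -force ./top_A1_B2
```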

Does this get you what you need?

thanks,

david.

mkhazraee
Observer
604 Views
Registered: ‎05-04-2018

@davidd I'm trying to make a full design (and obviously swap them while the FPGA is running). What you are saying makes perfect sense and that's what I'm looking for, assuming everything is working as expected. But mainly I'm concerned if they are working as expected!

1) Looking at the routing results from the base run and the PR run, I was worried that there is a bug in the routing. I think the bug comes from EXCLUDE_PLACEMENT not being set. From what I read on the forums, if I don't set EXCLUDE_PLACEMENT, Vivado can use routing resources inside the Pblock for the static part, which is the default and makes it easier for Vivado to meet timing. (It's a little counterintuitive: I thought a PR region must exclude placement in order to be swapped, while the routing is flexible. But I guess that since a PR Pblock is already not SOFT and already CONTAINS ROUTING, EXCLUDE_PLACEMENT acts as EXCLUDE_ROUTING.)

Now if EXCLUDE_PLACEMENT is not set, how does that work when swapping a PR region while the FPGA is running? In other words, if I want to use SCOPED_TO_CELLS, is it safe when EXCLUDE_PLACEMENT is not set and some routing resources inside the PR block are used for the static design? Since the routes are locked, I expected the routes to the boundary cells to be fixed in the base run, and the routes going through the PR block to be fixed, or at least their end points. But those snapshots show that's not the case: the end cell inside the PR changes between the base run and one of the PR runs. Maybe this is just a bug that happens only for runs without EXCLUDE_PLACEMENT.

2) There is also another scenario that really worries me: locked BRAM cells. Here I've already asked a question regarding that:
https://forums.xilinx.com/t5/Implementation/Blocked-BRAM-after-PR-update/m-p/1080796#M27542

Long story short, if I make a base run with A1+B1 and then want to make a PR run with the exact same A1+B1 configuration, I lose some of my BRAMs! Now consider that I do A1+B1 in one run and A2+B2 in the PR run, as you suggested (B2 uses less memory because of this problem). I'm worried that with SCOPED_TO_CELLS, A2+B1 is not going to work, because I was never able to make it work in a PR child run.

3) I'm going to try the SCOPED_TO_CELLS method, but honestly it would have been much nicer if Vivado could do the merging on its own. For example, I once made a mistake and started a PR run that was exactly the same as the parent run, with the checkpoint forced to load from the routed parent DCP. It took 3 hours for Vivado to report that they're the same and output the bitstream after a quick round of routing.

Also, the fact that pr_verify didn't warn about an unrouted net is very concerning. Fortunately, Vivado didn't generate the bitstream when there was an unrouted net, so if after SCOPED_TO_CELLS Vivado generated a bitstream I assume that's going to work.

Moein

0 Kudos

mkhazraee
Observer
583 Views
Registered: ‎05-04-2018

Thanks a lot @davidd. That explains everything.

For some reason I saw EXCLUDE_PLACEMENT not set for a PR region a while back; maybe that was a buggy run, hence the confusion. I also thought that contained routing meant nothing goes outside the PR region. But the expanded routing region for the boundary cell now fully explains those snapshots. Thanks as well for clarifying what pr_verify actually checks, and hence why there was no warning.
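
For anyone who wants to double-check the Pblock settings discussed here, a short sketch that prints them for every Pblock. It assumes an implemented design or routed checkpoint is already open in Vivado, and it lists all Pblocks, not only those of Reconfigurable Partitions.

```tcl
# Print the placement and routing containment settings of each Pblock
# in the currently open design.
foreach pb [get_pblocks] {
    set excl [get_property EXCLUDE_PLACEMENT $pb]
    set cont [get_property CONTAIN_ROUTING $pb]
    puts "$pb  EXCLUDE_PLACEMENT=$excl  CONTAIN_ROUTING=$cont"
}
```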

I'll try your methodology for the locked BRAM and follow up on the other post, I can see how SCOPED_TO_CELLS can make things much more clear. 

Moein

0 Kudos