Participant
9,468 Views
Registered: ‎08-28-2009

Vivado 2015.4: new RPM bugs

RPM support in ISE was rock solid. You could depend upon a tiled RPM methodology to quickly implement precisely what you wanted and then use the repeatability of floorplanned RPM layouts to compose large correct designs and quickly close timing in a methodical and predictable way. (For example see the XC7VX690T design here: http://fpga.org/2013/05/24/fpgas-then-and-now/).

 

RPM support in Vivado has been less satisfactory. LUT_MAP was removed, which makes development and maintenance of portable (especially parameterized) technology-mapped LUTs a headache. I can't believe my RTL now contains INIT=hex constants. We didn't even need those in the 1990s FMAP era. RLOC_ORIGIN was removed too, so the RPM placement anchor now comes only from a set_property LOC in the XDC file.
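
To illustrate the INIT point: without LUT_MAP, the truth table has to reach the primitive as an INIT parameter. One way to avoid hand-written hex is to compute INIT from the logic equations with a constant function. This is a minimal sketch, assuming the synthesis tool evaluates Verilog-2001 constant functions for parameter values. The names ANDOR1 and andor_init are hypothetical and not part of the attached test cases.

```verilog
// Hypothetical sketch: derive a LUT6_2 INIT from the logic equations
// instead of writing the hex constant by hand.
module ANDOR1(
    input  a,
    input  b,
    output and_,
    output or_);

    // Constant function: walks all 64 LUT addresses {I5,I4,I3,I2,I1,I0}.
    // O5 reads INIT[31:0]; with I5 tied high, O6 reads INIT[63:32].
    function [63:0] andor_init;
        input unused;                 // Verilog-2001 functions need at least one input
        integer addr;
        reg i0, i1;
        begin
            andor_init = 64'b0;
            for (addr = 0; addr < 64; addr = addr + 1) begin
                i0 = addr[0];
                i1 = addr[1];
                if (addr >= 32)
                    andor_init[addr] = i0 & i1;   // upper half (O6): AND
                else
                    andor_init[addr] = i0 | i1;   // lower half (O5): OR
            end
        end
    endfunction

    // Evaluates to 64'h88888888EEEEEEEE, the constant used in ANDOR8.
    localparam [63:0] INIT = andor_init(1'b0);

    LUT6_2 #(.INIT(INIT))
        ao(.I0(a), .I1(b), .I2(1'b0), .I3(1'b0), .I4(1'b0), .I5(1'b1),
           .O6(and_), .O5(or_));
endmodule
```

The equations stay readable and parameterizable in the RTL, though this still instantiates the primitive explicitly, so it is not a full substitute for LUT_MAP.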

 

Most challenging, however, was that the Vivado RPM "shape builder" would refuse certain legal packs (particularly of LUT RAM) and in 2015.1 would often fail when composing a hierarchical RPM from an RPM column of LUT RAM with other RPM columns of logic. I infer (from errors it generates) that under the hood it tries to build the shape at virtual device site SLICE_X100Y0 -- and of course there may not be a SLICEM thereabouts -- and it doesn't matter that you have a site placement in the XDC file to anchor the LUT RAM column(s) to a valid site.

 

Another problem -- sometimes the tools that optimize a primitive forget to propagate its RLOC to the new primitive. dont_touch helps with that.
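
A minimal sketch of the dont_touch approach, assuming the DONT_TOUCH attribute is written directly on the instantiated primitive next to its RLOC. The module and instance names are hypothetical.

```verilog
// Hypothetical sketch: DONT_TOUCH keeps the tools from optimizing the
// primitive away or replacing it with a new cell that lacks the RLOC.
module KEEP_RLOC(
    input  a,
    input  b,
    output o);

    (* RLOC = "X0Y0", DONT_TOUCH = "true" *)
    LUT2 #(.INIT(4'h8))               // 4'h8: o = a & b
        and2(.I0(a), .I1(b), .O(o));
endmodule
```

The trade-off is that a DONT_TOUCH cell is also excluded from optimizations you might have wanted, so it is probably best applied only to the cells that carry placement constraints.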

 

I have just upgraded from V2015.1 to V2015.4, and some of my older Vivado RPMs now fail in Shape Builder.

 

*** I think a key design tenet of Shape Builder must be: if it is a legal pack, if the designer can hand place it, if it is a valid XDC macro, then it *must* be treated as a valid RPM. Particularly in Vivado 2015.4, Shape Builder does not achieve this. ***

 

Today I spent several hours trying little RPMs to see what works and what doesn't. Here are some findings.

 

1. In general, hierarchical RPMs, particularly ones that involve LUT6_2s or fused LUT5s, often fail Shape Building.

 

2. Shape Builder can fail to pack 16 LUT5s (8 pairs of LUT5s with common 5 inputs -- a legal pack) into a single UltraScale slice -- but it will happily zip together one RPM of 8 LUT5s with another RPM of 8 LUT5s, into the same slice!

 

3. Sometimes the RLOCs get quietly dropped on the floor.

 

I have attached a file with some test cases that show some of these bugs. I will ask my FAE to help file them. Here is an example. (See the ZIP attachment for full details of how to reproduce these passing and failing test cases.)

 

// Four 1-slice RPMs.
// This works -- with XDC you can even constrain the four subinstances ao0, ao8, ao16, ao24
// to a 2x2 slice floorplan as demonstrated in the attachment.
//

(* keep_hierarchy="yes" *)
module OK1(
    input [31:0] a,
    input [31:0] b,
    output [31:0] and_,
    output [31:0] or_);

    ANDOR8 ao0(.a(a[7:0]), .b(b[7:0]), .and_(and_[7:0]), .or_(or_[7:0]));
    ANDOR8 ao8(.a(a[15:8]), .b(b[15:8]), .and_(and_[15:8]), .or_(or_[15:8]));
    ANDOR8 ao16(.a(a[23:16]), .b(b[23:16]), .and_(and_[23:16]), .or_(or_[23:16]));
    ANDOR8 ao24(.a(a[31:24]), .b(b[31:24]), .and_(and_[31:24]), .or_(or_[31:24]));
endmodule

// Now compose this into a 2x2 slice hierarchical RPM.
// This is a legal placement/pack -- but fails with many errors from Shape Builder
//
(* keep_hierarchy="yes" *)
module BUG1(
    input [31:0] a,
    input [31:0] b,
    output [31:0] and_,
    output [31:0] or_);

    (* RLOC="X0Y0" *)
    ANDOR8 ao0(.a(a[7:0]), .b(b[7:0]), .and_(and_[7:0]), .or_(or_[7:0]));
    (* RLOC="X0Y1" *)
    ANDOR8 ao8(.a(a[15:8]), .b(b[15:8]), .and_(and_[15:8]), .or_(or_[15:8]));
    (* RLOC="X1Y0" *)
    ANDOR8 ao16(.a(a[23:16]), .b(b[23:16]), .and_(and_[23:16]), .or_(or_[23:16]));
    (* RLOC="X1Y1" *)
    ANDOR8 ao24(.a(a[31:24]), .b(b[31:24]), .and_(and_[31:24]), .or_(or_[31:24]));
endmodule

// A silly, artificial example of an RPM slice comprising 8 LUT6_2s.
// AND/OR function notwithstanding, this is a key use case for optimal use of FPGA resources.
(* keep_hierarchy="yes" *)
module ANDOR8(
    input [7:0] a,
    input [7:0] b,
    output [7:0] and_,
    output [7:0] or_);

    genvar i;
    generate
        for (i = 0; i < 8; i = i + 1) begin : aos
            (* RLOC="X0Y0" *)
            LUT6_2 #(.INIT(64'h88888888EEEEEEEE))
                ao(.I0(a[i]), .I1(b[i]), .I2(1'b0), .I3(1'b0), .I4(1'b0), .I5(1'b1),
                   .O6(and_[i]), .O5(or_[i]));
        end
    endgenerate
endmodule

 

Implementing BUG1 yields these errors:

=>

"CRITICAL WARNING: [Shape Builder 18-137] Cannot obey LUTNM/HLUTNM constraint for instances ao0/aos[0].ao/LUT6 and ao0/aos[0].ao/LUT5. Shape contains more LUTs than SLICE maximum capacity.

...

CRITICAL WARNING: [Shape Builder 18-140] Failed to build a LUTNM shape for instances ao8/aos[0].ao/LUT6 and ao8/aos[0].ao/LUT5. Failed to add instance ao8/aos[0].ao/LUT5 to a new LUTNM shape. The instance already belongs to a shape and its new location conflicts with the one in the existing shape.

...

CRITICAL WARNING: [Shape Builder 18-146] Failed to build an RLOC shape for set IHST. No placement was found that satisfies the grid spacing requirements for all instances in the set. This may be due to an invalid RLOC constraint causing incorrect column spacing. The macro reference instance is ao0/aos[0].ao/LUT5 and the macro is 2 columns by 2 rows.

"

 

I don't know why Shape Builder refuses this. If there is something I am missing, if there is a simple workaround, I would be most grateful to try it.

 

(I know about XDC macros. I prefer RPMs. When you work with RPMs in RTL, they (should) just work wherever they are instantiated, in any hierarchy or station. Not so for XDC macros.)

 

Thank you for any help or guidance.

6 Replies
Observer
3,407 Views
Registered: ‎05-11-2010

Jan, I am seeing a related issue. I have a large built-up array of small FIR filters. It built right away without issue in ISE 14.7, but Vivado has been a long exercise in whack-a-mole to get a happy compile. After changing my hierarchical components to instantiate CARRY4s instead of the old mux and xor carry components, and nailing down LUTs and flip-flops with BELs, I am getting compiles that end with a routed design (amid some complaints about collisions during placement). It even routes fine as an OOC component.

However, when I try to instantiate that component with routes preserved (as well as with just placement preserved), it generates a bazillion "unroutable" errors within my previously placed and routed design_checkpoint upon reading in the checkpoint, and it errors during placer phase 1 (placer initialization) with:

Phase 1.1 IO Placement/ Clock Placement/ Build Placer Device
ERROR: [Shape Builder 18-147] Failed to build an RLOC shape for set pp_computer_core_i/dsp_i/DSP/GRID/GRDROW[2].GRDCOL[2].ACTIVE.COEFSTR. Reason: For the placement location passed in the bel and site locations are in disagreement..
Phase 1.1 IO Placement/ Clock Placement/ Build Placer Device | Checksum: 1cf55006e
Time (s): cpu = 00:02:50 ; elapsed = 00:02:50 . Memory (MB): peak = 7092.957 ; gain = 0.000
Phase 1 Placer Initialization | Checksum: 1cf55006e
Time (s): cpu = 00:02:51 ; elapsed = 00:02:51 . Memory (MB): peak = 7092.957 ; gain = 0.000
ERROR: [Place 30-99] Placer failed with error: 'Failed to build an RLOC shape. Please see the previously displayed errors.'
Please review all ERROR, CRITICAL WARNING, and WARNING messages during placement to understand the cause for failure.
Ending Placer Task | Checksum: 11043e71a

The earlier errors while reading in the implemented DCP show unroutes, mostly with collisions at the BFFMUX at slice locations where that logic was not placed. Yet the logic still shows in the correct locations in a GUI design view, complete with correct routing.

I am totally scratching my head on this at this point, and was hoping you might have learned more since your post.
Observer
3,401 Views
Registered: ‎05-11-2010

By the way, RLOC_ORIGIN in the VHDL code does work, although with the errors noted above it complains "could not process RLOC_ORIGIN xxxxxx for cell yyyyyyy" if there are any issues. Issues that I know break RLOC_ORIGIN are:

1) No component at the RPM's X0Y0. If there was a component there that got optimized out, that still breaks RLOC_ORIGIN.

2) Competition for resources. If two LUTs both end up mapped to LUT5s or LUT6s for the same site, or if it puts a LUT5 on the carry chain mux select input, then it fails.

Also, if the LUT you specify is changed at all (e.g. an inverter absorbed into a pin, or a pin tied to VCC or GND), it reformulates the LUT and might not assign the correct BEL, in which case the RPM goes poof.
Participant
3,392 Views
Registered: ‎08-28-2009

Hi Ray, thanks for your follow up. It got me to revisit the four bugs I attached in the original post. I find that with Vivado 2017.1:

 

1) BUG1 is fixed -- it honors a 2x2 slice RPM arrangement of LUT6_2s.

 

2) BUG2/3 are not fixed -- if you pour 16 LUT5s that can be mapped into 8 LUT6_2s in an (UltraScale) slice, it fails to find the legal pack. Workaround: if you manually attach BEL="A5LUT", BEL="A6LUT", BEL="B5LUT", etc., this packs and builds the RPM.

 

3) BUG4 is not fixed. It places a two-level hierarchy, e.g. at (0,0), (2,0), (0,2), (2,2), and the placer still loses the hierarchy. One of the first-level RPMs is placed correctly; the others float at no particular relative placement. Perhaps (again) it is something to do with LUT5s.
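
The BEL workaround from item 2 can be sketched as follows for one LUT pair. This is an illustrative guess at the shape of the fix, not the code from the attachment: the module name, instance names and INIT values are hypothetical, and which function goes on the 5LUT versus the 6LUT BEL is an arbitrary choice here. The remaining pairs in the slice would use B5LUT/B6LUT and so on.

```verilog
// Hypothetical sketch: two LUT5s sharing the same five inputs, pinned to the
// two halves of the "A" LUT of one slice with explicit BEL attributes.
module LUT5_PAIR(
    input  [4:0] x,
    output       all_ones,
    output       any_ones);

    (* RLOC = "X0Y0", BEL = "A6LUT" *)
    LUT5 #(.INIT(32'h80000000))       // only address 31 set: 5-input AND
        f(.I0(x[0]), .I1(x[1]), .I2(x[2]), .I3(x[3]), .I4(x[4]), .O(all_ones));

    (* RLOC = "X0Y0", BEL = "A5LUT" *)
    LUT5 #(.INIT(32'hFFFFFFFE))       // every address but 0 set: 5-input OR
        g(.I0(x[0]), .I1(x[1]), .I2(x[2]), .I3(x[3]), .I4(x[4]), .O(any_ones));
endmodule
```

Spelling out the BELs takes the pairing decision away from Shape Builder, at the cost of hard-coding letter assignments that a generate loop would otherwise have to compute.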

 

I see that RLOC_ORIGIN now works although I don't understand how -- when I open the synthesized DCP and try to view RLOC_ORIGIN property on the particular module, it does not show up in the module property view. Perhaps it is there but there is a bug in the module property viewer that it does not display RLOC_ORIGIN properties.
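
For reference, a minimal sketch of anchoring an RPM from RTL. It assumes that RLOC_ORIGIN is accepted as an ordinary Verilog attribute on the RPM instance, which has not been checked against every Vivado version. TOP and the coordinate "X10Y20" are hypothetical; BUG1 is the 2x2 module from the original post.

```verilog
// Hypothetical sketch: place the origin of the BUG1 RPM at an absolute slice
// coordinate from RTL, rather than with set_property LOC in the XDC file.
module TOP(
    input  [31:0] a,
    input  [31:0] b,
    output [31:0] and_,
    output [31:0] or_);

    (* RLOC_ORIGIN = "X10Y20" *)      // arbitrary example coordinate
    BUG1 rpm(.a(a), .b(b), .and_(and_), .or_(or_));
endmodule
```

If the attribute is silently ignored in a given flow, checking the RLOC-related properties on the leaf cells after synthesis may show whether the origin was applied.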

 

With (1) fixed and esp. with RLOC_ORIGIN working, perhaps I can start using RPMs for my CPU datapaths in UltraScale again. (I'll have to check if Xilinx has fixed the "multicolumn RPM with LUT-RAM" bugs.)

 

Also, I still miss the LUT_MAP synthesis constraint.

Participant
3,389 Views
Registered: ‎08-28-2009

Also, Ray, it's been a while, but in the Vivado 2015.4 era, I saw cases where you specify a legal RPM, the placer gives RPM placement errors, then the design builds fine anyway, generating the pack you wanted in the first place.

 

And I saw cases where the design placed and routed OK, but when you opened the implemented design, the GUI gave RPM placement errors.

 

So I put RPMs aside for a while and have been using rectangular pblocks. See e.g. http://fpga.org/grvi-phalanx/ But in targeting AWS F1, which puts the client (custom) logic in a PR region spread across SLRs, even rectangular pblocks are not honored in surprising ways. (Check out SNAPPING_MODE and PR rules.) So for now I'm just constraining my F1 logic to clock region aligned pblocks. Sigh.

 

But now with RLOC_ORIGIN and hierarchical RPMs of LUT6_2s (and presumably LUT6s) apparently working, I may be able to start using RPMs again, in some carefully constrained ways. That will make for better QOR and, more importantly, more consistent implementation runs.

 

Sorry, I don't know what to make of your BFFMUX problem!

Observer
3,327 Views
Registered: ‎05-11-2010

I'm using Vivado 2016.4, driven by customer requirement. I've had no end of troubles trying to use the LUT6_2; I wound up taking those out and substituting BEL'd LUT5 and LUT6 (or smaller LUTs), with success most of the time.

The design I am working on right now has 16 banks, each a summation of 64 4-tap complex FIR filters. Each bank is fed by the same 64 channel inputs. In effect it makes a 64x16 channel crossbar with 4-tap FIR filters at each node. When done without OOC, the RPMs are getting through and it is placing and routing, meeting 500 MHz in a 7VX690T. I haven't been able to coax it to 601 MHz (the BRAM switching limit) with Vivado, though ISE 14.7 has been achieving that from day 1.

I've been seeing the same thing, where the placer throws RPM errors but the design builds fine. A couple of times opening the GUI has complained about placement errors, but when I go and look at the placement and routing in the device window, it is fine.

However, the big issue I am having now is with trying to read the .dcp file for a design that routed fine back into a top-level design as an out-of-context component. I ended up placing pretty much everything with RLOCs and BELs because I wasn't getting anywhere close with pblocks, and the large number of pblocks needed for the 1024 filter nodes was blowing up PAR. It looks like the issue with the BFFMUXes might have to do with LUTs used as wires to FF inputs not paying attention to the existing LUT BELs in the same slice (there are open LUTs that would allow the wire, but the placer/router is assigning a BEL that conflicts with one I locked... I think).
Observer
3,324 Views
Registered: ‎05-11-2010

I also miss LUT_MAP. My new workaround is to write a function that generates the INIT string. For example:

 

function mux2_xor return bit_vector is
    variable i: unsigned(3 downto 0);
    variable vector: bit_vector(15 downto 0);
begin
    -- LUT4 truth table: O = (d0 when sel = '0' else d1) xor I3.
    -- d0/d1 are optionally inverted by invert_A0/invert_A1 (declared outside this snippet).
    for addr in 0 to 15 loop
        i := to_unsigned(addr, 4);
        if invert_A0 then
            i(0) := not i(0); -- d0 input
        end if;
        if invert_A1 then
            i(1) := not i(1); -- d1 input
        end if;
        if i(2) = '0' then -- sel input
            vector(addr) := to_bit(i(0) xor i(3));
        else
            vector(addr) := to_bit(i(1) xor i(3));
        end if;
    end loop;
    return vector;
end function mux2_xor;

-- LUT1 used as a wire: O = I0
function lut_wire return bit_vector is
    variable i: unsigned(0 downto 0);
    variable vector: bit_vector(1 downto 0);
begin
    for addr in 0 to 1 loop
        i := to_unsigned(addr, 1);
        vector(addr) := to_bit(i(0));
    end loop;
    return vector;
end function lut_wire;

 

 

begin

    U1_6: LUT4
        generic map(
            init => mux2_xor)
        port map(
            O  => L6,
            I0 => ax(i), -- d0
            I1 => cx(i), -- d1
            I2 => lcl_sel,
            I3 => bx(i) );

    -- L5 <= bx(i);
    U1_5: LUT1
        generic map(
            init => lut_wire)
        port map(
            O  => L5,
            I0 => bx(i) );

 
