12-02-2015 03:46 PM
RPM support in ISE was rock solid. You could depend upon a tiled RPM methodology to quickly implement precisely what you wanted and then use the repeatability of floorplanned RPM layouts to compose large correct designs and quickly close timing in a methodical and predictable way. (For example see the XC7VX690T design here: http://fpga.org/2013/05/24/fpgas-then-and-now/).
RPM support in Vivado has been less satisfactory. LUT_MAP was removed, which makes developing and maintaining portable (especially parameterized) technology-mapped LUTs a headache. I can't believe my RTL now contains INIT=hex constants; we didn't even need those in the 1990s FMAP era. RLOC_ORIGIN was removed too, so an RPM's placement anchor now comes only from a set_property LOC in the XDC file.
Most challenging, however, was that the Vivado RPM "shape builder" would refuse certain legal packs (particularly of LUT RAM) and in 2015.1 would often fail when composing a hierarchical RPM from an RPM column of LUT RAM with other RPM columns of logic. I infer (from errors it generates) that under the hood it tries to build the shape at virtual device site SLICE_X100Y0 -- and of course there may not be a SLICEM thereabouts -- and it doesn't matter that you have a site placement in the XDC file to anchor the LUT RAM column(s) to a valid site.
Another problem -- sometimes the tools that optimize a primitive forget to propagate its RLOC to the new primitive. dont_touch helps with that.
I have just upgraded from V2015.1 to V2015.4, and some of my older Vivado RPMs now fail in Shape Builder.
*** I think a key design tenet of Shape Builder must be: if it is a legal pack, if the designer can hand place it, if it is a valid XDC macro, then it *must* be treated as a valid RPM. Particularly in Vivado 2015.4, Shape Builder does not achieve this. ***
Today I spent several hours trying little RPMs to see what works and what doesn't. Here are some findings.
1. In general, hierarchical RPMs, particularly ones that involve LUT6_2s or fused LUT5s, often fail Shape Building.
2. Shape Builder can fail to pack 16 LUT5s (8 pairs of LUT5s with common 5 inputs -- a legal pack) into a single UltraScale slice -- but it will happily zip together one RPM of 8 LUT5s with another RPM of 8 LUT5s, into the same slice!
3. Sometimes the RLOCs get quietly dropped on the floor.
I have attached a file with some test cases that show some of these bugs. I will ask my FAE to help file them. Here is an example. (See the ZIP attachment for full details of how to reproduce these passing and failing test cases.)
// Four 1 slice RPMs.
// This works -- with XDC you can even constrain the four subinstances ao0, ao8, ao16, ao24
// to a 2x2 slice floorplan as demonstrated in the attachment.
//
(* keep_hierarchy="yes" *)
module OK1(
    input [31:0] a,
    input [31:0] b,
    output [31:0] and_,
    output [31:0] or_);
    ANDOR8 ao0(.a(a[7:0]), .b(b[7:0]), .and_(and_[7:0]), .or_(or_[7:0]));
    ANDOR8 ao8(.a(a[15:8]), .b(b[15:8]), .and_(and_[15:8]), .or_(or_[15:8]));
    ANDOR8 ao16(.a(a[23:16]), .b(b[23:16]), .and_(and_[23:16]), .or_(or_[23:16]));
    ANDOR8 ao24(.a(a[31:24]), .b(b[31:24]), .and_(and_[31:24]), .or_(or_[31:24]));
endmodule
// Now compose this into a 2x2 slice hierarchical RPM.
// This is a legal placement/pack -- but fails with many errors from Shape Builder
//
(* keep_hierarchy="yes" *)
module BUG1(
    input [31:0] a,
    input [31:0] b,
    output [31:0] and_,
    output [31:0] or_);
    (* RLOC="X0Y0" *)
    ANDOR8 ao0(.a(a[7:0]), .b(b[7:0]), .and_(and_[7:0]), .or_(or_[7:0]));
    (* RLOC="X0Y1" *)
    ANDOR8 ao8(.a(a[15:8]), .b(b[15:8]), .and_(and_[15:8]), .or_(or_[15:8]));
    (* RLOC="X1Y0" *)
    ANDOR8 ao16(.a(a[23:16]), .b(b[23:16]), .and_(and_[23:16]), .or_(or_[23:16]));
    (* RLOC="X1Y1" *)
    ANDOR8 ao24(.a(a[31:24]), .b(b[31:24]), .and_(and_[31:24]), .or_(or_[31:24]));
endmodule
// A silly, artificial example of an RPM slice comprising 8 LUT6_2s.
// AND/OR function notwithstanding, this is a key use case for optimal use of FPGA resources.
(* keep_hierarchy="yes" *)
module ANDOR8(
    input [7:0] a,
    input [7:0] b,
    output [7:0] and_,
    output [7:0] or_);
    genvar i;
    generate
        for (i = 0; i < 8; i = i + 1) begin : aos
            (* RLOC="X0Y0" *)
            LUT6_2 #(.INIT(64'h88888888EEEEEEEE))
                ao(.I0(a[i]), .I1(b[i]), .I2(1'b0), .I3(1'b0), .I4(1'b0), .I5(1'b1),
                   .O6(and_[i]), .O5(or_[i]));
        end
    endgenerate
endmodule
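For anyone wondering where the INIT constant in ANDOR8 comes from, here is a minimal Python sketch (not part of the original attachment) that derives it. It assumes the usual LUT6_2 behavior: O5 reads INIT[31:0] addressed by {I4..I0}, and O6 reads INIT[63:0] addressed by {I5..I0}, so with I5 tied high O6 comes from the upper half. The helper name lut6_2_init is made up for illustration.

```python
def lut6_2_init(o6_func, o5_func):
    """Build a 64-bit LUT6_2 INIT value for the I5=1 dual-output arrangement.

    o5_func and o6_func each take the five inputs (i0..i4) as 0/1 ints
    and return 0 or 1. O5 is stored in INIT[31:0]; O6 (as seen when I5=1)
    is stored in INIT[63:32].
    """
    init = 0
    for addr in range(32):
        bits = [(addr >> k) & 1 for k in range(5)]  # i0..i4
        if o5_func(*bits):
            init |= 1 << addr
        if o6_func(*bits):
            init |= 1 << (addr + 32)
    return init


# ANDOR8 wires I0=a and I1=b; I2..I4 are tied low, so they are ignored here.
andor_init = lut6_2_init(
    o6_func=lambda a, b, *_: a & b,  # O6 drives and_
    o5_func=lambda a, b, *_: a | b,  # O5 drives or_
)
print(f"64'h{andor_init:016X}")  # prints 64'h88888888EEEEEEEE
```

Generating the constant this way is only a stopgap for the missing LUT_MAP: the hex value still has to be pasted into the RTL.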
Implementing BUG1 yields these errors:
=>
"CRITICAL WARNING: [Shape Builder 18-137] Cannot obey LUTNM/HLUTNM constraint for instances ao0/aos[0].ao/LUT6 and ao0/aos[0].ao/LUT5. Shape contains more LUTs than SLICE maximum capacity.
...
CRITICAL WARNING: [Shape Builder 18-140] Failed to build a LUTNM shape for instances ao8/aos[0].ao/LUT6 and ao8/aos[0].ao/LUT5. Failed to add instance ao8/aos[0].ao/LUT5 to a new LUTNM shape. The instance already belongs to a shape and its new location conflicts with the one in the existing shape.
...
CRITICAL WARNING: [Shape Builder 18-146] Failed to build an RLOC shape for set IHST. No placement was found that satisfies the grid spacing requirements for all instances in the set. This may be due to an invalid RLOC constraint causing incorrect column spacing. The macro reference instance is ao0/aos[0].ao/LUT5 and the macro is 2 columns by 2 rows.
"
I don't know why Shape Builder refuses this. If there is something I am missing, if there is a simple workaround, I would be most grateful to try it.
(I know about XDC macros. I prefer RPMs. When you work with RPMs in RTL, they (should) just work wherever they are instantiated, in any hierarchy or station. Not so for XDC macros.)
Thank you for any help or guidance.
11-08-2017 09:09 AM
11-08-2017 09:16 AM
11-08-2017 12:33 PM - edited 11-08-2017 12:48 PM
Hi Ray, thanks for your follow up. It got me to revisit the four bugs I attached in the original post. I find that with Vivado 2017.1:
1) BUG1 is fixed -- it honors a 2x2 slice RPM arrangement of LUT6_2s.
2) BUG2/3 are not fixed -- if you pour 16 LUT5s that can be mapped into 8 LUT6_2s in an (UltraScale) slice, it fails to find the legal pack. Workaround: if you manually attach BEL="A5LUT", BEL="A6LUT", BEL="B5LUT", etc., this packs and builds the RPM.
3) BUG4 is not fixed. It places a two-level hierarchy, e.g. at (0,0), (2,0), (0,2), (2,2), and the placer still loses the hierarchy. One of the first-level RPMs is placed correctly; the others float at no particular relative placement. Perhaps (again) it has something to do with LUT5s.
I see that RLOC_ORIGIN now works, although I don't understand how: when I open the synthesized DCP and try to view the RLOC_ORIGIN property on the particular module, it does not show up in the module property view. Perhaps it is there and a bug in the module property viewer keeps it from being displayed.
With (1) fixed and esp. with RLOC_ORIGIN working, perhaps I can start using RPMs for my CPU datapaths in UltraScale again. (I'll have to check if Xilinx has fixed the "multicolumn RPM with LUT-RAM" bugs.)
Also, I still miss the LUT_MAP synthesis constraint.
11-08-2017 12:43 PM - edited 11-08-2017 12:49 PM
Also, Ray, it's been a while, but in the Vivado 2015.4 era, I saw cases where you specify a legal RPM, the placer gives RPM placement errors, then the design builds fine anyway, generating the pack you wanted in the first place.
And I saw cases where the design placed and routed OK, but when you opened the implemented design, the GUI gave RPM placement errors.
So I put RPMs aside for a while and have been using rectangular pblocks. See e.g. http://fpga.org/grvi-phalanx/. But when targeting AWS F1, which puts the client (custom) logic in a PR region spread across SLRs, even rectangular pblocks are, surprisingly, not always honored. (Check out SNAPPING_MODE and the PR rules.) So for now I'm just constraining my F1 logic to clock-region-aligned pblocks. Sigh.
But now with RLOC_ORIGIN and hierarchical RPMs of LUT6_2s (and presumably LUT6s) apparently working, I may be able to start using RPMs again, in some carefully constrained ways. That will make for better QOR and, more importantly, more consistent implementation runs.
Sorry, I don't know what to make of your BFFMUX problem!
11-08-2017 02:02 PM
11-08-2017 02:08 PM
I also miss LUT_MAP. My new workaround is to write a function that generates the INIT string. For example:
function mux2_xor return bit_vector is
    variable i: unsigned(3 downto 0);
    variable vector: bit_vector(15 downto 0);
begin
    -- output = (d0 when sel = '0' else d1) xor i(3);
    -- invert_A0 / invert_A1 optionally invert d0 / d1 first
    for addr in 0 to 15 loop
        i := to_unsigned(addr, 4);
        if invert_A0 then
            i(0) := not i(0); --d0 input
        end if;
        if invert_A1 then
            i(1) := not i(1); --d1 input
        end if;
        if i(2) = '0' then --sel input
            vector(addr) := to_bit(i(0) xor i(3));
        else
            vector(addr) := to_bit(i(1) xor i(3));
        end if;
    end loop;
    return vector;
end function mux2_xor;
function lut_wire return bit_vector is
    variable i: unsigned(0 downto 0);
    variable vector: bit_vector(1 downto 0);
begin
    for addr in 0 to 1 loop
        i := to_unsigned(addr, 1);
        vector(addr) := to_bit(i(0));
    end loop;
    return vector;
end function lut_wire;
begin
    U1_6: LUT4
        generic map(
            init => mux2_xor)
        port map(
            O => L6,
            I0 => ax(i), --d0
            I1 => cx(i), --d1
            I2 => lcl_sel,
            I3 => bx(i) );
    -- L5 <= bx(i);
    U1_5: LUT1
        generic map(
            init => lut_wire)
        port map(
            O => L5,
            I0 => bx(i) );
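
As a cross-check on the table that mux2_xor builds, here is a small Python sketch that mirrors the same loop and prints the resulting INIT as hex. It is only an illustration: the Python name mux2_xor_init and its keyword arguments are made up, standing in for invert_A0 and invert_A1 above, and it assumes bit addr of the INIT vector holds the output for input pattern addr, as in the VHDL.

```python
def mux2_xor_init(invert_a0=False, invert_a1=False):
    """Mirror of the VHDL mux2_xor loop; returns the 16-bit LUT4 INIT as an int."""
    init = 0
    for addr in range(16):
        # Address bits map to LUT inputs: I0 = d0, I1 = d1, I2 = sel, I3 = xor input.
        i0, i1, sel, i3 = ((addr >> k) & 1 for k in range(4))
        if invert_a0:
            i0 ^= 1
        if invert_a1:
            i1 ^= 1
        data = i1 if sel else i0  # 2:1 mux
        if data ^ i3:             # xor with the fourth input
            init |= 1 << addr
    return init


print(f'X"{mux2_xor_init():04X}"')  # prints X"35CA"
```

The low byte, CA, is the familiar 2:1 mux truth table, and the high byte is its complement because I3 inverts the output.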