BRAM gen pipelining incorrect?



So I have a Block Memory Generator (v8.3) BRAM with the primitive and core output registers enabled, and 3 pipeline stages. Total usage is 60 BRAMs, as the RAM is 16 bits x 128k in size. The indicated mux size is 32x1. The PG058 doc suggests the architecture is a single FF between each 2x1 mux. This is what I wanted, since it should improve timing on the output, as so:



As a bonus, the PG also seemed to suggest the inputs on Port A would be pipelined too:


However, this was not what I saw after synthesis. The inputs to the BRAM were not pipelined at all. Also, the output pipeline was not as described; see this timing path. There is the BRAM output, then LUT-MUX-LUT-SRL-FF. Presumably the SRL is implementing the 3-clock delay from the pipeline that I observe in sim, and the final FF is the core output register. But this isn't what I want at all. Surely an SRL, as opposed to a chain of FFs spread between the mux stages, will not improve my timing? What's going on, guys?
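
To illustrate why simulation alone cannot tell the two structures apart, here is a rough cycle-level sketch in Python (not HDL, and not how the core is actually built). The function names `ff_chain` and `srl_delay` and the reset value are my own placeholders. It only models latency: a 3-deep chain of flip-flops and a 3-deep shift register give the same output sequence, so the 3-clock delay seen in sim is consistent with either one. What it cannot show is the timing difference: with the path above, all of the LUT-MUX-LUT logic sits in front of the SRL, so the SRL adds latency without splitting that combinational path.

```python
from collections import deque


def ff_chain(samples, depth, reset=0):
    """Model `depth` discrete flip-flops in series (assumes depth >= 1)."""
    regs = [reset] * depth
    out = []
    for d in samples:
        out.append(regs[-1])      # value at the end of the chain this cycle
        regs = [d] + regs[:-1]    # each register captures its predecessor
    return out


def srl_delay(samples, depth, reset=0):
    """Model a shift-register delay line of the same depth (assumes depth >= 1)."""
    fifo = deque([reset] * depth, maxlen=depth)
    out = []
    for d in samples:
        out.append(fifo[0])       # oldest stored sample is the output tap
        fifo.append(d)            # appending discards the oldest sample
    return out


data = list(range(1, 9))          # arbitrary stand-in for BRAM read data
chain = ff_chain(data, 3)
srl = srl_delay(data, 3)
print(chain)
print(srl)
```

Both lists come out identical, which is the point: the difference between an SRL and registers interleaved with the mux stages only shows up in the timing report, not in a functional simulation.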






