Anonymous
Not applicable
4,554 Views

MPMC5.04 -> Possible Crazy Bug Found

Hi All,

 

I have been working on a bug in a graphics controller core for the last few days.  The bug leads me to think there is an intermittent problem with MPMC5.04.

 

My system is a V4FX20 with 32-bit DDR connected to MPMC5.04.  I have the following ports:

Port 1 is a 32-bit NPI PIM for the TFT display.

Ports 2 and 3 are PLBs for the PPC405.

Port 4 is a 32-bit NPI PIM for my graphics controller.

 

The graphics controller reads and writes from the TFT's frame buffers stored in memory.  The TFT is 16bit (2 bytes per pixel).

 

I am performing a specific operation: drawing a white vertical line in a buffer from pixel location (3,0) to (3,10).  The video buffer is located at 0x01A00000 in DDR.  I start the first write transaction at 0x01A00006 (byte offset 6 = 3 pixels at 2 bytes per pixel).  I mask the address to align it to a 4-byte boundary and write to that memory location, using WrFIFO_BE = 0xC to select only the target pixel (the upper half of the 32-bit word).
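The address and byte-enable arithmetic described above can be sketched as follows. This is only an illustration, not the controller's actual logic: the function name `npi_write_params` and the line stride of 1280 bytes are hypothetical, and the mapping of byte-enable bit i to the byte at offset i within the word is an assumption chosen because it reproduces the 0xC value quoted in the post.

```python
BYTES_PER_PIXEL = 2        # 16-bit TFT (from the post)
FRAME_BASE = 0x01A00000    # frame buffer base address (from the post)
LINE_STRIDE_BYTES = 1280   # hypothetical: 640 pixels per line


def npi_write_params(x, y, base=FRAME_BASE, stride=LINE_STRIDE_BYTES):
    """Return (byte_addr, aligned_addr, byte_enable) for one 16-bit pixel."""
    byte_addr = base + y * stride + x * BYTES_PER_PIXEL
    aligned_addr = byte_addr & ~0x3   # clear the lower two address bits
    lane = byte_addr & 0x3            # 0 or 2 for 16-bit pixels
    byte_enable = 0x3 << lane         # assumed: BE bit i enables byte offset i
    return byte_addr, aligned_addr, byte_enable


addr, aligned, be = npi_write_params(3, 0)
print(hex(addr), hex(aligned), hex(be))
```

For pixel (3,0) this gives a byte address of 0x01A00006, an aligned address of 0x01A00004, and a byte enable of 0xC, matching the values in the post. A pixel at an even X coordinate lands in the lower half of the word and gets a byte enable of 0x3 under the same assumption.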

 

I understand all about the MPMC special cases, safe write mode, address alignment, etc.  I have been very careful to make sure I meet all the requirements.

 

 

untitled.JPG

(Note:  the NPI_ADDX in the above image, is not actually the addx being fed to the NPI PIM.  The actual addx has the lower two bits masked to zero for alignment)

 

This process works fine in the simulator and generally works well on hardware; however, at random times I see adjacent pixels light up.  This mainly happens when writing 0xFFFFFFFF to memory.  I can run the project in ModelSim and not see this problem.  I have verified that the stray pixels are actually WRITTEN to memory, and that they are not display artifacts or issues caused by the TFT.
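One way to make that memory check repeatable is to dump the frame buffer and scan it for pixels that should not have been touched. The sketch below is a minimal example under stated assumptions: the helper `find_stray_pixels` is hypothetical, the dump is taken to be a raw byte string of 16-bit big-endian pixels with no padding between lines, and the tiny 8x11 buffer is synthetic test data, not a capture from the real system.

```python
import struct


def find_stray_pixels(dump, width, height, expected, background=0x0000):
    """List (x, y, value) for non-background pixels not in `expected`."""
    stray = []
    for y in range(height):
        for x in range(width):
            # Assumed layout: 2 bytes per pixel, big-endian, row-major.
            (value,) = struct.unpack_from(">H", dump, (y * width + x) * 2)
            if value != background and (x, y) not in expected:
                stray.append((x, y, value))
    return stray


# Synthetic example: the vertical line from (3,0) to (3,10) plus one bad write.
width, height = 8, 11
dump = bytearray(width * height * 2)
expected = {(3, y) for y in range(11)}
for (x, y) in expected:
    struct.pack_into(">H", dump, (y * width + x) * 2, 0xFFFF)
struct.pack_into(">H", dump, (5 * width + 2) * 2, 0xFFFF)  # injected stray pixel

stray = find_stray_pixels(bytes(dump), width, height, expected)
print(stray)
```

Running a scan like this after every draw operation would give the exact coordinates of each stray pixel, which helps show whether the error is always the neighbouring half-word of the same 32-bit word.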

 

When I change from BRAM FIFOs on PIM4 to SRL FIFOs, the problem does not occur.

 

This leads me to think there is a bug in MPMC that, at random times, causes a write to memory at an address that is off by 1 byte when using an NPI interface.

 

Can someone please point me in the right direction for debugging such a beast?

 

Thanks

Lachlan.

 

 

 

 

 

 

0 Kudos
6 Replies
austin
Scholar
4,544 Views
Registered: ‎02-27-2008

L,

 

Odd or weird behavior is more usually the result of not enough timing slack.  In the verbose timing report, what sort of slack is reported?

 

What happens when you warm up the device? (Just restrict the airflow with a box, or add gentle heat from a heat source. Watch out that you do not exceed the junction temperature limit, or else you will not be testing whether it works at maximum temperature!)

 

If you (gently) cool the device, does the error disappear?

 

If heat/cold affects the problem, then first look for a timing problem.

 

Timing issues may result from excess system jitter, which may be related to the decoupling capacitors.  Check the bypassing solution and the components used against the user guides.  Check how much ground bounce you have.  It may be that you just need to tighten the timing period constraint by 100 to 200 ps and re-place and route the design.
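As a rough illustration of what tightening the period constraint by 100 to 200 ps means, the sketch below converts a tightened period back into the clock frequency the tools would then target. The 10 ns starting period is a hypothetical value, since the thread does not state the memory clock, and the function names are made up for this example.

```python
def tightened_period_ps(period_ps, margin_ps):
    """Period constraint after removing `margin_ps` of margin (integer ps)."""
    return period_ps - margin_ps


def frequency_mhz(period_ps):
    """Clock frequency in MHz for a period given in picoseconds."""
    return 1_000_000 / period_ps


nominal_ps = 10_000  # hypothetical 10 ns (100 MHz) clock
for margin in (100, 200):
    new_period = tightened_period_ps(nominal_ps, margin)
    print(margin, new_period, round(frequency_mhz(new_period), 2))
```

Under that assumed 10 ns clock, a 200 ps tighter constraint asks place and route to close timing at 9.8 ns, which is about 2% over-constraint.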

 

 

 

Austin Lesea
Principal Engineer
Xilinx San Jose
0 Kudos
Anonymous
Not applicable
4,496 Views

Hi Austin,

Sorry for the belated reply.

 

I found the timing slack in my design to be 3.048 ns.  As I heated the board up to about 50°C, the suspect pixels became less frequent (strange!).

 

I have performed unit testing with the IP cores on several different boards and different devices (V4FX12, V4FX20 and V5FX70T).  All unit tests showed the same problem. (The problem is still classified as DDR writes occurring at addresses which are slightly wrong, and it's possible the NPI_WrBE signal is to blame.)

 

I conclude that there is a problem with the IP core, causing a marginal timing problem when connected to the MPMC.  The IP core in question is a graphics processor, and I use about 26 DSP48s inside this module, running at high frequency.  The V4FX20 gets very hot when all DSP48s are running.  If I replace their functionality with a pure logic implementation, I get less heat, but I still get the same problem.

 

The only solution is to switch to SRL FIFOs for the MPMC's NPI pipeline and avoid using BRAM FIFOs.

 

Cheers Anyhow.

Lachlan.

 

 

0 Kudos
austin
Scholar
4,490 Views
Registered: ‎02-27-2008

Lachlan,

First, at 65nm and smaller, hot is often faster than cold (for interconnect). It is physics, and how the interconnect is designed: it doesn't always get slow as the device heats up, and not every device has hot=fast. Some are hot=little slow, some are hot=no change. Look for unconstrained paths.

But, since it got better (slightly), it is far more likely that it is signal integrity, as the IOs do get slower with temperature (only the core devices behave differently, now).

Look at undershoot, overshoot, ringing, and measure the jitter on a clock from the inside of the device (probe it, and measure it).

As Peter Alfke used to point out: if the car stops, check the gas, don't start tearing apart the engine.

Especially with any DDRn design, signal integrity is either good, or not so good. Data errors are the give-away.


Austin Lesea
Principal Engineer
Xilinx San Jose
0 Kudos
dylan
Xilinx Employee
4,477 Views
Registered: ‎07-30-2007

Lachlan,
Can you show a close-up ChipScope capture of the addreq/addrack signaling? The wrpush should be just after the addrack.

Next step after this triple-check of protocol will likely be to address Austin's concerns. A bug in the V4 controller is fairly unlikely at this point due to its maturity. Since the whole write datapath is synchronous to a single clock, there's also not much that can be done incorrectly from a static timing perspective that would create timing-like failures that wouldn't show up in simulation.

There are some easy ways to play with SI: adjust the IO standard (say from _II to _I and vice versa), try turning DCI on or off (if available), or try the SDRAM memory's reduced drive strength (if available).

Dylan
0 Kudos
Anonymous
Not applicable
4,466 Views

Hi Dylan,

 

I am using 32-bit NPI and 32-bit DDR, and I am using the special case for write byte mode.

 

All of the NPI signals are synchronous out of my core.  The problem is clearly bytes written 'just off' from the target address in the DDR, and this is caused by MPMC.  The problem only occurs on an NPI port connected to my graphics controller; all other aspects of the system work well.  The problem is the same on two different boards (V4FX12 and V4FX20).

 

Moreover, the results are random in nature and not easy to capture, and the problem disappears when SRL FIFOs are used for the NPI pipeline.  Like Austin said, it *must* be a timing problem. I also agree that the design is very mature, and a bug in MPMC5.04 is not really possible.

 

Having said that, I would like to try to find out why this happens, but I am finding it hard to make sense of the MPMC to the level where I can debug with certainty.

 

Cheers

Lachlan.

 

0 Kudos
austin
Scholar
4,463 Views
Registered: ‎02-27-2008

Or a signal integrity problem....

Have you tried to reduce the strength of the IOs switching? Reduced the IO supply voltage? See if the IO switching is creating jitter problems, which cause the timing to be un-met....

Austin Lesea
Principal Engineer
Xilinx San Jose
0 Kudos