04-07-2013 10:36 PM
Can someone tell me if the ECC support on the Zynq has been tested and works, and/or has any known faults?
In particular I am trying to test if it works on the Zedboard and believe I have it enabled (the width set to 16 bits, ECC enabled).
Booting up the kernel with ecc=on :
[ 0.000000] parse_tag_cmdline: tag cmdline (@c000011c)=root=/dev/mmcblk0p3 rootwait
[ 0.000000] setup_arch: boot_command_line= ecc=on default_command_line=console=ttyPS0,115200n8 rw earlyprintk lpj=1000 rw root=/dev/mmcblk0p3 rootwait
[ 0.000000] bootconsole [earlycon0] enabled
[ 0.000000] Memory policy: ECC enabled, Data cache writealloc
[ 0.000000] PERCPU: Embedded 7 pages/cpu @c06ad000 s4832 r8192 d15648 u32768
I am testing it by leaving a patch of memory uncleared; when I read that memory under Linux I get a fault:
[ 3539.090000] Unhandled fault: external abort on non-linefetch (0x1018) at 0x4009eff0
I can then write to that location and read back what I wrote without any faults afterwards - so that looks good.
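For reference, the number in parentheses in the abort message is the fault status value the kernel read from the CPU. Below is a minimal sketch of pulling it apart, assuming the ARMv7 short-descriptor DFSR layout (FS in bits [3:0] plus bit 10, domain in bits [7:4], WnR in bit 11, ExT in bit 12). The function name `decode_dfsr` is made up for illustration and is not part of any kernel or Xilinx API.

```python
def decode_dfsr(fsr):
    """Split an ARMv7 short-descriptor DFSR value into its fields.

    Assumed layout: FS = {bit 10, bits 3:0}, domain = bits 7:4,
    WnR = bit 11 (1 = write access), ExT = bit 12 (external abort type,
    whose meaning is implementation defined).
    """
    fs = ((fsr >> 10) & 1) << 4 | (fsr & 0xF)
    return {
        "fs": fs,
        "domain": (fsr >> 4) & 0xF,
        "write": bool((fsr >> 11) & 1),
        "ext": (fsr >> 12) & 1,
    }


if __name__ == "__main__":
    # Value taken from the "Unhandled fault" line above.
    fields = decode_dfsr(0x1018)
    print("FS=0b{:05b} domain={} write={} ExT={}".format(
        fields["fs"], fields["domain"], fields["write"], fields["ext"]))
```

Under that assumed layout, 0x1018 is a read access with FS = 0b01000, which matches the "external abort on non-linefetch" text in the message, so the abort appears to be reported back over the bus rather than raised by the MMU.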
**BUT** the register CHE_ECC_STATS_REG_OFFSET at 0xf80060f0 is always zero - i.e. the count of ECC errors is not increasing.
As this is the Zedboard it's running a development chip so if there is a hardware fault it might be fixed in the production version - but I haven't seen any mention of a problem in the Errata.
Also does anyone have any info on what support there is in the Linux kernel for DDR ECC?
It appears it just generates a fault (perhaps clearing the above register?) without reporting it anywhere.
Finally, if this hardware has been validated, can someone tell me the procedure used? As correctable errors are automatically fixed - how do you create them in order to confirm they are fixed?
I certainly don't want to wait around for a cosmic ray to hit.
Am I supposed to pull out a radioactive source or something?
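To make the correctable / uncorrectable distinction concrete, here is a small software-only sketch of a SECDED (single error correct, double error detect) code. It is a toy extended Hamming (8,4) code, not the code the Zynq DDR controller actually uses, and the `encode` / `decode` names are purely illustrative. It only shows the principle: flipping one stored bit is silently repaired, flipping two can only be flagged.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming(7,4) positions with parity at 1, 2, 4 and data at 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1  # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(code) & 1  # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error: the syndrome is its position (0 = the parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        status = "uncorrectable"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


if __name__ == "__main__":
    word = encode(0xA)
    one_flip = list(word)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[6] ^= 1
    print(decode(word), decode(one_flip), decode(two_flips))
```

If the hardware behaves along the same lines, memory that was never written would hold check bits unrelated to the data, which would most likely look like a multi-bit error. That is one possible reading of why the uninitialised-memory test ends in an abort; it is a guess, not something confirmed in this thread.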
04-08-2013 08:11 AM
You should ask Avnet, or Digilent (as they design and sell the boards).
Or, post on zedboard.org
Xilinx manufactures the ZC702 pcb, so we could answer questions on what is tested, what is not.
But, since the zedboard is designed and tested by others, you will need to go ask them.
04-08-2013 03:28 PM
Thank you Austin for your reply; I was glad to see the prompt response. But unless I am mistaken, the DDR ECC error count register (0xf80060f0) is in silicon manufactured by Xilinx, and they should know if it was tested and worked. If so, they must have a method of testing it which they could share with a customer, so the customer can confirm that their own code has configured it correctly as well.
If not, perhaps you could tell me who would know? I suspect Avnet / Digilent have never run ECC on any of their boards and the only support is in the Xilinx EDK/SDK tools.
Incidentally I am using the Xilinx FSBL and u-boot to boot the Digilent kernel (with patches in all of them), so it is possible that some interrupt-driven code is jumping in and clearing the register without reporting it. But I haven't seen any sign of such code. I will be looking at the interrupt code today.
What I want from Xilinx is to know whether they have tested the ECC support and can confirm it works (and how I could test it), so that I can use it in my company's product. Or, if it doesn't work, whether we should drop this work on the Zynq chip.
04-09-2013 07:20 AM
As far as I know, everything has been verified.
We just had ARM here yesterday discussing items similar to yours (various registers that are not documented for use by the customer, but used for test). We do not spend millions of $ on a mask set until the RTL is 100% verified, and then the verification and characterization team has a great deal of work to do when the silicon gets here. That was almost two years ago now.
If it isn't in the errata, then it is working.
04-09-2013 03:33 PM
That's what I would have thought - so how about sharing some example validation code?
It doesn't have to compile or even be documented, just enough to show how you prove that ECC is working. Ideally a sequence of ECC register reads/writes that would demonstrate the basics, and I can build on it from there.
At the moment:
1. I can't get the ECC error count register to ever change - perhaps because it doesn't work or because the test of reading uninitialised memory won't trigger it.
2. I always get a data abort - despite registering an interrupt handler for IRQ 92 (listed as Parity / SCU in TRM v1.5 Table 7-3 of Interrupt Chapter).
Can Xilinx offer me any help with this?
04-10-2013 08:45 AM
Yes. You may file a webcase (fastest way to get a response). You may request a visit from your distributor or Xilinx FAE by contacting the distributor, or the local Xilinx sales office.
If there is something standing in your way to build your systems (and place orders for our parts) we are all ears!
If this is an academic project, then help is requested through your professor who is registered with the XUP.
I have asked the verification team here specifically about the DDR ECC error count. As I said, we have just finished our radiation testing of Zynq, so I am interested to get the answer (even though we did not use the register -- we only used the exceptions (interrupts) to note what failed).
I would start by simplifying what is running. For example, if you are trying to see this under the public Linux build, there are so many things that could be preventing you from seeing it that the list is too long to even start to describe. For one, with 96 interrupt types and 7 exceptions, practically none of those is handled properly by the Linux build (it is as small as they could make it, and as generic as possible, dealing only with what it needed to in order to work).
04-10-2013 04:05 PM
Thank you.
Sounds like a web case is the way forward.
No, this is not an academic project; it is a commercial system. Our hardware team wants robustness and has required ECC. I have gotten pretty much everything demonstrated on the Zedboard, and we believe ECC should work if we have the right configuration.
I have used the Xilinx tools to make a new FSBL with ECC and patched u-boot (it needs to write to the DRAM - the FSBL DMA initialisation routine appears to fail that operation with "fatal errors").
I can read the relevant ECC registers under Linux and they look okay, but the error count never changes and I get a data abort exception (no parity interrupt).
I suspect there is some undocumented setup required for proper ECC, e.g. some enabling of the parity interrupt and ECC counters. Or ECC just doesn't work, or is not officially supported (there is no method for a customer to develop / test an ECC system).
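For anyone wanting to repeat the register check under Linux, here is a rough sketch of reading the ECC stats register through /dev/mem. The address 0xf80060f0 is the one quoted earlier in this thread. The split into a corrected count in bits [15:8] and an uncorrected count in bits [7:0] is an assumption that should be checked against the register reference in the TRM. The function names are hypothetical, and the script needs root and a kernel that permits /dev/mem access to that range.

```python
import mmap
import os
import struct

DDRC_BASE = 0xF8006000     # page containing the register quoted in this thread
ECC_STATS_OFFSET = 0xF0    # 0xf80060f0 - DDRC_BASE


def split_ecc_stats(value):
    """Return (corrected, uncorrected) counts from the raw register value.

    ASSUMPTION: corrected count in bits 15:8, uncorrected count in bits 7:0.
    Verify this field layout against the TRM before relying on it.
    """
    return (value >> 8) & 0xFF, value & 0xFF


def read_ecc_stats(devmem="/dev/mem"):
    """Read the raw 32-bit register value via a read-only mapping."""
    fd = os.open(devmem, os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ, offset=DDRC_BASE) as regs:
            # Note: struct does not guarantee a single 32-bit bus access.
            (value,) = struct.unpack_from("<I", regs, ECC_STATS_OFFSET)
    finally:
        os.close(fd)
    return value


if __name__ == "__main__":
    raw = read_ecc_stats()
    corrected, uncorrected = split_ecc_stats(raw)
    print("raw=0x{:08x} corrected={} uncorrected={}".format(
        raw, corrected, uncorrected))
```

Running it before and after touching the uninitialised location would show whether either count moves, assuming nothing else reads or clears the register in between.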
04-11-2013 07:27 AM
Got the confirmation.
Yes, it has been verified, and it does increment.
We have asked for further clarification, and I will post what I hear back.
A webcase is always the fastest way to get things solved: the hotline employees are graded (paid if you will) by how well they handle a case. That means fast, clear, and solved.
So, they are highly motivated.
Cases that linger are automatically brought to their manager's attention, and that sort of attention is bad for them.
Also, with a webcase, I can go into the system and see all the notes taken on the case, and if necessary, intervene.
I act as the Xilinx ombudsman, representing the customer if a customer finds themselves in a situation where they are unhappy, and the support system isn't working. If you will, I am a lightning rod. It isn't a recognized position -- I just took it. In this way I have learned what is working, and what is broken, and helped fix things. For me, it is part of what a professional at my level should do (so Xilinx may be even more successful). It will be 15 years this July, and I think I have worked to help Xilinx be successful, and will continue to do so (as long as they will have me).
If you wish to discuss the latest SEU testing of Zynq (the ARM Cortex(tm) system has been radiation tested), send me an email at email@example.com. We now have FIT rates, AVF, etc.
04-11-2013 11:00 PM
Thanks Austin, I have taken your advice and raised a WebCase #9644504
I took the trouble to verify that the same issues occur without Linux running by accessing the memory under u-boot, and got the same results, except that under u-boot it hangs because u-boot doesn't handle the data abort.
But writing to the uninitialised location prevents it from happening (you can read back the data after the write) and the various ECC registers as described in section 10.8 Error Correction Code of the TRM have the correct values as dumped out from u-boot.
I am guessing not many people are using ECC in their Zynq designs so far, even though the Xilinx tool has an option to generate parameters for that set up.
As an additional issue, I note that DDREcc_Init() isn't running to completion (I suspect this is why I had to modify u-boot to initialise all the DRAM for it to work in ECC mode). Here's some debug output from the FSBL running with DEBUG enabled, showing that the DDREcc_Init() routine isn't working - see the PCAP_DMA_TRANSFER_FAIL message below.
Xilinx First Stage Boot Loader
Release 14.3 Apr 10 2013-10:12:30
Devcfg driver initialized
Silicon Version 1.0
Set the loopback bit
PCAP MCTRL F8007080: 00000010
FATAL errors in PCAP A8131012
PCAP MCTRL F8007080: 00000000
DDR Init done for ECC
Check_ddr_init - wrote 0xAA55AA55 to 0x100000 and got 0xAA55AA55
Check_ddr_init - wrote 0xAA55AA55 to 0x200000 and got 0xAA55AA55
Boot mode is SD
SD: rc= 0
SD Init Done
Flash BaseAddress E0100000
Reboot status register 0x60000000
ImageStartAddress = 00000000
PartitionNumber = 00000000
flash read base addr E0100000, image base 0
image move with partition number 0
ImageAddress = 0x0
Partition hdr for 0: 9C0
Image Word Len:0000DA3C
Data Word Len: 0000DA3C
Partition Word Len:0000DA3C
Partition Start 000009C0, Partition Length 0000DA3C
Source addr 00010A80, Load addr 04000000, Exec addr 04000000
Start transfer data into DDR
Get next partition header
ImageAddress = 0x0
Partition hdr for 1: A00
Next Header dump:
Image Word Len:00000000
Data Word Len: 00000000
Partition Word Len:00000000
There are no more partitions to load
end of partition move, reboot status reg 60000000, Next partition 0
In FsblHookBeforeHandoff function
FSBLStatus = 0x1
U-Boot 2012.04.01-svn2293 (Apr 08 2013 - 02:21:42)
U-Boot code: 04000000 -> 040315E8 BSS: -> 04074368
monitor len: 00074368
TLB table at: 0fff0000
Top of RAM usable for U-Boot at: 0fff0000
Reserving 464k for U-Boot at: 0ff7b000
Reserving 4160k for malloc() at: 0fb6b000
Reserving 36 Bytes for Board Info at: 0fb6afdc
Reserving 120 Bytes for Global Data at: 0fb6af64
New Stack Pointer is: 0fb6af58
Bank #0: 00000000 256 MiB
relocation Offset is: 0bf7b000
WARNING: Caches not enabled
clearing 05000000 (83886080) bytes from 00000000
monitor flash len: 000368F0
Now running in RAM - U-Boot at: 0ff7b000
MMC: SDHCI: 0
Using default environment
Hit any key to stop autoboot: 0
04-11-2013 11:05 PM