04-07-2013 10:36 PM
Can someone tell me whether the ECC support on the Zynq has been tested and works, and whether it has any known faults?
In particular, I am trying to test whether it works on the Zedboard, and I believe I have it enabled (width set to 16 bits, ECC enabled).
Booting up the kernel with ecc=on :
[ 0.000000] parse_tag_cmdline: tag cmdline (@c000011c)=root=/dev/mmcblk0p3 rootwait
[ 0.000000] setup_arch: boot_command_line= ecc=on default_command_line=console=ttyPS0,115200n8 rw earlyprintk lpj=1000 rw root=/dev/mmcblk0p3 rootwait
[ 0.000000] bootconsole [earlycon0] enabled
[ 0.000000] Memory policy: ECC enabled, Data cache writealloc
[ 0.000000] PERCPU: Embedded 7 pages/cpu @c06ad000 s4832 r8192 d15648 u32768
I am testing it by not clearing a patch of memory and when I read that memory under Linux I get a fault:
[ 3539.090000] Unhandled fault: external abort on non-linefetch (0x1018) at 0x4009eff0
I can then write to that location and read back what I wrote without any faults afterwards - so that looks good.
**BUT** the register CHE_ECC_STATS_REG_OFFSET at 0xf80060f0 is always zero - i.e. the count of ECC errors is not increasing.
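For anyone wanting to repeat the check, here is a minimal sketch of polling that register from user space on the target through /dev/mem. The address is the one quoted above. The field layout in decode_ecc_stats (uncorrectable count in bits 7:0, correctable count in bits 15:8) is an assumption and should be checked against the Zynq TRM, and the helper names are hypothetical. It needs root and a kernel that permits /dev/mem access to that range.

```python
import mmap
import os
import struct

# Address quoted in the post for CHE_ECC_STATS_REG_OFFSET.
ECC_STATS_ADDR = 0xF80060F0


def split_address(addr, page_size=mmap.PAGESIZE):
    """Split a physical address into a page-aligned base and an offset."""
    base = addr & ~(page_size - 1)
    return base, addr - base


def decode_ecc_stats(value):
    """Decode the raw register value.

    ASSUMPTION: bits 7:0 hold the uncorrectable error count and
    bits 15:8 hold the correctable error count. Verify against the TRM.
    """
    return {
        "uncorrectable": value & 0xFF,
        "correctable": (value >> 8) & 0xFF,
    }


def read_reg32(addr):
    """Read one little-endian 32-bit word at a physical address via /dev/mem."""
    base, offset = split_address(addr)
    fd = os.open("/dev/mem", os.O_RDONLY | os.O_SYNC)
    try:
        # Map the single page that contains the register, read-only.
        with mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ, offset=base) as mem:
            return struct.unpack_from("<I", mem, offset)[0]
    finally:
        os.close(fd)
```

On the board this would be used as decode_ecc_stats(read_reg32(ECC_STATS_ADDR)), once before and once after reading the uninitialised memory, to see whether either count moves.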
As this is the Zedboard it's running a development chip so if there is a hardware fault it might be fixed in the production version - but I haven't seen any mention of a problem in the Errata.
Also, does anyone have any info on what support there is in the Linux kernel for DDR ECC?
It appears that it just generates a fault (perhaps clearing the above register?) without reporting it anywhere.
Finally, if this hardware has been validated, can someone tell me the procedure used? As correctable errors are automatically fixed - how do you create them in order to confirm they are fixed?
I certainly don't want to wait around for a cosmic ray to hit.
Am I supposed to pull out a radioactive source or something?
05-02-2013 04:42 PM
Jangi, on the Web Case, claims that ECC works - but that the status register might not...
So will await clarification of this with much interest.
The key problem I see is that saying "Here, buy this ECC protected box, sorry we can't demonstrate that it works, just trust us that when that cosmic ray hits it will protect you ..." is not going to make for a good sales line. :-)
Thanks Austin - see what turns up next week.
05-03-2013 07:36 AM
This has caught a few folks' eyes, so there are people actively looking at it right now.
Xilinx never asks anyone to just "trust us." We will always provide what is required for you to solve your problem, and meet your system requirements.
We are the ONLY company in the world to publish our FIT rates, soft and hard, every quarter. And we state the standards we used to get those numbers. Time after time, the numbers get (re-)verified by those who are not able to even "trust" that the published information is accurate and true. So, little things like the ECC counter are of great importance to us, as we recognize our devices get designed into all kinds of safety critical systems, and I, for one, do ride in high speed trains and in airplanes (just two of the many places we get designed into).
I want to know (personally) where we are, and what we are doing. I enjoy talking to the ICE train engineer in Germany, the airplane engineer in Washington state, the 1 MW windmill systems engineer, etc. I like to know that all these systems are using proven hardware and IP to solve their problems, and keep me safe.
05-05-2013 04:19 PM
Thanks Austin - that's a pretty gung-ho attitude.
Love to see you guys deliver on that high bar.
At this stage we are thinking about building with ECC enabled. If we can verify it later, then we can push that feature; otherwise, as long as it is not actually causing problems, it's probably a hidden benefit. So please let me know if we can verify this.
05-06-2013 07:38 AM
I will provide updates as they become available.
I am surprised that you haven't requested the soft error testing results.
Did you miss that offer?
I can't imagine that you are interested in the ECC_counter, and you do not care to know the soft error behavior of the device!
05-06-2013 04:37 PM
Ta. I think I missed that offer. Yes, you are right, I am interested in the soft error testing results and I will ask about them in the Web case. BTW, what are they?
05-06-2013 04:53 PM
Twice as good as the initial estimate.
Sorry, but I had to say that. We had estimates from Arm, and we came in at almost exactly half of their estimates.
That said, it depends on many factors, but using both CPUs, both FPUs, all peripherals, all caches, and OCM memory, we come in under a few hundred FIT (entire processor system). If you have a specific target to meet, by consulting your FAE, we can show you how to get it under your goal. All errors result in exceptions or interrupts. The silent data corruption (no exception, no interrupt) is less than about 15 FIT in our testing. All the details can be discussed with your FAE.
If you have any difficulties getting the info, let me know.
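To put the FIT figures above in more familiar units: one FIT is one failure per 10^9 device-hours, so a FIT rate converts directly into a mean time between failures or an expected failure count for a fleet. The sketch below does that arithmetic. The 15 FIT value is the one quoted above; the 300 FIT value and the fleet size are only illustrative assumptions standing in for "a few hundred", and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def fit_to_mtbf_hours(fit):
    """Mean time between failures in hours for one device at a given FIT rate."""
    return 1e9 / fit


def fit_to_mtbf_years(fit):
    """Mean time between failures in years for one device."""
    return fit_to_mtbf_hours(fit) / HOURS_PER_YEAR


def expected_failures(fit, devices, hours):
    """Expected number of failures across a fleet over a given time span."""
    return fit * devices * hours / 1e9


if __name__ == "__main__":
    # 15 FIT is the silent data corruption figure quoted in the post.
    print(f"15 FIT  -> MTBF of about {fit_to_mtbf_years(15):.0f} years per device")
    # ASSUMPTION: 300 FIT as a stand-in for "a few hundred", 1000 devices.
    print(f"300 FIT -> {expected_failures(300, 1000, HOURS_PER_YEAR):.2f} "
          "expected events per year across 1000 devices")
```

The per-device numbers look enormous, which is why the fleet view (expected events per year across all deployed units) is usually the more useful one when comparing against a system-level target.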
04-29-2015 11:10 PM
Does that mean that I need an interrupt to check the ECC status and counters? I was getting the same data abort exception (no parity) for the BRAM but only when I injected two or more errors. For a single, thus correctable, bit everything worked fine (including counters).
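The single-bit versus two-bit behaviour described here is what a SECDED (single error correct, double error detect) code gives you. As an illustration only, below is a toy extended Hamming(8,4) code in software. It is NOT the code used by the Zynq DDR controller or the BRAM ECC, and all names are hypothetical. It just shows why one flipped bit is silently repaired while two flipped bits can only be flagged.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    bits[1..7] are the Hamming(7,4) positions (parity at 1, 2, 4),
    bits[0] is an overall parity bit covering the other seven.
    """
    bits = [0] * 8
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) & 1
    return bits


def decode(codeword):
    """Return (nibble, status), status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = sum(bits) & 1          # 0 for a valid codeword
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single error and repair it.
        # syndrome == 0 here means the overall parity bit itself flipped.
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detected, not fixable.
        status = "uncorrectable"
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status


def flip(codeword, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    bits = list(codeword)
    for pos in positions:
        bits[pos] ^= 1
    return bits
```

For example, decode(flip(encode(0xA), 5)) gives back the original nibble with status 'corrected', while decode(flip(encode(0xA), 5, 6)) reports 'uncorrectable', which is the software analogue of the data abort.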
05-17-2015 11:27 PM