Contributor
626 Views
Registered: ‎04-18-2017

How does sem ip detect errors?

Hello xilinx engineers.

From PG036, I know that SEM IP has three repair methods, as shown below.

[image: 微信图片_20190623205743.png]

 

1. SEM can fix 1-bit errors in repair mode. How does SEM detect 1-bit errors? Are they detected by FRAME_ECC?

2. In enhanced repair mode, SEM can detect and repair adjacent 2-bit errors. How does it detect 2-bit errors?

3. In replacement mode, SEM can detect and fix all errors. How does it detect multiple errors?

 

Thank you

13 Replies
Moderator
549 Views
Registered: 06-05-2013

Re: How does sem ip detect errors?

It uses the FRAME_ECC primitive to detect those errors. You can refer to the following AR: https://www.xilinx.com/support/answers/54350.html

Thanks
Harshit

538 Views
Registered: 09-17-2018

Re: How does sem ip detect errors?

The way it works:

The readback CRC is found not to match, so at least one bit is upset. (100% of upsets of fewer than 32 bits are detected, and larger upsets escape detection with a probability of only 1 in 2^32, so basically any number of upsets is detected.)

The SEM IP then checks each FRAME_ECC syndrome to find the frame in error. If the syndrome indicates a single error, SEM flips that bit back. If the CRC still does not clear, the upset was really a 3, 5, 7, ... bit flip (odd, greater than 1), so SEM gives up, reports uncorrectable status, and halts (unless frame replace is enabled, in which case it just rewrites the whole frame).

If the syndrome indicates an even number of flips, SEM IP looks for the two adjacent bits and flips them back (it has to do this by trial and error, as it does not know exactly which pair flipped). If the upset was really 4, 6, 8, ... bits, it again gives up unless set for frame replace.
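
To make the flow above concrete, here is a minimal Python sketch. It is not the SEM IP implementation: the toy check value (XOR of 1-based bit positions plus overall parity), the use of `zlib.crc32` as the CRC, the 32-bit frame, and all function names are assumptions chosen only to illustrate the "syndrome proposes, CRC confirms" idea.

```python
import zlib

def crc(frame):
    # Assumption: zlib.crc32 stands in for the device readback CRC.
    return zlib.crc32(bytes(frame))

def check_bits(frame):
    """Toy SECDED check value: XOR of 1-based indices of set bits, plus overall parity."""
    pos = 0
    for i, bit in enumerate(frame, start=1):
        if bit:
            pos ^= i
    return pos, sum(frame) & 1

def flip(frame, *indices):
    for i in indices:
        frame[i - 1] ^= 1

def repair(frame, golden_check, golden_crc, enhanced=True):
    """Return (status, frame) after attempting a repair on a copy of the frame."""
    frame = list(frame)
    if crc(frame) == golden_crc:
        return "clean", frame
    pos, parity = check_bits(frame)
    s_pos, s_par = pos ^ golden_check[0], parity ^ golden_check[1]
    n = len(frame)
    if s_par == 1 and 1 <= s_pos <= n:
        # Odd number of flips: assume one, flip it back, let the CRC confirm.
        flip(frame, s_pos)
        if crc(frame) == golden_crc:
            return "corrected 1 bit", frame
        flip(frame, s_pos)  # it was 3, 5, ... flips: undo the wrong guess
    elif s_par == 0 and s_pos != 0 and enhanced:
        # Even number of flips: try every adjacent pair consistent with the syndrome.
        for i in range(1, n):
            if i ^ (i + 1) == s_pos:
                flip(frame, i, i + 1)
                if crc(frame) == golden_crc:
                    return "corrected 2 adjacent bits", frame
                flip(frame, i, i + 1)
    return "uncorrectable", frame

def upset(frame, *indices):
    damaged = list(frame)
    flip(damaged, *indices)
    return damaged

golden = [1 if i % 3 == 0 else 0 for i in range(32)]
golden_check, golden_crc = check_bits(golden), crc(golden)

for flips in [(), (5,), (10, 11), (3, 9, 20)]:
    status, _ = repair(upset(golden, *flips), golden_check, golden_crc)
    print(flips, "->", status)
```

In this toy model several adjacent pairs share the same syndrome, which is why the pair search has to be confirmed by the CRC rather than by the syndrome alone.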

Looking at beam test data, I discovered that when more than one bit is upset, the upsets are adjacent, and that more than 2 is infrequent, so I invented the algorithm to fix it. Now Xilinx gets to support it and explain it (which they seem to be learning how to do, albeit slowly).

Note that Xilinx FPGA devices are THE ONLY devices which can detect and correct 100% of upsets. No one else has been able to do this (Intel claims it works in Stratix 10, but no 3rd party testing has been done; the same claim was made for S5, but beam testing found that it did not work). This is why the devices get used in security and safety applications (like the cockpit controls in a Boeing 757).

Would you fly in a plane that had devices that couldn't tell you they were OK?  Good news is the FAA will not certify systems that cannot deal with soft errors.

l.e.o.

Contributor
479 Views
Registered: 04-18-2017

Re: How does sem ip detect errors?

Thank you very much for your reply. I have been busy for a while and did not check for updates in time, sorry. Your answer is very helpful to me, but I still don't quite understand it.

1. In repair mode, it can only detect and correct a single-bit error in a frame using the SEC (Single Error Correct) algorithm, or detect double-bit errors using the DED (Double Error Detect) algorithm. As far as I know, the FRAME_ECC syndrome can detect and correct one error because synword[6:0] and synbit[4:0] tell me which word and which bit was upset. Thus, we can repair it easily with a read-modify-write of the faulty frame. If a double error is detected, syndrome[12] will equal zero and syndrome[11:0] will be non-zero. In this case, I don't know which bits have flipped; I only know that a double upset has occurred. Therefore, the replacement method is needed to recover from a double upset.

2. In replacement mode, if upsets are detected, a golden frame is read from external memory to replace the faulty frame. But how does the SEM IP know which frame was damaged? Single or double upsets can be detected by FRAME_ECC, but what about 3, 4, 5, 6, 7, ... upsets? How does the SEM IP know which frame was damaged?
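
The syndrome interpretation described in item 1 can be written as a small decision function. This is a sketch that follows only the description in this post (bit 12 as the overall parity flag, bits 11:0 as the error code); the function name and return strings are made up for illustration, and the bit assignments should be checked against the FRAME_ECC documentation for the device in use.

```python
def classify_syndrome(syndrome: int) -> str:
    """Classify a 13-bit SECDED syndrome as described in the post above."""
    parity_flag = (syndrome >> 12) & 1  # syndrome[12]
    code = syndrome & 0xFFF             # syndrome[11:0]
    if parity_flag == 0 and code == 0:
        return "no error"
    if parity_flag == 1:
        # Odd number of flips; a SECDED code treats this as one correctable bit.
        return "single-bit error (correctable)"
    # Even number of flips with a non-zero code: detected but not locatable.
    return "double-bit error (detect only)"

for s in (0x0000, 0x1123, 0x0123):
    print(hex(s), "->", classify_syndrome(s))
```

An upset of 3 or more bits necessarily lands in one of these same three classes, which is why the syndrome on its own cannot be trusted beyond two flips.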

This problem has been bothering me for a long time, and I am getting desperate.

Contributor
469 Views
Registered: 04-18-2017

Re: How does sem ip detect errors?

Thank you for your answer, I have seen this before.
But I still don't understand how multiple errors, such as 3, 4, 5, 6, 7, 8, ..., are detected. Since SEM IP can use the replacement method to repair errors, there must be a way to detect multiple errors. Previously, I thought that checking whether syndrome[12:0] is non-zero was enough to detect any error. But I was wrong, because FRAME_ECC is only guaranteed to detect up to two errors.
Do you know how the SEM IP detects multiple errors?

455 Views
Registered: 09-17-2018

Re: How does sem ip detect errors?

The CRC32 detects 100% of errors of up to 31 bits,

and for errors of more than 31 bits, only 1 in 2^32 goes unrecognized.
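
A quick way to see this kind of behaviour on a small scale is to flip bits in a buffer and compare CRC values. The sketch below uses Python's `zlib.crc32` (the IEEE 802.3 polynomial) purely as a stand-in; the readback CRC in the device may use a different polynomial, and the 64-byte buffer and helper name are arbitrary choices for illustration.

```python
import zlib

def flip_bits(data: bytes, positions) -> bytes:
    """Return a copy of data with the given bit positions inverted."""
    buf = bytearray(data)
    for p in positions:
        buf[p // 8] ^= 1 << (p % 8)
    return bytes(buf)

frame = bytes(range(64))          # arbitrary test pattern
good = zlib.crc32(frame)
nbits = len(frame) * 8

# Every single-bit flip and every adjacent two-bit flip changes the CRC.
single_detected = all(
    zlib.crc32(flip_bits(frame, [p])) != good for p in range(nbits)
)
adjacent_detected = all(
    zlib.crc32(flip_bits(frame, [p, p + 1])) != good for p in range(nbits - 1)
)

# For a 32-bit CRC, a random large error pattern slips through about 1 time in 2^32.
miss_probability = 2.0 ** -32

print(single_detected, adjacent_detected, miss_probability)
```

Note that the CRC only says "something changed"; it gives no hint about where, which is why it is paired with the per-frame ECC.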

It is a 'turbo code', if you will, using BOTH the FRAME_ECC SECDED code AND the CRC32 to be essentially bulletproof.  As a student of Claude Shannon's students (two of my information theory professors at Berkeley were at MIT with Claude for their PhDs), I take especial delight in getting it right (which I did here).

l.e.o.

 

Contributor
440 Views
Registered: 04-18-2017

Re: How does sem ip detect errors?

Oh, that sounds amazing! You must be a very knowledgeable person. I am very grateful to you for clearing up my confusion. I am new to turbo codes. Can you explain how it uses FRAME_ECC and CRC32 to detect errors? Or can you point me to some reference documents?

As you said, CRC32 can detect up to 31 errors, so why is replacement said to fix arbitrary errors? For example, if more than 32 bits are upset, how is that detected?

434 Views
Registered: 09-17-2018

Re: How does sem ip detect errors?

CRC does not correct errors.

It only tells you that there are one or more errors, somewhere.

This is called the 'outer code'. It tells us either that we have errors or that we have 0 errors.

The inner code is a standard Hamming single error correcting, double error detecting (SECDED) code.  It is unable to tell us anything more.  Because we know from beam testing that errors are in a single bit roughly 90% of the time, the FRAME_ECC correcting value (syndrome) most likely points to the bit in the frame that flipped, so we flip it back.  After doing so, the syndrome should be zero, and re-running the CRC32 should indicate no errors.  If the syndrome still indicates an error, it most likely means the adjacent bit flipped as well (a two-bit multiple-bit upset).  The SEM IP tries to fix and re-check the adjacent bits (it does not know whether left or right).  If it succeeds (syndrome is zero, CRC32 is clean), we are done correcting. At that point we have fixed >97% of all possible errors.  If not, SEM IP halts in error.

The adjacent-bit fix is called the enhanced repair feature (if you enable it).  If you are worried about the last 3% of errors (3, 4 or more adjacent upsets), you may use the frame replace feature of the SEM IP, which fixes 100% of upsets but requires a separate flash memory device to hold the frame data.
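
For readers new to SECDED, a tiny worked example may help. The sketch below is a textbook extended Hamming(8,4) code in Python, not the FRAME_ECC code itself (which protects a whole configuration frame and has a different layout); the function names are invented for illustration. It shows one flip being corrected, two flips being detected, and three flips being silently mis-corrected, which is the case the outer CRC exists to catch.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Positions 1, 2, 4 hold Hamming parity, 3, 5, 6, 7 hold data,
    and position 0 holds overall parity.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) & 1
    code[0] = sum(code[1:]) & 1
    return code

def decode(code):
    """Return (status, data) using single-error-correct, double-error-detect rules."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    overall = sum(code) & 1
    if syndrome == 0 and overall == 0:
        status = "no error"
    elif overall == 1:
        code[syndrome] ^= 1  # syndrome 0 means the overall parity bit itself flipped
        status = "corrected single-bit error"
    else:
        status = "double-bit error detected"
    return status, [code[3], code[5], code[6], code[7]]

def with_flips(code, *positions):
    damaged = list(code)
    for p in positions:
        damaged[p] ^= 1
    return damaged

data = [1, 0, 1, 1]
word = encode(data)
for flips in [(), (5,), (2, 6), (1, 2, 4)]:
    status, decoded = decode(with_flips(word, *flips))
    print(flips, status, decoded == data)
```

The three-flip case reports a "corrected" single-bit error but returns the wrong data, so a second, independent check over the result is needed before trusting the repair.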

You can search on the terms above to learn more (syndrome, hamming code, SECDED, etc.).

You may also lookup my patents at uspto.gov (Austin Lesea, inventor name).

l.e.o.

Contributor
421 Views
Registered: 04-18-2017

Re: How does sem ip detect errors?

Thank you for your answer.

I know how FRAME_ECC and CRC32 work. FRAME_ECC is a SECDED code. It does not guarantee detection of errors of more than 2 bits, although it can sometimes detect them, so its detection results for more than 2 bits are unreliable. The CRC32 detects 100% of errors of up to 31 bits, but it does not tell us which frame has been damaged. This is the problem that bothers me: I don't understand how SEM knows which frame is in error in frame replace mode. Maybe I need to read your patents carefully to get the answer.

Thanks again

Contributor
404 Views
Registered: 04-18-2017

Re: How does sem ip detect errors?

1. On page 20 of PG036, I found a description of FRAME_ECC. It says FRAME_ECC can reliably detect all odd-count errors. This confuses me, because FRAME_ECC is only guaranteed to detect errors of up to 2 bits, and detection of more than 2 bits is unreliable. Do you know why?

 

 

[image: 微信图片_20190801175340.png]

2. I found the error correction latency on page 21 of PG036. The latency differs between repair modes. For example, repairing 1 bit in repair mode takes 610 microseconds at 100 MHz. I don't know how this is calculated. In my opinion, it should be about equal to the sum of the times to read a frame, write the commands, and write a frame.

I roughly calculated it:

read one frame: 202 clock cycles

write command: about 100 clock cycles

write one frame: 202 clock cycles

Assuming the clock period is 10 ns, the total error correction latency should be (202+100+202)*10 ns = 5040 ns = 5.04 us. The two figures differ by two orders of magnitude.
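
The arithmetic above can be checked in a few lines of Python. The cycle counts are the rough estimates from this post, not figures taken from PG036; only the 610 us value is quoted from the product guide.

```python
CLOCK_PERIOD_NS = 10  # 100 MHz clock

# Estimated cycle counts from the post (assumptions, not documented values).
cycles = {
    "read one frame": 202,
    "write command": 100,
    "write one frame": 202,
}

total_cycles = sum(cycles.values())
estimate_us = total_cycles * CLOCK_PERIOD_NS / 1000
documented_us = 610  # repair-mode latency quoted from PG036 in the post
ratio = documented_us / estimate_us

print(f"{total_cycles} cycles = {estimate_us:.2f} us; documented / estimate = {ratio:.0f}x")
```

The ratio comes out at a little over 120, which matches the "two orders of magnitude" gap noted above.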

I am confused, do you know why?

[image: 微信图片_20190801175300.png]

 

392 Views
Registered: 09-17-2018

Re: How does sem ip detect errors?

OK,

As I am no longer a Xilinx employee (their loss), I do not need to explain anything further to you.  I am just an engineer who enjoys teaching what I know (while I have my morning coffee). You may request a reliability briefing under NDA if you wish to learn more about how I engineered the SEU mitigation solution.  Or, you could actually go look up why error correcting codes behave as stated (as no one who knows remains at Xilinx any longer, etc.).

l.e.o.

Contributor
355 Views
Registered: 04-18-2017

Re: How does sem ip detect errors?

Thank you very much for your answers, which are very useful to me.

I don't think it's a good thing that you left Xilinx, at least not for us newcomers, because I haven't seen engineers as knowledgeable and patient as you. For Xilinx, I dare say this must be a huge loss.

Dear friend, thank you again.

334 Views
Registered: 09-17-2018

Re: How does sem ip detect errors?

g@w,

They lost an engineer, but gained a new customer (:  All in all, I am much happier where I am now, doing far more useful work, and enjoying it immensely.  My 20 years at Xilinx have created incredible value for me and my new employer.  After being part of IC designs from 220nm through 7nm, with 100 Xilinx patents in my name, it was time for me to go be creative somewhere else.  After all, Moore's Law is dead, so my stint as an IC designer at Xilinx really had to end ...

Thank you for your kind words,

l.e.o.

 

Contributor
329 Views
Registered: 04-18-2017

Re: How does sem ip detect errors?

I still have to say that this is their loss. I wish you good health and happiness every day.