<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>Coding Relic - Latest Comments in Soft Errors Are Hard Problems</title><link>http://codingrelic.disqus.com/</link><description>Random Musings about Software in an Embedded World</description><atom:link href="https://codingrelic.disqus.com/soft_errors_are_hard_problems/latest.rss" rel="self"></atom:link><language>en</language><lastBuildDate>Wed, 08 Oct 2014 19:29:28 -0000</lastBuildDate><item><title>Re: Soft Errors Are Hard Problems</title><link>http://codingrelic.geekhold.com/2009/09/soft-errors-are-hard-problems.html#comment-1626162254</link><description>&lt;p&gt;Hi Baruch&lt;/p&gt;&lt;p&gt;The first step in understanding the source of the errors is to get a physical bitmap of the error locations in your DIMM and plot them on a relative map centered on the nearby bump (I'm assuming here that you are using a flip-chip package). This map of errors might tell you a lot about their origin. Keep in mind that alpha particles from natural radioisotopes won't travel beyond 100 microns. So you are correct: if alpha particles are the problem, they have to come from within the packaging. Check the materials in the vicinity of the die. These materials can be tested individually in an alpha counter; that's usually how it's done. You'll have to talk to your package house and ask whether they changed materials from one chip lot to another, since you're seeing chips with several errors and others that are error free. Try tracking manufacturing lots as another clue. It is correct to state that testing the whole package for alpha particle emission won't help.&lt;br&gt;Cosmic rays (high-energy neutrons) and thermal neutrons (only if there's an abundance of boron-10 in the die and packaging) are particles external to the chip.
Don't worry about betas.&lt;/p&gt;&lt;p&gt;Olivier&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Olivier</dc:creator><pubDate>Wed, 08 Oct 2014 19:29:28 -0000</pubDate></item><item><title>Re: Soft Errors Are Hard Problems</title><link>http://codingrelic.geekhold.com/2009/09/soft-errors-are-hard-problems.html#comment-1623456823</link><description>&lt;p&gt;Alpha particles are so large that they tend not to penetrate well, so the packages of nearby chips will protect them from alpha particles emitted elsewhere within the room.&lt;/p&gt;&lt;p&gt;Beta particles (electrons) and cosmic rays can penetrate the package of chips, but those are from the solar wind not radioactive decay of anything local.&lt;/p&gt;&lt;p&gt;I'm not sure that a radiation counter will detect alpha particles from the packaging materials. The situation I've seen is with the bonding agent emitting alpha particles, which is a "goo" inside the package which cushions the die and the wires between the silicon die and the pins on the package. The particles which hit the silicon die can cause errors, while the plastic of the package surrounding the goo would tend to absorb the rest.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">DGentry</dc:creator><pubDate>Tue, 07 Oct 2014 09:49:09 -0000</pubDate></item><item><title>Re: Soft Errors Are Hard Problems</title><link>http://codingrelic.geekhold.com/2009/09/soft-errors-are-hard-problems.html#comment-1623025060</link><description>&lt;p&gt;I see plenty of soft errors in DIMMs and PCIe devices. The distribution is such that some components see no soft errors and others are seeing plenty. I believe that the external forces (various particles) are not even remotely relevant and it points me more to faulty devices.&lt;/p&gt;&lt;p&gt;The theory of radioactive components is super interesting in that regard and I'm wondering if there is a way to check for it. 
What sort of options do I have to monitor for radioactive emissions in devices? Will they mean much if I can only test outside the cases, beside the racks? Are the failures expected to happen in the device with the problematic compound or will it affect (more weakly) nearby devices as well?&lt;/p&gt;&lt;p&gt;Just noticed the article is 5 years old... doh. I still wonder about my above questions if you care to think about that.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Baruch Even</dc:creator><pubDate>Tue, 07 Oct 2014 04:24:23 -0000</pubDate></item><item><title>Re: Soft Errors Are Hard Problems</title><link>http://codingrelic.geekhold.com/2009/09/soft-errors-are-hard-problems.html#comment-20077746</link><description>&lt;p&gt;Thank you for the detailed information!&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">DGentry</dc:creator><pubDate>Wed, 14 Oct 2009 18:50:57 -0000</pubDate></item><item><title>Re: Soft Errors Are Hard Problems</title><link>http://codingrelic.geekhold.com/2009/09/soft-errors-are-hard-problems.html#comment-20076662</link><description>&lt;p&gt;Very good article Denton, thanks!&lt;/p&gt;&lt;p&gt;Actually cosmic rays don't come from the Sun, but from outer space. It seems counterintuitive, but when there's a solar eruption, it results in fewer cosmic rays hitting the ground. &lt;br&gt;Some of the industry's main concerns right now are:&lt;br&gt;- multi-bit upsets in memories (mostly SRAM): one particle creates several bit flips. As you know, most advanced ECC detects two errors and corrects one per word. The memory architecture should be designed carefully (using interleaving or scrambling) to avoid multi-bit flips within the same word.
But as you say, it might not be feasible for small memory instances spread out within the ASIC.&lt;br&gt;- more dramatic than the previous effect: the increased sensitivity of FFs and registers to soft errors at smaller technologies (we're talking 65nm here). Mitigation techniques are not as obvious as for memories and can cost a lot of the designer's area and power budget.&lt;br&gt;- assessing derating: not all errors will actually affect the functionality of the circuit. Either the upset occurred in a part of the circuit not involved in the ongoing function, or the error has been masked along its propagation path by either timing or logic constraints. Understanding derating for large and complex chips is not a simple problem!&lt;/p&gt;&lt;p&gt;A good overview of the source of the problem can be found in this quick YouTube video: &lt;a href="http://www.youtube.com/watch?v=pXc8Xh_0WJo" rel="nofollow noopener" target="_blank" title="http://www.youtube.com/watch?v=pXc8Xh_0WJo"&gt;http://www.youtube.com/watc...&lt;/a&gt;  &lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Olivier</dc:creator><pubDate>Wed, 14 Oct 2009 18:23:21 -0000</pubDate></item><item><title>Re: Soft Errors Are Hard Problems</title><link>http://codingrelic.geekhold.com/2009/09/soft-errors-are-hard-problems.html#comment-16883212</link><description>&lt;p&gt;Here's a reference: "How Cosmic Rays Cause Computer Downtime"&lt;br&gt;&lt;a href="http://www.ewh.ieee.org/r6/scv/rl/articles/ser-050323-talk-ref.pdf" rel="nofollow noopener" target="_blank" title="http://www.ewh.ieee.org/r6/scv/rl/articles/ser-050323-talk-ref.pdf"&gt;http://www.ewh.ieee.org/r6/...&lt;/a&gt;&lt;br&gt;See slide 34.&lt;/p&gt;&lt;p&gt;Search for [stuck latch cosmic ray] on Google.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Some Dude</dc:creator><pubDate>Fri, 18 Sep 2009 13:45:34 -0000</pubDate></item><item><title>Re: Soft Errors Are Hard
Problems</title><link>http://codingrelic.geekhold.com/2009/09/soft-errors-are-hard-problems.html#comment-16866791</link><description>&lt;p&gt;That is very interesting, I had not heard of cases requiring a power&lt;br&gt;cycle to recover after a soft error.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">DGentry</dc:creator><pubDate>Fri, 18 Sep 2009 06:49:04 -0000</pubDate></item><item><title>Re: Soft Errors Are Hard Problems</title><link>http://codingrelic.geekhold.com/2009/09/soft-errors-are-hard-problems.html#comment-16861009</link><description>&lt;p&gt;One particularly horrible failure mode I've heard about is a "stuck bit" where the value in the bit cannot be changed after a particle hits it, requiring a power cycle.&lt;/p&gt;&lt;p&gt;I work for a company that designs and sells equipment that contains various ASICs supplied by third parties. We test our products and the ASICs they contain by renting time in a facility that has a source of high energy particles (e.g. neutrons), placing our products in the path of the particle beam, and then operating the product with the beam aimed at various ASICs. The results are interesting.&lt;/p&gt;&lt;p&gt;Perhaps this is an extreme way to test the code that handles parity errors in various memories on the ASICs, but until you do it, how do you know that the ASIC either detects parity errors, or in the case of ECC, corrects them? Some ASIC vendors are better than others.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Some Dude</dc:creator><pubDate>Fri, 18 Sep 2009 01:19:20 -0000</pubDate></item></channel></rss>