A few years ago I designed a way to detect bit-flips in Firefox crash reports and last year we deployed an actual memory tester that runs on user machines after the browser crashes. Today I was looking at the data that comes out of these tests and now I'm 100% positive that the heuristic is sound and a lot of the crashes we see are from users with bad memory or similarly flaky hardware. Here's a few numbers to give you an idea of how large the problem is. 🧵 1/5
What makes Firefox more susceptible to bit flips than any other software? Wouldn’t that mean that 10% of all software crashes are caused by bit flips, and it just depends on what software happens to be running when one occurs?
Programs that use more memory could be slightly more susceptible to this sort of thing: if a bit gets randomly flipped somewhere in a computer’s memory, the flip is more likely to land in an application with a large RAM footprint than in one with a small RAM footprint.
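A quick back-of-the-envelope sketch of that footprint argument. All numbers here (machine RAM, background flip rate, footprints) are made up purely for illustration, not measured rates:

```python
# If random bit flips land uniformly across physical memory, the share
# that hits a given process scales with its resident footprint.

FLIPS_PER_GB_PER_DAY = 0.01  # assumed background flip rate (made up)

def expected_flips_per_day(footprint_gb: float) -> float:
    """Expected flips landing inside a process's resident memory."""
    return footprint_gb * FLIPS_PER_GB_PER_DAY

small_app = expected_flips_per_day(0.1)  # hypothetical 100 MB utility
browser = expected_flips_per_day(4.0)    # hypothetical 4 GB browser session

# The browser absorbs 40x the flips of the small app,
# purely because it occupies 40x the memory.
print(round(browser / small_app))  # -> 40
```

So even with identical code quality, the big-footprint program would show up in bit-flip crash statistics far more often.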
I’m still surprised the percentage is this high.
This checks out with Linus Torvalds saying that most OS crashes, across both Linux and Windows, are caused by hardware issues, and it's also why he uses ECC RAM.
Honestly, yeah, it 100% checks out.
I have a device with ECC RAM, and I can keep it online with applications running for well over 18 months with no stability issues.
However, both my work computers and my personal computer start to become unstable after about 15 to 20 days, and they degrade over the course of 1 to 2 years (with a considerable increase in the number of corrupt system files).
Firefox and Chrome usually start to become unstable after about a week if their memory usage is really high.
Can confirm: my Linux server with ECC RAM has 1,040 days of uptime now without a single issue.
I don’t think they’re arguing that Firefox is more susceptible to bit flips. They’re trying to say that their software is “solid” enough that a significant number of the reported crashes are due to faulty hardware, which is essentially out of their control.
If other software used the same methodology, you could probably use the numbers to statistically compare how “solid” the code base is between the two programs. For example, if the other software found that 20% of their crashes were caused by bit flips, you could reasonably assume that the other software is built better because a smaller portion of their crashes is within their control.
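That comparison logic can be sketched in a few lines. The key (and big) assumption is that bit-flip crashes happen at roughly the same absolute rate for both programs, since both run on the same population of user hardware; the 10% and 20% figures are the hypothetical shares from the example above:

```python
# Given the shared bit-flip crash rate r, a program whose crash reports
# are fraction `bitflip_share` bit flips has total crash rate r / share,
# of which the software-caused part is r * (1 - share) / share.

def software_crash_rate(bitflip_share: float, r: float = 1.0) -> float:
    """Crashes per unit time caused by the software itself. Units of r
    cancel in ratios, so r=1 is fine for comparisons."""
    return r * (1.0 - bitflip_share) / bitflip_share

firefox = software_crash_rate(0.10)  # 10% of reports are bit flips
other = software_crash_rate(0.20)    # 20% of reports are bit flips

# The program with the higher bit-flip share crashes less on its own:
print(round(firefox / other, 2))  # -> 2.25
```

Under that assumption, the program reporting 20% bit flips causes only about 1/2.25 as many of its own crashes as the one reporting 10%.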
Interesting metric to measure, but since I have no reference for how many crashes are caused by bit flips in any other software, it’s really hard to say whether Firefox is super stable or super flaky.
No, the exact % depends on how stable everything else is.
As a trivial example, take three programs: one sets a pointer to a random address and tries to dereference it; one does the same, but only when the last two digits of a timer it checks read “69”; and one never sets a pointer to an invalid address. Based on the programs themselves, the first will crash almost every run, the second about 1% of the time, and the third never.
If you had a mechanism to perfectly detect bit flips (honestly, that part of the OP has me the most curious), and you ran each program until you had detected 5 bit-flip crashes (say they happen once per 10k runs), then for the first program roughly 0.01% of its crashes would be due to bit flips, about 1% for the second, and 100% for the third (assuming no other issues, like OS instability, causing additional crashes).
Going with those numbers I made up: if, per 10k “runs”, you saw 1 crash from bit flips and 9 crashes from other causes, then of every 10 crash reports received, 1 would be a bit flip and 9 would be “other”. More accurately, it would be 1 in 20 for bit flips and 19 in 20 for other, given the assumption that the detector only catches half of them, which is why the measured figure was actually only 5%.
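The three-program thought experiment above can be sketched as a quick Monte Carlo simulation. The per-run bit-flip probability is the made-up 1-in-10k figure, not a measured rate:

```python
import random

random.seed(0)  # reproducible runs

BITFLIP_P = 1 / 10_000  # assumed chance per run that a bit flip crashes us
RUNS = 1_000_000

def bitflip_crash_share(own_bug_p: float) -> float:
    """Fraction of observed crashes attributable to a bit flip, for a
    program whose own bugs crash it with probability own_bug_p per run."""
    bitflip_crashes = bug_crashes = 0
    for _ in range(RUNS):
        if random.random() < BITFLIP_P:
            bitflip_crashes += 1   # hardware got us this run
        elif random.random() < own_bug_p:
            bug_crashes += 1       # our own bug got us this run
    return bitflip_crashes / (bitflip_crashes + bug_crashes)

# Program 1 crashes every run, program 2 about 1% of runs,
# program 3 never crashes on its own:
for label, p in [("always buggy", 1.0), ("1% buggy", 0.01), ("bug-free", 0.0)]:
    print(f"{label}: ~{bitflip_crash_share(p):.2%} of crashes are bit flips")
```

The simulated shares track the analysis above: the buggier the program, the smaller the fraction of its crashes that bit flips account for, even though the absolute bit-flip rate is identical for all three.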
That seems like a broad generalization; for specialized software that requires newer hardware, you’d expect the rate of bit-flip crashes to be much lower than 10%. You could also argue that since Firefox is supported on older operating systems, longer than the support lifetime of the OS itself [1], Firefox is likely being used specifically to squeeze the last bit of life out of hardware before it gets trashed.
https://www.theverge.com/tech/880781/mozilla-is-dropping-firefox-support-for-windows-7-and-8 ↩︎