pleroma.debian.social

A few years ago I designed a way to detect bit-flips in Firefox crash reports, and last year we deployed an actual memory tester that runs on user machines after the browser crashes. Today I was looking at the data that comes out of these tests and now I'm 100% positive that the heuristic is sound: a lot of the crashes we see are from users with bad memory or similarly flaky hardware. Here are a few numbers to give you an idea of how large the problem is. 🧵 1/5

In the last week we received ~470,000 crash reports. These do not represent all crashes because it's an opt-in system; the real number of crashes will be several times larger. Still, out of these, ~25,000 crashes have been flagged as containing a potential bit-flip. That's one crash in every twenty potentially caused by bad/flaky memory; it's huge! And because the heuristic is conservative we're underestimating the real number: it's probably at least twice as high. 2/5

In other words, up to 10% of all the crashes Firefox users see are not software bugs: they're caused by hardware defects! If I subtract crashes caused by resource exhaustion (such as out-of-memory crashes) this number rises to around 15%. This is a bit skewed because users with flaky hardware crash more often than users with working machines, but even then it dwarfs all the previous estimates I've seen for this problem. 3/5
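The arithmetic behind these posts is easy to check with the numbers quoted above:

```python
# Back-of-the-envelope check of the figures in this thread.
reports = 470_000   # opt-in crash reports received in one week
flagged = 25_000    # reports flagged as potential bit-flips

share = flagged / reports
print(f"flagged share: {share:.1%}")         # ~5.3%, about 1 crash in 20

# The heuristic is conservative, so the thread doubles the estimate:
print(f"doubled estimate: {2 * share:.1%}")  # ~10.6% of all crashes
```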

And to reinforce this estimate I've looked at the numbers we got from the users who run the memory tester after having experienced a crash: for every two crashes we think are caused by a bit-flip the memory tester found one genuine hardware issue. Keep in mind that this is not doing an extensive test of all the machine's RAM, it only checks up to 1 GiB of memory and runs for no longer than 3 seconds... and it has found lots of real issues! 4/5

And for the record I'm looking at this mostly on computers and phones, but this affects *every* device. Routers, printers, etc... you name it. That fancy ARM-based MacBook with RAM soldered on the CPU package? We've got plenty of crashes from those, good luck replacing that RAM without super-specialized equipment and an extraordinarily talented technician doing the job. 5/5

@gabrielesvelto holy shit, so what, you're just writing a gig of data and reading it back? and that works well enough?!

@gabrielesvelto My mind was going to cheap low-end hardware, but now that you're throwing expensive Apple Silicon SoCs into the mix, it's a bit harder to believe that they suffer from bit-flips at the rates you are implying.

@stevenodb @gabrielesvelto low-end hardware might sometimes even be less likely to hit this because it's not even trying to be super fast. High-end hardware chasing the fastest speeds is pushing the limits of stability all the time.

@gabrielesvelto is it lots of different devices, each one experiencing rare crashes at random, or is there a small number of really shitty computers accounting for a large share of the crashes?

@guenther @gabrielesvelto ah yes, bitflips georg, who lives in a plutonium mine and gets a thousand bit-flip related crashes every day, is an outlier and should not have been counted

@gabrielesvelto and what is the ratio of people who ever get a (bit-flip) crash out of all those who opted into this telemetry?

@gabrielesvelto As a personal anecdote, I built a PC in 2017 with 16 GB of DDR4 RAM that I got from Amazon (Germany). I had to return it after extensive testing with Passmark's free version of Memtest86: it had failing bits. The replacement did pass the heavy testing. If there's one thing I wanted that PC to be, it was very stable and reliable.

A few years later I got a second 16 GB kit to expand the PC to 32 GB. I had to return that kit as well; it also had errors. The replacement again passed the extensive testing. This is still the PC I'm writing from now, in fact.

Manufacturers and their QA teams must be aware of their failure rates, but they likely don't care, to save costs and make higher profits. They still sell kits with some failures because not many users subject their PCs/RAM to the torture of these long RAM tests (4 full passes or more for sanity's sake takes hours). And crashing here and there with normal usage is almost considered "normal" to some extent, unfortunately. From my experience, the "RAM Test" offered by Windows was an absolute joke: it never found anything on kits where Memtest86 would find failures in roughly 1 out of every 2 runs.

I remember watching a YouTuber testing a gaming build he had just put together; he ran Prime95 on it for only a few minutes. The computer did not crash, and according to him that was good enough for a gaming PC. I happen to disagree, in particular because in that run, even though Prime95 did not crash, it showed calculation error warnings, which could have been caused by RAM issues. Any calculation error warning from Prime95 is a serious hardware stability/reliability red flag, just like any finding from Memtest86.

It is a failure of the industry that ECC RAM is still not standard at least for PCs, laptops, and cellphones. Maybe it should be standard for all consumer electronics in fact.

@raulinbonn

> And crashing here and there with normal usage is almost considered "normal" to some extent, unfortunately.

Are these people aware that a bit flip in some file system code could nuke somebody's hard drive?

@gabrielesvelto

@gabrielesvelto so will the next new feature in rust be some kind of memory checksumming? Almost like a virtual ECC

@gabrielesvelto might be hard to do as a compiled language feature, but any interpreter/VM could do it

@ShadSterling

Any compiler or interpreter could do this. After every store, do a clflush, load, compare, and panic if there's a mismatch.

Problem is this would explode executable size (that's a lot of extra instructions) and ruin performance (that's a lot of extra instructions, and also you're effectively disabling the entire CPU cache).

@gabrielesvelto

@gabrielesvelto

Do you know why Firefox lost the browser wars?

Because they always blamed everyone else — always.

It was always a browser extension’s fault, a hardware fault, or the user’s fault. Meanwhile, Chromium works — that’s it — it works.

You lose, and worst of all, you don’t even own up to it.

@NetscapeNavigator

I'm not sure how you got to be under the impression that Chrome never crashes, especially in the presence of defective RAM, but you're sorely mistaken.

https://ioc.exchange/@shac/116171850107280971

As for “it works”, I'm writing this post with LibreWolf, a Firefox variant. Obviously I wouldn't be able to do that if the browser didn't work.

@gabrielesvelto

@argv_minus_one @raulinbonn yes, the worst outcome of a bit-flip is when data that will be written to disk happens to overlap it, which then makes it all the way to the drive. And BTW this is one of the reasons why competent filesystems should always implement checksums for both data and metadata: it increases the chances of detecting these issues early, before they do permanent damage.
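A toy illustration of why block checksums catch bit-flips (CRC-32 stands in here for the stronger checksums real filesystems like ZFS and Btrfs use; this is not any particular filesystem's format):

```python
import zlib

# A block is checksummed while healthy; a single bit-flip in RAM changes
# the checksum, so the mismatch can be caught before the corrupt block
# does lasting damage on disk.
block = b"filesystem metadata" * 64
stored_crc = zlib.crc32(block)

corrupted = bytearray(block)
corrupted[7] ^= 0x10  # simulate a single flipped bit

assert zlib.crc32(bytes(corrupted)) != stored_crc
print("bit-flip detected by checksum mismatch")
```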

@gabrielesvelto

Checksums will let you detect that there is a problem, but won't actually save your data. If a bit flip causes the file system driver to write to the wrong LBA or corrupt a key file system data structure or something, the damage will still be quite permanent unless you have a backup.

@raulinbonn

@dysfun yes, believe me when I tell you that I was as surprised as you are. We try to do a few different tests (like writing different patterns before reading them back) but in a nutshell, that's it. Write, read back, check if it matches.
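That write/read-back/compare loop can be sketched in a few lines. This is a simplified illustration, not Firefox's actual tester (which is native code and uses more patterns):

```python
import mmap

# Classic pattern test: fill a buffer with a known pattern, read it
# back, and report any byte that doesn't match.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # zeros, ones, alternating bits

def test_region(size=1 << 20):
    buf = mmap.mmap(-1, size)  # anonymous mapping: fresh pages from the OS
    mismatches = []
    try:
        for pattern in PATTERNS:
            expected = bytes([pattern]) * size
            buf.seek(0)
            buf.write(expected)
            buf.seek(0)
            data = buf.read(size)
            if data != expected:  # only scan byte-by-byte on failure
                mismatches += [(i, pattern, b)
                               for i, b in enumerate(data) if b != pattern]
    finally:
        buf.close()
    return mismatches  # empty on healthy RAM

print(test_region(64 * 1024))  # healthy RAM: []
```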

@gabrielesvelto amazing. computers still surprise me sometimes

@guenther I can't answer that question directly because crash reports have been designed so that they can't be tracked down to a single user. I could crunch the data to find the ones that are likely coming from the same machine, but it would require a bit of effort and it would still only be a rough estimate.

@guenther generally speaking a single machine won't send a lot of crashes. It's very common that they only have one bad bit across their whole installed RAM. They'll hit it eventually, especially if it's in the lower address ranges, but not all of the time. And in order to crash some important data needs to end up there, like a pointer or an instruction.

@gabrielesvelto @guenther why would the lower address ranges be special? This is confusing to me. I can't imagine firefox runs on any OS without virtual memory (or ASLR), so it doesn't seem like that should correlate strongly with any physical aspect.

@dysfun@social.treehouse.systems @gabrielesvelto@mas.to when I have had bad ram memtest86 usually finds it within seconds, and it's nearly that easy in userspace tools. i wrote a daemon years ago to try and find cosmic rays on my work laptop overnight while idle, and bad ram tripped it instantly

(as an aside, the issues i frequently complain about with e.g. Firefox crashing and then traveling back in time to reopen my tabs from weeks or months ago also happens on machines with ECC ram)

@linear @gabrielesvelto i'm reminded of the memtest86 developer posting about using a heat gun to try and induce memory errors

@vathpela @guenther I meant in the lower *physical* address ranges, because it's more likely to be used early on even on a lightly loaded machine. I once had a laptop with a bad bit at the very end of the physical range, I would hit it only when running Firefox OS builds which were massive (basically building Firefox + a good chunk of Android's base system at the same time)

@gabrielesvelto @guenther That still seems really weird to me - why would firefox be likely to get a low physical address? If anything is likely to have a higher chance of getting that memory, I would think it would be the kernel (which has genuine lowmem requirements sometimes, e.g. for DMA bufs on some platforms), but a userland process seems odd.

@vathpela @guenther oh it's not for Firefox specifically. Users with bad bits in lower address ranges will be more likely to encounter problems with *everything*, including the kernel. I also don't literally mean the *lowest* ranges. Say, if you have a bad bit in the first GiB of physical memory you'll see its effects far more often than if you have it in the last GiB of a 32 GiB machine

@gabrielesvelto @vathpela @guenther the Linux kernel has CONFIG_SHUFFLE_PAGE_ALLOCATOR to randomize which memory gets allocated first, which distros generally enable, but probably no one activates it via the boot param page_alloc.shuffle=y ;)

@vbabka @gabrielesvelto @guenther all boot params are policy failures :)

@gabrielesvelto I did not know that bit flips could refer to reproducible bad RAM issues... I thought they were random...

@adingbatponder people and research have usually focused on random bit-flips caused by high-energy radiation and similar phenomena. Actual RAM going bad is a poorly documented and under-researched problem, mostly because the industry doesn't care. This is a more extensive thread on the issue: https://fosstodon.org/@gabrielesvelto/112407741329145666

@gabrielesvelto This matches the kind of things I've heard from the Microslop people about their error data.

@gabrielesvelto Awesome! But do you report visually to the user that they might have bad ram? That would be a good user experience thing to do. Otherwise they will just get angry at Firefox.

@gabrielesvelto

Would there be any way of getting to that memory diagnostic information?

Because I've got an intel 13th gen laptop where firefox crashes constantly, and I'm trying to convince the vendor it's the main board.

@gabrielesvelto Is the firefox crash reporting stuff open source and available anywhere? I need to get something similar setup for @pidgin and obviously would love to be able to reuse existing work 😅

@vbabka If it is not enabled by default, then it is not important.

@gabrielesvelto @vathpela @guenther

@oleksandr @vbabka @gabrielesvelto @guenther no seriously, basing things on boot params should just be considered a bug. It's always a bad choice, usually thought to be necessary because of some other bad choice or trade-off.

@gabrielesvelto On Linux at least it is possible for the kernel to quarantine the physical pages containing bad bits. At a previous job we used this to remotely repair expensive appliances that would have required an onsite technician to swap out the whole unit.

It’s not widely used, and pinpointing the bad pages isn’t easy (or possible from userspace, afaik). But maybe now that DRAM is expensive again it could be improved.
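One low-tech way to keep the kernel off a known-bad page is the `memmap=nn$ss` boot parameter, which marks a physical range as reserved. A tiny helper to build that parameter for a bad address (the address below is purely illustrative):

```python
# Build a kernel command-line fragment like "memmap=4K$0x12345000" that
# reserves the 4 KiB page containing a bad physical address so the
# kernel never allocates it. (In a GRUB config the '$' must be escaped.)
def memmap_param(bad_phys_addr, page_size=4096):
    start = bad_phys_addr & ~(page_size - 1)  # round down to page start
    return f"memmap={page_size // 1024}K${start:#x}"

print(memmap_param(0x12345678))  # memmap=4K$0x12345000
```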

@ericseppanen I would argue for widespread use of ECC memory. The price of doing it in hardware would be small, certainly smaller than the damage caused by bad memory. But even if one doesn't want to pay it there's always the possibility of using inline ECC in modern SoCs that have both an integrated memory controller and caches: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/psdk_rtos/docs/user_guide/developer_notes_ddr_inline_ecc.html

@gabrielesvelto ECC DRAM would improve things, but people have been arguing this for decades and if it hasn't happened yet I doubt there's much you or I can do to change that.

On the other hand, if someone with an interest in systems programming wanted to build an always-on bad DRAM detector, I would encourage them to try.

It would be a nice gift to the world to prevent all those existing computers from going to the landfill over a few bad DRAM pages soldered to the mainboard. It might even enable a secondary market for mostly-good-but-slightly-bad DRAM modules to be bought at a discount.

Here's one possible design of an always-on bad-memory handler:

- A kernel API for allocating physical pages.
- A kernel API to quarantine physical pages.
- A userspace daemon that periodically allocates, tests, and releases (or quarantines) physical pages.
- A heuristic for determining if physical addresses are actually bad (and not the result of system flakiness or software bugs).
- A way to persistently store lists of bad pages, and tooling to maintain those lists.
- A way to load the bad pages list at boot (kernel params exist, but the list could be quite large...)

I'm sure a few of these exist (my knowledge is quite out of date), but to the best of my knowledge nobody has put them all together in a form that could be enabled by default and then mostly left to run on its own forever.
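The daemon's main loop from the design above might look something like this. The quarantine hook is hypothetical (the kernel APIs in the list don't exist in this form yet), and a real implementation would track *physical* frames, not virtual pages:

```python
import mmap
import time

PATTERNS = (0x55, 0xAA)  # alternating-bit patterns

def page_looks_bad(size=4096):
    # Allocate a fresh page and write/verify a couple of patterns.
    buf = mmap.mmap(-1, size)
    try:
        for p in PATTERNS:
            buf[:] = bytes([p]) * size
            if buf[:] != bytes([p]) * size:
                return True
        return False
    finally:
        buf.close()

def scan(iterations, interval=0.0, quarantine=lambda: None):
    # Real daemon: run forever, resolve virtual->physical addresses,
    # apply the flakiness heuristic, and persist the bad-page list.
    bad = 0
    for _ in range(iterations):
        if page_looks_bad():
            quarantine()  # hypothetical: record/offline the bad frame
            bad += 1
        time.sleep(interval)
    return bad

print(scan(10))  # healthy RAM: 0
```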

@gabrielesvelto silly question since I am asking way beyond my own knowledge: is there any way to defensively program such that the software won’t crash when bad memory does a bit flip? Like any way to write the code to have a more graceful failure mode?

I realize this might be an astoundingly foolish question, and/or might require a defensive programming paradigm that not only diverges from norms but may be unfamiliar/inaccessible to standard coders.

@davidaugust @gabrielesvelto Servers normally use ECC memory (Error Correction Code) which detects such bit flips… but Intel decided that consumers don't need that.

@stuartl @gabrielesvelto oh wow. So no longer available?

@davidaugust @gabrielesvelto Never offered to consumers in the first place.

If you want ECC RAM, you buy a server.

@stuartl @gabrielesvelto ah, that makes sense. So available on servers. Now I understand.

@davidaugust @gabrielesvelto I think a big factor is that up to the turn of the century, RAM was expensive.

The 16MB needed to make Windows 95 run at a decent speed cost a fortune. Windows NT and IBM OS/2 both ran okay-ish in 16MB, but really needed more.

Windows NT officially had minimum specifications of 12MB, but it ran like a 3-legged corgi with arthritis. Windows 95 on 8MB was no speed demon either (and 4MB barely got you a desktop… been there, done that!).

A RAM module from that era (72-pin SIMM https://en.wikipedia.org/wiki/SIMM) might've had 8 chips on it, each storing 4 bits of a 32-bit word. A parity module would have 9: the extra chip stores one parity bit per data byte, enough to detect (though not correct) single-bit errors in that 32-bit word.

Consumer PCs were a race to the bottom cost-wise, so things like ECC RAM got kept for servers which are usually less cost-sensitive.
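The per-byte parity scheme those 9-chip modules used fits in a few lines; it detects any single flipped bit in a byte, but carries no information about *which* bit flipped, so it can only detect, not correct:

```python
def parity_bit(byte):
    # Even parity: 1 if the byte has an odd number of set bits.
    return bin(byte).count("1") & 1

stored = 0b1011_0010
stored_parity = parity_bit(stored)

corrupted = stored ^ 0b0000_1000  # one bit flips in the DRAM chip
assert parity_bit(corrupted) != stored_parity
print("parity mismatch: single-bit error detected (but not correctable)")
```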

@stuartl @davidaugust @gabrielesvelto that's only kinda true - intel has had workstation-class chips that support ECC memory for years, but ECC UDIMMs are *really* expensive, even more so than the RDIMMs usually found in servers. this was true even before the current RAM pricing debacle.
technically speaking the memory controllers on several generations of intel desktop chips are physically capable of supporting ECC UDIMMs, but they've got it disabled in software to enforce their artificial market segmentation 💢

@astraleureka @davidaugust @gabrielesvelto Yeah, the difference between "workstation class" and servers are quite small.

Intel Xeon family basically.

But yeah, the price of the ECC DIMMs is basically what kills it. Joe Average, who wants a cheap laptop, sees two machines, one several hundred dollars more than the other… all things "apparently" being equal (ECC is just a buzzword acronym to the uninitiated), he instinctively chooses the cheaper option.

@astraleureka @davidaugust @gabrielesvelto

The hardest RAM I've ever had to source was DDR3 ECC SO-DIMMs.

The servers (https://www.supermicro.com/en/products/motherboard/A1SAi-2750F) that run this Mastodon instance use them, and they're as hard to source as rockinghorse crap and cost a small fortune.

@alienghic have you had any luck running memtest86+ (https://www.memtest.org)? if you've got a usb stick kicking around it's quite easy to install, they've got a tool that sets up the usb drive to boot into the memory test. if your memory is indeed bad and causing that many crashes you should see errors reported within a few minutes, maybe half an hour at most

@astraleureka

Debian will even install memtest as a grub option, and I ran a default run which passed.

@stuartl @davidaugust @gabrielesvelto ugh, I used to run an older variation of those Atom boards and yeah, ECC SODIMMs are somehow even harder to find/more expensive than UDIMMs. really not worth the increased expense unless you're severely space/power limited and have no other options

@alienghic @gabrielesvelto

Did you say "13th gen intel"? That's likely the CPU. 13th and 14th gen were faulty by design and prone to frying themselves during normal operation.

(And Intel's response to the situation has put me off of ever buying anything from them again. But that's a long rant for another time.)

@vathpela @oleksandr @gabrielesvelto @guenther it's meant for hardening and there's some performance trade-off, which is typical. But IMHO it's better if hardening options can be enabled just by boot parameters and not require a different distro kernel flavor.

@vbabka @oleksandr @gabrielesvelto @guenther you should be able to turn it on with a running kernel.

@vbabka @oleksandr @gabrielesvelto @guenther IMO it's never about boot time vs compile time, and always about being able to turn it on and transition in to it. Of course that's sometimes the hardest way and why we make bad trade-offs, but it also keeps us from being able to enable a lot of features we want on the boot path.

@vbabka @oleksandr @gabrielesvelto @guenther Obviously I have some bias here, but it's because people want things from booting that command line variability makes intractable.

@vbabka @oleksandr @gabrielesvelto @guenther and other OSes simply do not have this problem at all. It's optional.

@stuartl @davidaugust you can build desktop-class machines with ECC memory by picking the right CPU, the right motherboard, and unbuffered ECC UDIMMs. I've been using desktop machines with ECC memory since 2012; the current one, my main workstation, has two 48 GiB unbuffered ECC DDR5 sticks on an ASUS PRIME B650-PLUS with a Ryzen 9700X. Before the RAMpocalypse began, ECC UDIMMs were a bit more expensive than regular ones, but not so much as to make them inaccessible.

@purpleidea we want to do that but the UI and UX design is a remarkable challenge. We don't want it to look like the many, many scams on websites that pop-over messages telling you your machine is borked.

@alienghic if you've got a Raptor Lake-based machine check if you've updated the BIOS and got the latest CPU microcode, but even then it could be your CPU acting up. See my old thread on the topic: https://mas.to/@gabrielesvelto/115939583202357863


@grimmy @pidgin absolutely! We used to rely on Google's projects, but reusing them was hard, so I've been driving a large effort to replace them with easy-to-integrate tools. Here are the relevant crates:

https://github.com/rust-minidump/rust-minidump
https://github.com/rust-minidump/minidump-writer
https://github.com/mozilla/dump_syms/
https://github.com/EmbarkStudios/crash-handling

The last one is for integrating stuff into your project. The code we use to integrate this stuff in Firefox is still too tightly bound to our machinery but I plan on moving it out soon.

@vathpela How would that even work? Any memory allocated up until the point where you switch the setting would have to remain in place, negating the benefits of randomized allocation for everything that starts early. And both allocators would have to work with the same in-memory format, which may or may not be possible?

@vbabka @oleksandr @gabrielesvelto

@gabrielesvelto makes me want to replace all my memory with ECC stuff... if only RAM didn't cost 500% what it did a few months ago...

@novet one of the main reasons why I ran older server-grade hardware for desktops until recently: despite being older, its performance was competitive with (or even better than) current-generation consumer CPUs, there were more PCIe lanes available, and ECC memory was supported. If you're located in the US, used server equipment was also significantly cheaper than brand-new consumer gear; you could build an 8+ core, 64GB+ machine for a fraction of the price of new.

I completely skipped the entire DDR3 and DDR4 generations of consumer gear, only buying a consumer-grade laptop with DDR5 now that ECC is standard functionality (at least on-die in the DRAM chips... not on the path between CPU and DIMM though)

@astraleureka
electricity must be insanely cheap over there...
@novet

@guenther @vbabka @oleksandr @gabrielesvelto right, like I said there are always difficulties and trade-offs. You might have to flip the switch and then re-start tasks or kexec or other things, or something else (who knows, someone would have to design it).

@gabrielesvelto You gotta tell the user somehow.

@gabrielesvelto @pidgin dang, we've been actively trying to not pull rust into our already overly complicated builds, especially on windows :-/

@vathpela @vbabka @oleksandr @gabrielesvelto @guenther sounds like a stupid comment. A running kernel has already completed the majority of memory allocations it will ever need, so toggling such an option by then would have no effect. Unless you want such toggling to force a complete realloc of all kernel memory, which would be even more stupid.

@grimmy @pidgin you can use the client-side tools without pulling them into your build. For generating minidumps you can use Google Breakpad or Crashpad which are both C++, but they're not very easy to integrate in a non-Google project.

@gabrielesvelto @pidgin Cool, I'll have to dig into this more. We're just C though, no C++, but that's much easier to pull in 😅

@gabrielesvelto @grimmy @pidgin don't you still rely on nodejs and therefore V8 in the build system?

@wyatt @pidgin @gabrielesvelto Which project are you referring to here? Pidgin has been a pure C application since 1998...

@wyatt @pidgin @grimmy I don't remember. There are definitely some linters that required it, but I'm not sure about the build. In general we use xpcshell to run headless JS, but the whole thing is such a large contraption that there might be something that depends on node.

@gabrielesvelto @pidgin @grimmy i feel like that's what stopped my powerpc build last time.
haven't tried in a while though.

@wyatt @pidgin @gabrielesvelto for Pidgin? Again, Pidgin 2 (and Gaim before it) has been a 100% C-based application that used autotools for its build.

Background: I've been contributing to the project since 2003 and leading it since 2016.

@grimmy @pidgin @gabrielesvelto this was a reply to gabriele svelto about firefox

@wyatt @pidgin @gabrielesvelto Sorry, saw that after the fact..

@hyc @vbabka @oleksandr @gabrielesvelto @guenther you have completely missed the point and decided to do some name calling. Classy.

@vbabka @vathpela @oleksandr @gabrielesvelto @guenther eh? Your earlier comment is in direct opposition to his.

@argv_minus_one @ShadSterling @gabrielesvelto It also simply wouldn't work.

The checking code itself could get corrupted.

This is "install byzantine fault tolerant hardware" kind of hard to fix.

@lispi314

If the checking code gets corrupted, it'll probably crash instead of actually going through with the write.

Probably.

Unless you're really unlucky…

@ShadSterling @gabrielesvelto

@lispi314

I'm reminded of how cancer works.

Most of the time, DNA transcription and storage works fine.

Sometimes it doesn't. Most of the time when this happens, the cell will detect the error and apoptose.

And if it doesn't, an immune cell will probably eat it.

Failing that, it will probably simply fail to function.

A few errors will cause it to reproduce endlessly, but it'll probably run out of space/resources to grow.

But if it doesn't, you get cancer…

@ShadSterling @gabrielesvelto

@gabrielesvelto Like at a minimum have it date >> ~/.mozilla/firefox/memtest_fail or something, come on!

@purpleidea well, yes, we could definitely do that, or store it so that it can be reached from about:support

@gabrielesvelto if you want some field data: I have a PC with 4 memory slots. When populated with 2x16GB sticks overclocked to their XMP speed, all memory tests pass. When populated with 4x16GB sticks overclocked to their XMP speed (all the exact same model), memory tests fail consistently on a very high address bit. I had to run them slightly under-volted, below their XMP speed, to get tests to pass. I would only ever get crashes in games and other heavy-load processes.

@gabrielesvelto my guess is that a lot of people who enable XMP never test their memory and assume it should "just work." I suspect it's my chipset that isn't good enough to handle all that memory bandwidth, as tests fail on the same address bit regardless of the permutation of RAM sticks. So I am not surprised at all that there are plenty of weird memory bugs out in the field, especially for a popular program.

@gabrielesvelto In the old days MS-DOS used to do a quick memory check when loading himem.sys. This used to annoy me because on systems with more than 4MB of RAM it could take a while. So there's an option to disable it, and in the DOS manual, where this setting is documented, they explain that they enabled it by default because a lot of the memory out there was flaky and caused lots of issues. It seems that not much has improved; ECC should become the default these days.

@highvoltage and at that time you could still buy memory with parity! There was still an expectation that client systems would at least have the option of checking that everything was fine.