Corrupting a ZFS File on Purpose

zzdw • 3 days ago • 19 comments • Article on oshogbo.com

Discussion (19 Comments) • Original on Hacker News

ralferoo•3 days ago

Hmmm, it's been a long long time since I actually had a failed drive (and also I don't use zfs), but from what I remember of my last failing drive 20 years ago, the drive was able to detect that sectors had been corrupted, and then failed the read rather than just returning silently corrupted data. If my memory is correct, replacing random bytes on disk wouldn't actually reflect the typical way data corruption manifests itself.

I always thought that the reason zfs did its extensive CRC checks was primarily to detect data corruption while it was in RAM or over the network, with a side effect that in the rare cases that data on disk got corrupted without the drive detecting it because the CRC was still valid, it'd also be spotted.

But anyway, it might be worth testing by replacing some of the disk images with actually truncated ones so that there are holes when reading, so that it returns an actual read error rather than junk data.

adrian_b•3 days ago

The error-correcting codes used by HDDs/SSDs correct or detect the most frequent errors, but sometimes, when there are too many erroneous bits in a sector, they can mis-correct the data and then the HDD/SSD returns a corrupted sector without signaling any error.

I have seen this a few times on HDDs that had been used for the cold storage of archival data for several years (around 5 years or even more). For each archive file, I had my own hash values that were used to detect corrupted files, which allowed me to detect all such cases. I had duplicates for all such HDDs. Sometimes both HDD copies had a few silently corrupted sectors, but they were not in the same locations, so in all cases I could recover the corrupted files from their duplicates. If I had stored the archival data without redundancy, I would have lost it.

If you do not use hashes or other error-detecting codes for all your files, like I do, you may have had some failures in your HDDs without recognizing them, but such errors are much more likely to happen in files that have been stored for many years.
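A minimal sketch of the per-file hash approach described above, for illustration only. The function names (`sha256_of`, `build_manifest`, `find_corrupted`) and the choice of SHA-256 are assumptions, not the commenter's actual tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_corrupted(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose digest has changed."""
    bad = []
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            bad.append(rel)
    return bad
```

Running `find_corrupted` against each duplicate drive with the same manifest would show which copy of a given file is still intact, so the damaged one can be replaced from the good one. The manifest itself should be stored in more than one place.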

ramses0•2 days ago

And/Or: `*.par` files.

https://en.wikipedia.org/wiki/Parchive

adrian_b•about 7 hours ago

Yes, I have also used par2create/par2verify for many years to add redundancy to archive files and to repair any corrupted files.

However, I use both par2create and duplicate storage media, because duplicates that are preferably stored in different geographic locations are the only solution that guards against incidents so serious that they would destroy partially or totally the storage device.

By itself, when an adequate amount of added redundancy is chosen, par2create is sufficient to recover archive files that are only affected by a few sporadic corrupted sectors, like on a HDD that has been stored in good conditions for some years. It will not help if the entire HDD becomes unusable, due to some mechanical or electrical defect, which may happen in HDDs used for cold storage, instead of being used continuously.
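To make the idea of added redundancy concrete, here is a small sketch that uses a single XOR parity block. This is a deliberately simpler, swapped-in technique: PAR2 uses Reed-Solomon codes and can repair several damaged blocks, whereas one XOR parity block can rebuild only one. The names `xor_blocks` and `recover` are hypothetical.

```python
from functools import reduce
from typing import Optional


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(blocks: list[Optional[bytes]], parity: bytes) -> list[bytes]:
    """Rebuild at most one missing block (marked None) from the parity block."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("a single XOR parity block can rebuild only one block")
    if not missing:
        return list(blocks)
    survivors = [block for block in blocks if block is not None]
    # XOR of all surviving blocks and the parity yields the missing block.
    restored = xor_blocks(survivors + [parity])
    repaired = list(blocks)
    repaired[missing[0]] = restored
    return repaired
```

Note that the sketch has to be told which block is bad. In practice that information comes from a per-block checksum or from a read error, which is why redundancy and error detection are used together.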

wongarsu•about 8 hours ago

Or rar files with recovery records. Same concept, but in one self-contained file instead of a number of sidecar files

throw0101c•about 9 hours ago

> I always thought that the reason zfs did its extensive CRC checks was primarily to detect data corruption while it was in RAM or over the network, with a side effect that in the rare cases that data on disk got corrupted without the drive detecting it because the CRC was still valid, it'd also be spotted.

Nope, it's always been about on-disk bit rot.

First off: drive firmware has been known to return the wrong LBA data. The OS asks for 123, the drive reads 234 (and verifies its drive-level CRC, which passes) and sends it up. The application gets a bundle of bits that's not correct. ZFS expects a certain checksum from that part of the tree/file, so if LBA 234 gets returned, it will not match the checksum that is stored for 123.

Next, if you have RAID-1 and a drive has corrupted data, but you don't have higher-level FS checksums, how do you know which mirror has the correct data? They're different, but which is correct? With ZFS you know which block has the correct checksum, return that data to the application, and then use the correct data to repair the wrong one.
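The two points above can be sketched with a toy mirror. This is an illustration under stated assumptions, not ZFS code: the `Mirror` class and `checksum` helper are hypothetical, in-memory lists stand in for disks, and the sketch checks both sides on every read, whereas a real implementation would normally consult the second side only when the first fails verification.

```python
import hashlib


def checksum(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()


class Mirror:
    """Two in-memory 'disks' plus checksums kept separately from the data."""

    def __init__(self, n_blocks: int, block_size: int = 16):
        empty = bytes(block_size)
        self.disks = [[empty] * n_blocks, [empty] * n_blocks]
        # The expected checksum is stored apart from the block it describes,
        # so a drive returning the wrong (but internally valid) block is caught.
        self.sums = [checksum(empty)] * n_blocks

    def write(self, lba: int, block: bytes) -> None:
        for disk in self.disks:
            disk[lba] = block
        self.sums[lba] = checksum(block)

    def read(self, lba: int) -> bytes:
        expected = self.sums[lba]
        good = None
        bad_disks = []
        for index, disk in enumerate(self.disks):
            if checksum(disk[lba]) == expected:
                if good is None:
                    good = disk[lba]
            else:
                bad_disks.append(index)
        if good is None:
            raise IOError(f"no copy of block {lba} matches its checksum")
        for index in bad_disks:
            self.disks[index][lba] = good  # repair the bad side from the good one
        return good
```

For example, copying block 2's contents over block 1 on one disk simulates the "wrong LBA" case: that data is perfectly valid as far as the drive is concerned, yet `read(1)` rejects it, returns the other copy, and rewrites the bad side.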

BuildTheRobots•about 5 hours ago

I don't know how much better modern drives (and SSDs) have gotten[1], but as someone who started digital hoarding in the mid 90's, on-disk bitrot used to be a massive problem. The amount of my video, audio and pictures that suffered damage was palpable. ZFS offering to fix it was a massive selling point at the time and, based on personal experience, it delivered.

ZFS also lets you specify the number of copies on a single disk. This sounds a bit weird, but as drives suffer block failures far more often than total failures, it's actually surprisingly useful in some situations.

[1] My suspicion is significantly, as storage sizes are now multiple orders of magnitude larger and errors per MB can't have scaled up linearly to match.

ssl-3•about 2 hours ago

> Hmmm, it's been a long long time since I actually had a failed drive (and also I don't use zfs), but from what I remember of my last failing drive 20 years ago, the drive was able to detect that sectors had been corrupted, and then failed the read rather than just returning silently corrupted data.

That's the behavior that is desired, yes. And in a neat world of frictionless pulleys and ropes that don't stretch, perhaps that is what happens.

In reality, the root reasoning for filesystems to detect bitrot is simpler: It's irrational to expect that a device which is already failing is going to behave in a predictable way.

matja•2 days ago

You're right that the ECC validation is very robust, but that only validates one small part - that the drive is reading what it has previously written, not that the data was correct when it came in to the drive, correctly handled by the firmware, or even written in the correct place (LBA) on the drive.

There have been times when some features of entire models of drives have been disabled in the Linux kernel because of buggy firmware that silently writes bad data (with correct ECC), so reading it back is successful from both the drive's and the OS's block driver views.

I was hit by this myself with the queued TRIM command firmware bug that affected all Samsung EVO 840 SSDs (Linux kernel commit 9a9324d3969678d44b330e1230ad2c8ae67acf81 if you want to look into the history) - the drive didn't report any errors, but ZFS kept reporting corruption, and kept on fixing it in the background.

guardiangod•about 6 hours ago

I ran 5 external USB + SMR hard disks in RAIDZ 5 for 10 years. The only thing I had to change was to use Highpoint's enterprise-level USB controllers; commercial USB controllers from Realtek and Renesas are junk and will drop the drives after a while.

Even then, I had multiple cases where files were corrupted, and once the whole array refused to come online due to corrupted metadata. I had to make ZFS replay the journal log with undocumented commands. Sometimes it took a few days of hair-raising recovery, but I always managed to get the array back intact.

The files that got corrupted were always extremely large files (>50 GB) with many small reads/writes (e.g. iSCSI image files).

It's pretty impressive how resilient ZFS is, really, given that I had what is likely the worst possible hardware combination.

BuildTheRobots•about 5 hours ago

Out of curiosity, why were you using a >50 GB file on a dataset as an iSCSI target rather than a zvol, or did I misunderstand?

xk3•about 2 hours ago

I've done something similar before with Btrfs

https://gist.github.com/chapmanjacobd/bc6e31c8bc3647e0bcb0c4...

pretty fun!

anonymous_user9•3 days ago

> The DVA was correct, the sector math was correct, the dd command was correct. The right place, the wrong mental model.

God the intensity is tiresome. Whether or not it's AI slop, it's also bad writing. Things can be fun or interesting or worthwhile without being a harrowing battle of discovery!

calcifer•about 7 hours ago

> Things can be fun or interesting or worthwhile without being a harrowing battle of discovery!

The quoted sentences used "correct", "right" and "wrong". Hardly the sensationalist words you're implying.

rcxdude•29 minutes ago

It's not the word choice, it's the whole tone and structure of the sentence. It reads like a horror writer building up the tension before a big reveal but it just keeps drawing it out over a whole article and for something that isn't worth the build-up. It gets quite tiring to read IMO (LLM writing in general tends to have a grandiosity to it which really grates with something which is meant to be more informative, in my experience. They will explain a section of tax law like it's the second coming of Christ).

eigencoder•about 6 hours ago

I get what they're saying, it's more about the punchy tone than the word choice

xeyownt•about 5 hours ago

Nice writeup. This is the kind of exercise and information that really helps you understand better how things work.

lanycrost•about 12 hours ago

I miss ZFS. I only had one chance to work with it in production and liked it very much. It has performance overhead compared to journaling filesystems, but it is very well designed.

igtztorrero•about 8 hours ago

I always run my servers on a ZFS pool mirrored (RAID1) across 2 NVMe drives, because when an NVMe drive fails, it fails completely. How can a file get corrupted during normal operation?