Discussion (26 Comments)

sethev•23 minutes ago
This seems sketchy. O_DIRECT skips the operating system's page cache; it does not guarantee that the SSD driver sent the data to the SSD or issued a flush to the drive itself. The data could still be in the driver's memory or in non-durable memory in the drive itself when this engine says "ok, we're good".

EDIT: sketchy from an answering "what exactly are the guarantees?" perspective
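
A minimal sketch of the distinction being made here, assuming Linux and C (the function name and error handling are illustrative): O_DIRECT only bypasses the page cache; an explicit fdatasync() (or opening with O_DSYNC) is still what asks the kernel to flush the drive's volatile write cache.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Write one aligned block; "written" only becomes "durable on the device"
       after the fdatasync() below (or if the fd were opened with O_DSYNC). */
    int write_block(const char *path, const void *buf, size_t len, off_t off) {
        int fd = open(path, O_WRONLY | O_DIRECT);
        if (fd < 0)
            return -1;
        /* buf, len and off must all be aligned for O_DIRECT */
        if (pwrite(fd, buf, len, off) != (ssize_t)len || fdatasync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }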

nh2•about 2 hours ago
> fsync doesn’t just sync the file’s data, it syncs every piece of metadata the file depends on: ... directory entry

Famously not, as the man page says.

It is also said later in the article:

> POSIX strictly requires a parent-directory fsync to make a newly created file’s existence durable.

So I'm not sure why the dirent sync is claimed earlier.

thomas_fa•32 minutes ago
Thanks for pointing out the mistake. We should make it clearer: when you fsync an open file descriptor, it only syncs that file's own metadata. To make a newly created file truly persistent, we need to issue another fsync on the directory fd, which makes it more expensive.
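
A minimal sketch of the pattern described above, assuming Linux and C (names and error handling are illustrative): fsync the file fd for its data and own metadata, then fsync the parent directory fd so the new directory entry is durable as well.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    int create_durably(const char *dir, const char *name,
                       const void *buf, size_t len) {
        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0)
            return -1;
        int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd < 0) { close(dfd); return -1; }
        if (write(fd, buf, len) != (ssize_t)len ||
            fsync(fd)  != 0 ||   /* the file's data and its own metadata */
            fsync(dfd) != 0) {   /* the new directory entry              */
            close(fd); close(dfd);
            return -1;
        }
        close(fd);
        return close(dfd);
    }
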
matja•about 2 hours ago
Even with O_DIRECT and aligned blocks, I still don't understand how the storage engine can return a "successful commit" to the client without a sync at some point, because a sync (IIRC) is the only way to guarantee an ATA/NVMe FUA command is sent, and the device write cache/buffer is committed.
klodolph•about 1 hour ago
:-/ it’s a statistical guarantee in the first place. A successful commit in a durable storage engine just needs to achieve some finite level of durability, like “10^-7 probability of loss per year”. The durability is a property of the whole system, and it is possible to achieve durability without fsync, you just may have a hard time explaining what the durability is, how you calculated it, and what the evidence or justifications are for the numbers you give.

Even if you just look at hardware failure rates, you get unrecoverable I/O errors (data corruption) at about one in 10^15 bits, disk failures at a rate of about 1% per year, etc. People usually like to have better guarantees than those numbers give you with just a plain fsync anyway; so you are probably forced to do an analysis of the whole system if you want to provide good durability guarantees and be able to explain where the guarantees come from.

asdfasgasdgasdg•34 minutes ago
10^-7 (loss/record) * 10^8 (records/year) yields 10 data losses per year. If you're even a medium-sized business, you need a much better than 10^-7 probability of loss.
jakewins•29 minutes ago
I used to say this as well, but the industry has, for a long time now, equated "durable" with "stored on disk". Any DBA will assume that's what it means, and use that fact when they work out the replication they need, whether in clustering or in RAID.

If you’re building a data storage system and are using the term “durable” to mean “it’s in RAM on three virtual machines”, for example, I don’t think it’s unfair to say that you are lying to your customers, because you are intentionally misusing a well-established term.

thomas_fa•25 minutes ago
Yes, as we mentioned in the post, it is targeted at virtualized NVMe disks, and we don't have control over actually issuing the FUA command. We are also changing to open data files with O_DSYNC to make them work in normal on-prem deployment environments.
binaryturtle•about 1 hour ago
To truly guarantee things you probably also would need an uncached read afterwards (to verify the data comes back properly from the device). Now that would kill any sort of performance, of course.
asdfasgasdgasdg•30 minutes ago
There is no such thing as a guarantee in life, there are only probabilities. The goal is to make it sufficiently unlikely that data is lost, and to balance that against the cost of doing so.

That is where the disparity lies here. Reading back the data after the device reports that it has been written offers little in the way of additional assurances that it's successfully written. But if you report successful writes without syncing, there is a near certainty that you'll lose data on every power loss.

myself248•about 2 hours ago
To step back a bit, the device still has a filesystem on it, and the structures described here are files within the filesystem? Just you're able to write directly into them, bypassing the filesystem layer, because you've constrained yourself to writes that don't require updating other parts of the filesystem structure?
thomas_fa•11 minutes ago
Yes, that's right. We could go even further and use the raw devices without relying on any filesystem. We would then need to allocate/format raw disk space ourselves and couldn't just open files as simply as we do right now. It would take some extra effort, but we would like to explore that in the future.

It would also make system initialization faster, since right now we need to write all zeros to make ext4/XFS actually initialize extents as "allocated".
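
A rough sketch of the zero-fill initialization mentioned above, assuming Linux, C, and ext4/XFS (sizes and names are illustrative): plain fallocate() leaves extents marked unwritten, so writing real zeros once up front converts them and keeps later O_DIRECT writes from triggering extent-conversion metadata updates.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* size is assumed to be a multiple of CHUNK; 4096 matches a typical
       logical block size. */
    enum { CHUNK = 1 << 20 };

    int preallocate_written(const char *path, off_t size) {
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return -1;
        if (fallocate(fd, 0, 0, size) != 0) { close(fd); return -1; }

        void *zeros = NULL;
        if (posix_memalign(&zeros, 4096, CHUNK) != 0) { close(fd); return -1; }
        memset(zeros, 0, CHUNK);

        for (off_t off = 0; off < size; off += CHUNK) {
            if (pwrite(fd, zeros, CHUNK, off) != (ssize_t)CHUNK) {
                free(zeros); close(fd); return -1;
            }
        }
        int rc = fsync(fd);   /* persist the now-written extents */
        free(zeros);
        close(fd);
        return rc;
    }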

zzsheng•2 days ago
Author here. This is not a general argument against fsync; the design depends on SSD-only deployment, preallocated files, O_DIRECT, single-key atomicity, and device write guarantees.
100ms•about 2 hours ago
Your approach looks interesting, but I was curious: when you talk about path-based splitting for ART, do you literally mean always on "/"? I know S3 directory buckets always use "/", but the classical S3 model had no natural separator character, and I was wondering whether supporting those styles of prefix or custom-delimiter queries suffers any impediment in your approach.

Bookmarked your whole blog for later consumption, interesting stuff!

thomas_fa•4 minutes ago
Thanks for the encouragement! Another author here. Yes, if you are interested you can check another of our blog posts [1] on the internal storage engine. Yes, we are limiting the delimiter to "/" to better support POSIX FS semantics. I have just finished the fs feature branch, which has passed all of the pjdfstest POSIX tests [2].

[1] https://fractalbits.com/blog/metadata-engine-for-our-object-...

[2] https://github.com/pjd/pjdfstest

seastarer•27 minutes ago
It's more correct to use O_DSYNC in addition to O_DIRECT. This adds FUA to the disk write if the disk requires it for durability.
thomas_fa•17 minutes ago
Yes, that has also been pointed out in other threads. This could be a very important setting; even some common Linux file systems don't actually do that every time, and we had to disable the disk write cache during boot to make sure the data was truly persistent (at my previous storage company).
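
A minimal sketch of the O_DSYNC + O_DIRECT combination discussed in this subthread, assuming Linux and C (the function name is illustrative): with these flags, each write should only return once the device reports the data stable, via FUA or a cache flush depending on the drive.

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Each subsequent pwrite() on this fd should only return once the data
       is on stable media, so no per-commit fdatasync() is needed.  The usual
       O_DIRECT alignment rules still apply. */
    int open_for_durable_writes(const char *path) {
        return open(path, O_WRONLY | O_DIRECT | O_DSYNC);
    }
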
alexhnn•2 days ago
Working with files is hard [1], and most of the complexity comes from the fsync API. I am glad it can be eliminated from a kv storage engine.

[1] https://news.ycombinator.com/item?id=42805425

bawolff•about 1 hour ago
Am i understanding correctly that you are just targeting consistency and not durability?
7e•about 1 hour ago
This is really great work. Kudos to the team for such an elegant solution.
thomas_fa•2 minutes ago
Thanks for the kind words! You can check out more of our work at https://github.com/fractalbits-labs/fractalbits.
dboreham•about 3 hours ago
Almost full-circle back to when Oracle took over the entire volume and implemented its own filesystem.
dale_glass•about 2 hours ago
I wonder why this is not more common. LVM is easy to set up, and it's already common to allocate volumes for things like disk images for VMs, so why not databases?
jandrewrogers•11 minutes ago
Some Linux filesystems, notably ext4 and XFS, provide the necessary features to get 90% of the benefit simply by using O_DIRECT correctly. The last 10% is achieved by doing direct I/O to raw block devices, with the obvious caveat that this is not as easy to manage.

Both of these are commonly done in database storage engines.
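
A small sketch of what "using O_DIRECT correctly" tends to mean in practice, assuming Linux and C (the 4096-byte alignment is an assumption; the real value can be queried from the device or filesystem): the buffer address, the file offset, and the I/O length all have to respect the logical block size.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    enum { ALIGN = 4096 };   /* assumed logical block size */

    /* len and off are assumed to already be multiples of ALIGN. */
    ssize_t direct_pwrite(int fd, const void *data, size_t len, off_t off) {
        void *buf = NULL;
        if (posix_memalign(&buf, ALIGN, len) != 0)
            return -1;
        memcpy(buf, data, len);          /* copy into an aligned buffer */
        ssize_t n = pwrite(fd, buf, len, off);
        free(buf);
        return n;
    }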

tptacek•about 1 hour ago
If you preallocate and O_DIRECT, haven't you basically soaked up most of the benefit of skipping the filesystem?
pizza234•about 2 hours ago
Because the speed increase is - on modern, properly tuned filesystems - surprisingly small, due to how RDBMSs manage their pool; by working on large container files, they avoid most of the filesystem overhead.