
Discussion (16 Comments)

Ecco · about 3 hours ago
How about using a format that has actually been designed to be a compressed read-only filesystem? Something like a SquashFS or cramfs disk image?
johannes1234321 · about 2 hours ago
When looking at established file formats, I'd start with zip for that use case over tarballs. zip has compression and the ability to access any file. A tar file you have to decompress first.

SquashFS or cramfs and such have less tooling, which makes generating, inspecting, ... more complex.
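That per-file random access can be sketched with Python's stdlib zipfile; the archive contents here are made up for illustration:

```python
import io
import zipfile

# Build a small zip in memory (a stand-in for a real archive).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("b/b.txt", "world" * 1000)

# Random access: the central directory at the end of the file lets
# zipfile seek straight to one member without scanning the archive.
with zipfile.ZipFile(buf) as zf:
    data = zf.read("b/b.txt")   # only this member is decompressed
    print(len(data))            # 5000
```

Because the index lives at the end of the file, a reader only needs the central directory plus the one member's byte range, not the whole archive.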

nrclark · about 1 hour ago
You only have to decompress it first if it's compressed (commonly with gzip, indicated by the .gz suffix).

Otherwise, you can randomly access any file in a .tar as long as:
- the file is seekable/range-addressable
- you scan through it and build the file index first, either at runtime or in advance.

Uncompressed .tar is a reasonable choice for this application because the tools to read/write tar files are very standard, the file format is simple and well-documented, and it incurs no computational overhead.
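The scan-and-index approach can be sketched with Python's stdlib tarfile; the member names are illustrative, and TarInfo.offset_data (an attribute tarfile exposes) gives each member's content offset:

```python
import io
import tarfile

# Build an uncompressed tar in memory (a stand-in for a remote file).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for name, payload in [("a.txt", b"hello"), ("b.txt", b"tar random access")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

# One sequential pass over the headers builds the index;
# TarInfo.offset_data is the byte offset of each member's contents.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:") as tf:
    index = {m.name: (m.offset_data, m.size) for m in tf.getmembers()}

# Afterwards any member is a single seek + read
# (or, for a remote file, an HTTP Range request).
off, size = index["b.txt"]
buf.seek(off)
print(buf.read(size))   # b'tar random access'
```

The index pass only reads the 512-byte headers and seeks over the data blocks, so it is cheap even for large archives.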

johannes1234321 · 4 minutes ago
> Uncompressed .tar is a reasonable choice for this application

Yes, uncompressed tar (with transfer compression, which HTTP offers) is an option for some amount of data.

Until the point where it isn't. zip has similar benefits to tar (+ transfer compression) but hits that failure point later in such a scenario.

electroly · 21 minutes ago
You've just constructed your own crappy in-memory zip file, here. If you have to build your own custom index, you're no longer using the standard tools. If you find yourself building indices of tar files, and you control the creation, give yourself a break and use a zip file instead. It has the index built in. Compression is not required when packing files into a zip, if you don't want it.
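The "compression is not required" point can be sketched with Python's stdlib zipfile; ZIP_STORED packs members as-is while still giving you the central-directory index (file names invented for illustration):

```python
import io
import zipfile

buf = io.BytesIO()
# ZIP_STORED packs files with no compression at all; the zip's
# central directory still provides the per-file index for free.
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("data.bin", b"\x00" * 1024)
    zf.writestr("notes.txt", "already-compressed or tiny files go in as-is")

with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("data.bin")
    # stored => compressed size equals the original size
    print(info.compress_type == zipfile.ZIP_STORED)   # True
    print(info.compress_size == info.file_size)       # True
```

With stored members, a file's bytes sit contiguously in the archive, so a reader can map index entries straight to byte ranges, much like the hand-built tar index but standardized.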
sixdimensional · 15 minutes ago
I second the idea to use a zip. In fact it's what a lot of vendors do because it is so ubiquitous, even Microsoft for example - the "open Microsoft Office XML document format" is just a zip file containing a bunch of folders and XML files.
blipvert · about 1 hour ago
Zip is a piece of cake.

I recently needed to embed noVNC into a Golang app. Serving files via net/http from the zip file is practically a one-liner (then just a Gorilla websocket to take the place of websockify).
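That example is Go; roughly the same idea can be sketched with Python's stdlib (the ZipHandler class and file names are invented for illustration):

```python
import io
import threading
import urllib.request
import zipfile
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Build the "embedded" archive in memory.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("index.html", "<h1>hello from a zip</h1>")

class ZipHandler(BaseHTTPRequestHandler):
    """Serve HTTP responses straight out of the zip archive."""
    def do_GET(self):
        name = self.path.lstrip("/") or "index.html"
        with zipfile.ZipFile(archive) as zf:
            try:
                body = zf.read(name)
            except KeyError:
                self.send_error(404)
                return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), ZipHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/index.html"
page = urllib.request.urlopen(url).read()
print(page)   # b'<h1>hello from a zip</h1>'
server.shutdown()
```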

stingraycharles · about 2 hours ago
Sometimes (read: very often) you can’t choose the format. Obviously if squashfs is available that is a better solution.
sillysaurusx · about 4 hours ago
Only peripherally relevant, but also see Ratarmount: https://github.com/mxmlnkn/ratarmount

It lets you mount .tar files as a read only filesystem.

It’s cool because you basically get random access to the tarball without paying any decompression costs. (It builds an index saying exactly where so-and-so is for every file.)

Lerc · about 1 hour ago
I pulled some similar shenanigans when I built a silly little system on NeoCities https://lerc.neocities.org/

It uses IndexedDB for the filesystem.

Rather dumbly, it loads the files from a tar archive that is encoded into a PNG, because tar files are one of the forbidden file formats.

phiresky · about 2 hours ago
I'm a bit disappointed that this only solves the "find index of file in tar" problem, but not at all the "partially read a tar.gz file" problem. So really you're still reading the whole file into memory; why not just extract the files properly while you're doing that? It takes the same amount of time (O(n)) and less memory.

The gzip random-access problem is a lot more difficult because gzip has internal state. But in any case, solutions exist! Apparently the internal state is only 32kB, so if you save this at 1MB offsets, you can reduce the amount of data you need to decompress for one file access to a constant. https://github.com/mxmlnkn/ratarmount does this, apparently using https://github.com/pauldmccarthy/indexed_gzip internally. zlib even has an example of this method in its own source tree: https://github.com/gcc-mirror/gcc/blob/master/zlib/examples/...
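The seek-point idea can be sketched with Python's stdlib zlib, whose decompressobj supports copy(): snapshot the decompressor state (which carries the ~32 kB window) at intervals during one build-time pass, then resume from the nearest snapshot at read time. The chunk sizes and offsets below are arbitrary illustration values:

```python
import os
import zlib

def build_seek_points(gz, comp_chunk=4096, every=8):
    """One sequential pass over a gzip stream, snapshotting the
    decompressor state at regular compressed offsets."""
    d = zlib.decompressobj(wbits=31)      # 31 = expect a gzip wrapper
    points = [(0, 0, d.copy())]           # (uncomp_off, comp_off, state)
    uoff = coff = 0
    fed = 0
    while coff < len(gz):
        piece = gz[coff:coff + comp_chunk]
        uoff += len(d.decompress(piece))
        coff += len(piece)
        fed += 1
        if fed % every == 0 and not d.eof:
            points.append((uoff, coff, d.copy()))
    return points

def read_range(gz, points, start, length):
    """Read uncompressed[start:start+length] by resuming from the
    nearest preceding seek point, not from byte zero."""
    uoff, coff, state = max((p for p in points if p[0] <= start),
                            key=lambda p: p[0])
    d = state.copy()
    out = bytearray()
    while len(out) < (start - uoff) + length and coff < len(gz):
        out += d.decompress(gz[coff:coff + 4096])
        coff += 4096
    return bytes(out[start - uoff:start - uoff + length])

data = os.urandom(1 << 20)                # 1 MiB of sample data
c = zlib.compressobj(wbits=31)
gz = c.compress(data) + c.flush()
points = build_seek_points(gz)
print(read_range(gz, points, 500_000, 16) == data[500_000:500_016])
```

This trades O(n) index-build time and storage for a bounded amount of decompression per read, which is the same shape of trade-off indexed_gzip and zlib's zran example make.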

All depends on the use case of course. Seems like the author here has a pretty specific one, though I still don't see the advantage of this vs extracting in JS and adding all files individually to memfs. "Without any copying" doesn't really make sense, because the only difference is copying ONE 1MB tar blob into a Uint8Array vs 1000 1kB file blobs.

One very valid constraint the author makes is not being able to touch the source file. If you can do that, there's of course a thousand better solutions to all this - like using zip, which compresses each file individually and always has a central index at the end.

ImJasonH · about 1 hour ago
vlovich123 · about 1 hour ago
> Each seek point is accompanied by a chunk (32KB) of uncompressed data which is used to initialise the decompression algorithm, allowing us to start reading from any seek point.

> Apparently the internal state is only 32kB, so if you save this at 1MB offsets, you can reduce the amount of data you need to decompress for one file access to a constant.

You may need to revisit the definition of a constant. An additional 1/32 of data is small, but it still grows the more data you're trying to process (we call that O(n), not O(1)). Specifically it's ~3%, and you generally want to target 1% for this kind of stuff (one seek point every 3 MiB).

And the process still has to read through the entire gzip once to build that index.

phiresky · 44 minutes ago
I think you're looking at this from a different perspective than I am. At _build time_ you need to process O(n), yes, and generate O(n) additional data. But I said "the amount of data you need to decompress is a constant". At _read time_, you need to do exactly three steps:

1. Load the file index - this one scales with the number of files unless you do something else smart and get it down to O(log(n)). This gives you an offset into the file. *That same offset/32 is an offset into your gzip index.*

2. Take that offset and load 32kB into memory (constant: it does not change with the number of files, the total size, or anything else apart from the actual file you are looking at)

3. Decompress a 1MB chunk (or more if necessary).

So yes, it's a constant.

haunter · about 1 hour ago
Now I want to try how this works with BTFS, which in a similar vein mounts a torrent file or magnet link as a read-only directory: https://github.com/johang/btfs