Replacing a 3 GB SQLite db with a 10 MB FST (finite state transducer) binary

Hendrikto•about 2 hours ago

> I do wish to point out, of course, that the whole reason it was possible to experiment cheaply and come across this serendipity was because 9 months ago, faced with the choice to either do the bad easy thing or the good nothing, I chose to do the bad easy thing.5 The SQLite database worked! I understood how it worked, behind the scenes with its B-trees and its Full Text Search extension.

This is the most important takeaway, imo, and a very valuable technique: Start with the obvious, stupid solution that definitely works. Then do the optimized version, while making sure it matches the naive implementation. In this case, the optimized version could even be generated from the naive one.

simmonmt•39 minutes ago

It jumped out at me too, but because I wondered what it would look like in the AI version of this story. Having had it build the SQL version do you ... a) miss the leap because you don't understand how it works, don't care to know, and go off to vibe the next thing b) ask it lots of questions because reasons to develop that deep understanding then make the leap or c) rely on it (prompt: "this can't be good enough do better") to go make the leap for you.

(Assuming for the sake of argument that you guided it to the SQL version first)

embedding-shape•29 minutes ago

Depends on what your overall goal is with what you're building. Is it to rush out as many features as you possible can, before VC-funding falls through the floor so you too can get a slice of the pie before the party is over? Or are you "retired" in your 30s and now have time to build the perfect software for you? Do you need to publish and release an experiment to see how people react to it or use it, before you can know if it's the right thing or not?

Almost everything needs to be contextualized before you can even begin to answer what the right way forward is, depends so heavily on what situation you're in.

baublet•about 2 hours ago

Came here to add this, too. Sometimes the most valuable thing a solution can buy you is time to think of a better solution.

chrisweekly•about 1 hour ago

Yes - tho I'd refine it to say it's not just more time, it's time plus the forcing function of a working solution requiring more deeply understanding the problem space.

lscharen•about 2 hours ago

I was halfway through the article and began thinking that his described data structure sounded very familiar to something I used about 20 years ago.

Sure enough, the first paragraph on the Wikipedia entry for DAFSA is:

DAFSA is the rediscovery of a data structure called Directed Acyclic Word Graph (DAWG)

hiAndrewQuinn•about 1 hour ago

Apparently the structure itself has a bit of a history. The word 'rediscovery' tipped me off to go to Wikipedia myself and read up more about this.

First Blumer et al., 1983 came up with a "DAWG", but reading the abstract [1] I was left a little confused as to how exactly we get from 'here is how we store all substrings of a string in O(|string|) space, with "is this a substring [yn]" recognition in O(|substring|) time' to the modern DAFSA, as cool and useful as that is. Come to think of it I bet I could use that in some LeetCode problems.

But the structure we actually think of as a DAWG or DAFSA (or FST, I guess, thanks to this Rust crate) is in the paper "The World’s Fastest Scrabble Program". That worked but you had to construct a whole trie first, then compact it down, so build time was a memory hog. Then Dr. Daciuk of 3city sharpened the blade in 2000 by proving that this was as good as it gets in the unsorted case, but on a sorted set you could build the DAFSA incrementally, because an increasingly large part of the graph you were building was already optimized.

And then from there BurntSushi got involved with the implementation and the rest is history.

[1]: https://www.sciencedirect.com/science/article/pii/0304397585...

(That's what I can glean from ~30 minutes of not particularly focused reading. Forgive me if I have made any mistakes.)

carefulfungi•about 1 hour ago

I recently learned about a related data structure designed for scrabble style word searches: https://en.wikipedia.org/wiki/GADDAG.

asibahi•about 1 hour ago

This was a very interesting read. I wonder if similar techniques can apply to Turkish or Japanese dictionaries?

Replacing a 3 GB SQLite db with a 10 MB FST (finite state transducer) binary

⚡ Community Insights

Discussion (9 Comments)Read Original on HackerNews