DE version is available. Content is displayed in original English for accuracy.
Advertisement
Advertisement
⚡ Community Insights
Discussion Sentiment
73% Positive
Analyzed from 10437 words in the discussion.
Trending Topics
#database#data#sqlite#need#file#files#don#databases#using#something

Discussion (280 Comments)Read Original on HackerNews
There is one conclusion that I do not agree with. Near the end, the author lists cases where you will outgrow flat files. He then says that "None of these constraints apply to a lot of applications."
One of the constraints is "Multiple processes need to write at the same time." It turns out many early stage products need crons and message queues that execute on a separate worker. These multiple processes often need to write at the same time. You could finagle it so that the main server is the only one writing, but you'd introduce architectural complexity.
So while from the pure scale perspective I agree with the author, if you take a wider perspective, it's best to go with a database. And sqlite is a very sane choice.
If you need scale, cache the most often accessed data in memory and you have the best of both worlds.
My winning combo is sqlite + in-memory cache.
If your downtime does not cost much, you can host many things on a single tiny computer.
I'm also a convert.
I can't talk though because I actually find myself doing this a lot
"Every database you have ever used reads and writes to the filesystem, exactly like your code does when it calls open()."
Technically not true. Applications like SQLite use mmap to map the file into a locally addressable memory space. This lets you skip the syscalls when reading and writing. The kernel can map that data in dynamically much faster than a userland process can.
Later in the article they go over the process of reading in the entire file into memory, again mmap is much better at this. Would have been nice to see that approach used.
That said, depending who you are talking to, they may not agree with you on "mmap is much better than this". Some people will say you should do what you need in the application logic instead of depending on APIs from the OS. (although not necessarily for the specific example here)
https://db.cs.cmu.edu/mmap-cidr2022/
Not necessarily. The kernel's mmap implementation has quite a strong bias towards certains kinds of access patterns; deviate from them and it can become slower than read(2).
We tried using mmap(2) for audio file i/o in Ardour, and it got notably less bandwidth than just using read(2).
As the years went by, I expected the client to move to something better, but he just stuck with it until he died after about 20 years, the family took over and had everything redone (it now runs Wordpress).
The last time I checked, it had hundreds of thousands of orders and still had good performance. The evolution of hardware made this hack keep its performance well past what I had expected it to endure. I'm pretty sure SQLite would be just fine nowadays.
The very small bonus you get on small apps is hardly worth the time you spend redeveloping the wheel.
During Cretaceous, when dinosaurs were at their peak, sharks had already become very similar to the sharks of today, e.g. there were big sharks that differed very little from the white sharks and tiger sharks of today.
Then the dinosaurs have disappeared, together with the pterosaurs and the mosasaurs, and they have been replaced by other animals, but the sharks have continued to live until today with little changes, because they had already reached an optimized design that was hard to improve.
Besides the sharks, during Cretaceous there already existed along the dinosaurs other 2 groups of big predators that have changed little since then, crocodiles and big constrictor snakes similar to the pythons of today.
Therefore all 3 (sharks, crocodiles and big constrictor snakes) are examples of locally optimum designs that have been reached more than 70 million years ago, without needing after that any major upgrades.
I was building a client side dictionary app with search functionality and thought that using sqlite's wasm port would be the perfect solution.
I used SQLite for a couple years and it started to wear on me. The database files were bigger than they should be, they didn't compress super well, the loading time was a bit slow. Linear search wasn't very fast either. Editing SQLite files and doing joins and stuff was slow and frustrating as well. I wanted something simpler.
I didn't need atomics, I didn't need writes, so I just handrolled a set of indexes from the source tsv files and compressed those with zstd and decompressed them via wasm on every load. Decompressing and loading was now faster than straight loading via SQLite and since my module was 52kb of wasm instead of 800kb for SQLite I could load as many instances of the module as I wanted with no regrets. I use stringzilla for linear scans and it's ridiculously fast.
SQLite is great, but it's not the solution for every problem.
It's a little out of the scope of stringzilla but I should still be able to use it for the heavy lifting
Just adding:
Made the performance on my machine go from 27,700.99 r/s to 89,687.36 r/s.I also tried making the get user a prepared statement, and storing the timestamp as an unix timestamp integer, but that didn't make much difference for me.
You'll likely end up quite a chump if you follow this logic.
sqlite has pretty strong durability and consistency mechanism that their toy disk binary search doesn't have.
(And it is just a toy. It waves away the maintenance of the index, for god's sake, which is almost the entire issue with indexes!)
Typically, people need to change things over time as well, without losing all their data, so backwards compatibility and other aspects of flexibility that sqlite has are likely to matter too.
I think once you move beyond a single file read/written atomically, you might as well go straight to sqlite (or other db) rather than write your own really crappy db.
However the driving motivation for adding a database is not necessarily managing data, but the fact that the database system creates a nice abstraction layer around storing data of relational or non-relational form in non-volatile memory and controlling access to it while other systems are updating it. And because it's a nice abstraction, there are a lot of existing libraries that can take advantage of it in your language of choice without requiring you to completely invent all of that stuff over the top of the filesystem. That has knock-on effects when you're trying to add new functionality or new interaction patterns to an existing system.
And in cases where two or more processes need to communicate using the same data, a database gives you some good abstractions and synchronization primitives that make sense, whereas regular files or IPC require you to invent a lot of that stuff. You could use messaging to communicate updates to data but now you have two copies of everything, and you have to somehow atomize the updates so that either copy is consistent for a point in time. Why not use a database?
Knowing what I know today I would start with some kind of database abstraction even if it's not necessarily designed for transactional data, and I would make sure it handled the numerous concerns I have around data sharing, consistency, atomicity, and notification because if I don't have those things I eventually have to invent them to solve the reliability problems I otherwise run in to without them.
This is also why many databases have persistence issues and can easily corrupt on-disk data on crash. Rocksdb on windows is a very simple example a couple years back. It was regularly having corruption issues when doing development with it.
"Why use spray paint when you can achieve the same effect by ejecting paint from your mouth in a uniform high-velocity mist?" If you happen to have developed that particular weird skill, by all means use it, but if you haven't, don't start now.
That probably sounds soft and lazy. I should learn to use my operating system's filesystem APIs safely. It would make me a better person. But honestly, I think that's a very niche skill these days, and you should consider if you really need it now and if you'll ever benefit from it in the future.
Also, even if you do it right, the people who inherit your code probably won't develop the same skills. They'll tell their boss it's impossibly dangerous to make any changes, and they'll replace it with a database.
This, 100%. Development today is driven by appearances though, you can take advantage of that. Give it a cute name, make sure you have AI generate an emoji-rich README for it, publish it as an open source npm package, then trigger CI a few thousand times to get a pretty download count. They will happily continue using it without fear!
If you start a new job and on your first day they go "Yeah the last guy said we don't need a database, so he rolled his own." are you gonna be excited, or sweating?
Exception being perhaps "The last team chose to build their own data layer, and here's the decision log and architecture docs proving why it was needed."
If that kind of filesystem traffic is unsuitable for your application then you will reinvent journaling or write-ahead logging. And if you want those to be fast you'll implement checkpointing and indexes.
At least write to a temp file(in the same filesystem), fsync the file and its folder and rename it over the original.
Doesn't scale at all, though - all of the data that needs to be self-consistent needs to be part of the same file, so unnecessary writes go through the roof if you're only doing small updates on a giant file. Still gotta handle locking if there is risk of a stray process messing it up. And doing this only handles part of ACID.
I believe syncing the parent directory is only needed for durability and not atomicity, but if you're using this method you're probably not caring about durability in the first place, because you've designed a system that doesn't let you do durability in a fine-grained way without severe performance loss.
This is false, but most fs will. However, there is a lot of fs calls you have to make that you probably don't know about to make the fs operations atomic.
PS The way you propose is probably the hardest way to do an atomic FS operation. It will have the highest probably of failure and have the longest period of operations and service disruption. There is good reason we move rows one at a time or in batches sized to match OS buffers.
https://docs.kernel.org/filesystems/ext4/atomic_writes.html
They are overoptimising for the simplest part of writing the application; the beginning. They've half-implemented an actual database, with none of the safety features. There are a lot of potential headaches that this article has avoided talking about; perhaps because they haven't experienced them yet.
See: https://danluu.com/file-consistency/
What happens when you need to start expanding the scope of this feature? Joining users on profiles, or users on orgs?
Ask yourself: how many shops have seriously written an application backed by files and stuck with it over the long-run? The answer is likely very few. Therefore, this is likely doubling up the work required.
There is a reason people reach for a database first. I'd strongly encourage anyone to avoid doing stuff like this.
yet, for a generic app or server, just don't fuck your brains and go with SQLite
a jsonl file and a single go binary will literally outlive most startup runways.
also, the irony of a database gui company writing a post about how you dont actually need a database is pretty based.
But yeah, the page cache point is real and massively underappreciated. Modern infrastructure discourse skips past it almost entirely. A warm NVMe-backed file with the OS doing the caching is genuinely fast enough for most early-stage products.
weve basically been brainwashed to think we need kubernetes and 3 different databases just to serve a few thousand users. gotta burn those startup cloud credits somehow i guess.
mad respect for the honesty though, actually makes me want to check out db pro when i finally outgrow my flat files.
Similar sentiment.
I don't think it makes any sense to presume everyone around you is brainwashed and you are the only soul cursed with reasoning powers. Might it be possible that "we" are actually able to analyse tradeoffs and understand the value of, say, have complete control over deployments with out of the box support for things like deployment history, observability, rollback control, and infrastructure as code?
Or is it brainwashing?
Let's put your claim to the test. If you believe only brainwashed people could see value in things like SQLite or Kubernetes, what do you believe are reasonable choices for production environments?
In practice, I almost always separate the auth chain from the service chain(s) in that if auth gets kicked over under a DDoS, at least already authenticated users stand a chance of still being able to use the apps. I've also designed auth system essentially abstracted to key/value storage with adapters for differing databases (including SQLite) for deployments...
Would be interested to see how LevelDB might perform for your testing case, in that it seems to be a decent option for how your example is using data.
And worse, overestimate how safe is their data!
All this fancy thing about not using a RDBMS could had been true only if the APIs and actual implementation across ALL the IO path were robust and RELIABLE.
But is not!
EVERY LAYER LIES
ALL of them
ALL OF TIME
That is why the biggest reason building a real database (whatever the flavor) is that there is no way to avoid pay performance taxes all over the place because you can't believe the IO and having a (single | some files) getting hammered over and over make this painfully obvious.
One of the most sobering experiences is that you write your IO with all the care in the world, let the (your brand new) DB run for hours, on good, great, hardware, and in less than a week you will find that that breaks in funny ways.
P.D: Was part of a team doing a db
I'm pretty sure most startups just use a quick and easy CRM that makes this process easy, and that tool will certainly use a database.
There are also things besides databases that I'll DIY and then still wonder why so many people use a premade tool for it, like log4j
I suggest every developer learn how to replicate, backup and restore the very database they are excited about, from scratch at least once. I propose this will teach them what takes to build a production ready system and gain some appreciation for other ways of managing state.
That's why I was encouraging people to make backups and restores of their DB because personally, it has made me appreciate why different designs exist and when to use them vs just slapping a DB connection to an RDS instance and calling it a day.
Total hosting costs are £0 ($0) other than the domain name.
I can use databases just fine, but will never be able to make wise decisions about table layouts, ORMs, migrations, backups, scaling.
I don't understand the culture of "oh we need to use this tool because that's what professionals use" when the team doesn't have the knowledge or discipline to do it right and the scale doesn't justify the complexity.
Well, I guess that at least confirms Oracle on Itanium (!?) still supported RAW 5 years ago.
I'm guessing everyone's on ASM by now though, if they're still upgrading. I ran into a company not long ago with a huge oracle cluster that still employed physical database admins and logical database admins as separate roles...I would bet they're still paying millions for an out of date version of Oracle and using RAW.
I seem to remember Oracle 10g was first released over 20 years ago? It has been EOL for much longer than 5 years...
Since they’re using Go to accept requests and forwarding them to their SQLite connection, it may have been worthwhile to produce the same interface with Rust to demonstrate whether or not SQLite itself was hitting its performance limit or if Go had some hand in that.
Other than that, it’s a good demonstration of how a custom solution for a lightweight task can pay off. Keep it simple but don’t reinvent the wheel if the needs are very general.
It would be much better to write all of this as binary data, omitting separators.
• Since it’s fixed-width and simple, inspecting the data is still pretty easy—there are tools for working with binary data of declared schema, or you could write a few-liner to convert it yourself. You don’t lose much by departing ASCII.
• You might want to complicate it a little by writing a version tag at the start of the file or outside it so you can change the format more easily (e.g. if you ever add a third column). I will admit the explicit separators do make that easier. You can also leave that for later, it probably won’t hurt.
• UUID: 36 bytes → 16 bytes.
• Offset: 20 bytes (zero-padded base-ten integer) → 8 bytes.
• It removes one type of error altogether: now all bit patterns are syntactically valid.
• It’ll use less disk space, be cheaper to read, be cheaper to write, and probably take less code.
I also want to register alarm at the sample code given for func FindUserBinarySearch. To begin with, despite a return type of (*User, error), it always returns nil error—it swallows all I/O errors and ignores JSON decode errors. Then:
That strings.TrimRight will only do anything if your data is corrupted. Not important when you control the writing, but worth noting that UUID string comparison is case-insensitive. Superfluous. ParseInt doesn’t mind leading zeroes, and it’ll probably skip them faster than a separate TrimLeft call. That’s begging to make data corruption difficult to debug. Most corruption will now become dataOffset 0. Congratulations! You are now root.I think a better way to ask this question is "does this application and its constraints necessitate a database? And if so, which database is the correct tool for this context?"
I will still reach for a database 99% or the time, because I like things like SQL and transactions. However, I've recently been working on a 100% personal project to manage some private data; extracting insights, graphing trends, etc. It's not high volume data, so I decided to use just the file system, with data backed at yaml files, with some simple indexing, and I haven't run into any performance issues yet. I probably never will at my scale and volume.
In this particular case having something that was human readable, and more importantly diffable, was more valuable to me than outright performance.
Having said that, I will still gladly reach for a database with a query language and all the guarantees that comes with 99% of the time.
What the world needs is a hybrid - database ACID/transaction semantics with the ability to cd/mv/cp file-like objects.
From that POV I moved from the extreme of "you don't need a state at all (and hence no database)" to the bit more moderate "you usually don't need a file or a DB but you almost certainly will want to query whatever state you store so just get a DB early".
I strongly sympathize with "no microservices" of course. That's an overkill for at least 99% of all projects I've ever seen (was a contractor for a long time). But state + querying is an emergent property as a lot of projects move beyond the prototype phase.
Different languages and stdlib methods can often spend time doing unexpected things that makes what looks like apples-to-apples comparisons not quite equivalent
* Not always super easy
I don't choose a DB over a flat file for its speed. I choose a DB for the consistent interface and redundancy.
Also notable mention for JSON5 which supports comments!: https://json5.org/
Memory of course, as you wrote, also seems reasonable in many cases.
In practice, the records themselves took no less than 30 joins for a flat view of the record data that was needed for a single view of what could/should have been one somewhat denormalied record in practice. In the early 2010's that meant the main database was often kicked over under load, and it took a lot of effort to add in appropriate caching and the search db, that wound up handling most of the load on a smaller server.
Most software is stateful and needs to persist state across restarts, so I would argue that one needs at least SQLite.
On SQLite being safe default: in practice it means supporting multiple databases, say SQLite and Postgres, this is more complicated that supporting just Postgres. As soon as a project leaves localhost and enters cloud development you need talk to a database over network, which warrants MySQL or Postgres.
Which is more complicated: supporting a docker container with mysql or Postgres for local development OR supporting multiple databases in the project?
Of course, the answer could be “it depends”, I but I would not call SQLite a default choice. It would be if you are writing desktop or mobile app, but for anything like a web app it’s a questionable choice.
They were terrific reads; his writing on object-oriented databases was the most fun technical reading I did in grad school. And I even learned a lot!
http://www.redbook.io/
If you need to ever update a single byte in your data, please USE A PROPER DATABASE, databases does a lot of fancy thing to ensure you are not going to corrupt/broke your data on disk among other safety things.
One would think that for a startup of sorts, where things changes fast and are unpredictable, NoSQL is the correct answer. And when things are stable and the shape of entities are known, going for SQL becomes a natural path.
There is also cases for having both, and there is cases for graph-oriented databases or even columnar-oriented ones such as duckdb.
Seems to me, with my very limited experience of course, everything leads to same boring fundamental issue: Rarely the issue lays on infrastructure, and is mostly bad design decisions and poor domain knowledge. Realistic, how many times the bottleneck is indeed the type of database versus the quality of the code and the imlementation of the system design?
NoSQL gains you no speed at all in redesigning your system. Instead, you trade a few hard to do tasks in data migration into an unsurmountable mess of data inconsistency bugs that you'll never actually get into the end of.
> is mostly bad design decisions and poor domain knowledge
Yes, using NoSQL to avoid data migrations is a bad design decision. Usually created by poor general knowledge.
A lot of the bespoke no-sql data stores really started to come to the forefront around 2010 or so. At that time, having 8 cpu cores and 10k rpm SAS spinning drives was a high end server. Today, we have well over 100 cores, with TBs of RAM and PCIe Gen 4/5 NVME storage (u.x) that is thousands of times faster and has a total cost lower than the servers from 2010 or so that your average laptop can outclass today.
You can vertically scale a traditional RDBMS like PostgreSQL to an extreme degree... Not to mention utilizing features like JSONB where you can have denormalized tables within a structured world. This makes it even harder to really justify using NoSQL/NewSQL databases. The main bottlenecks are easier to overcome if you relax normalization where necessary.
There's also the consideration of specialized databases or alternative databases where data is echo'd to for the purposes of logging, metrics or reporting. Not to mention, certain layers of appropriate caching, which can still be less complex than some multi-database approaches.
You could also consider renting an Oracle DB. Yep! Consider some unintuitive facts:
• It can be cheaper to use Oracle than MongoDB. There are companies that have migrated away from Mongo to Oracle to save money. This idea violates some of HN's most sacred memes, but there you go. Cloud databases are things you always pay for, even if they're based on open source code.
• Oracle supports NoSQL features including the MongoDB protocol. You can use the Mongo GUI tools to view and edit your data. Starting with NoSQL is very easy as a consequence.
• But... it also has "JSON duality views". You start with a collection of JSON documents and the database not only works out your JSON schemas through data entropy analysis, but can also refactor your documents into relational tables behind the scenes whilst preserving the JSON/REST oriented view e.g. with optimistic locking using etags. Queries on JSON DVs become SQL queries that join tables behind the scenes so you get the benefits of both NoSQL and SQL worlds (i.e. updating a sub-object in one place updates it in all places cheaply).
• If your startup has viral growth you won't have db scaling issues because Oracle DBs scale horizontally, and have a bunch of other neat performance tricks like automatically adding indexes you forgot you needed, you can materialize views, there are high performance transactional message queues etc.
So you get a nice smooth scale-up and transition from ad hoc "stuff some json into the db and hope for the best" to well typed data with schemas and properly normalized forms that benefit from all the features of SQL.
NoSQL is the "correct" answer if your queries are KV oriented, while predictable performance and high availability are priority (true for most "control planes"). Don't think any well-designed system will usually need to "graduate" from NoSQL to SQL.
Prior: https://news.ycombinator.com/item?id=22249490
No, no it isn't. It never is. Just as building your house on a rubber foundation isn't the correct answer either. This is just cope. Unless your use cases don't care about losing data or data corruption at all, NoSQL isn't the correct answer.
This is by design, the idea is that scaling your application layer is easy but scaling your storage/db layer is not
Hence make the storage dumb and have the application do the joins and now your app scales right up
(But tbh I agree a lot of applications don’t reach the scale required to benefit from this)
Edit: I just submitted a link to Joe Armstrong's Minimum Viable Programs article from 2014. If the response to my comment is about the enterprise and imaginary scaling problems, realize that those situations don't apply to some programming problems.
Why waste time screwing around with ad-hoc file reads, then?
I mean, what exactly are you buying by rolling your own?
“Virding's First Rule of Programming: Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.”
https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule
Wait until you actually need it.
In this case, I feel like using the filesystem directly is the opposite: doing much more difficult programming and creating more complex code, in order to do less.
It depends on how you weigh the cost of the additional dependency that lets you write simpler code, of course, but I think in this case adding a SQLite dependency is a lower long-term maintenance burden than writing code to make atomic file writes.
The original post isn't about simplicity, though. It's about performance. They claim they achieved better performance by using the filesystem directly, which could (if they really need the extra performance) justify the extra challenge and code complexity.
Premature optimisation I believe that's called.
I've seen it play out many times in engineering over the years.
You are just mislabling good architecture as 'premature optimization'. So I will give you another platitude... "There is nothing so permanent as a temporary software solution"
you should be squashing bugs related to your business logic, not core data storage. Local data storage on your one horizontally-scaling box is a solved problem using SQLite. Not to mention atomic backups?
What is more likely, if you are making good decisions, is that you'll reach a point where the simple approach will fail to meet your needs. If you use the same attitude again and choose the simplest solution based on your _need_, you'll have concrete knowledge and constraints that you can redesign for.
Surely it does? Otherwise you cannot trust the interface point with SQLite and you're no further ahead. SQLite being flawless doesn't mean much if you screw things up before getting to it.
Regardless of whether most apps have enough requests per second to "need" a database for performance reasons, these are extremely important topics for any app used by a real business.
At the very least, use a monotonic key, so you can avoid having to periodically sort.
And then another index. And at some point you want to ensure uniqueness or some other constraint.
And then you’re rewriting a half-complete and buggy SQLite. So I’ve come around to defaulting to SQLite/PostgresQL unless I have a compelling need otherwise. They’re usually the right long-term choice for my needs.
That it's now in the box (node:sqlite) for Deno/TS makes it that much more of an easy button option.
That's why it could handle massive traffic with very little issues.
It only handles massive traffic if reads of those static pages are frequent, and updates are rare. When thousands of users are posting, you have to either block everyone on each write, or subdivide the required synchronisation between boards. Also, the regenerated pages might be used just a couple of times before the next rewrite happens (or not viewed at all, like far away board pages, but you still have rewrite all involved pages as a batch), and not much caching happens. In addition to that, each short comment causes multiple full long pages to be regenerated, and stored on disk. You basically get a very ineffective database with very ineffective primitives.
So usually every image board owner starts with decoupling of message posting and page updates to only regenerate pages once in a while, and process multiple edits at once, then some things stop working, as they assume each state change happens in full in multiple places, then they try to rewrite everything or switch to a real database engine.
Just have to use locks to be careful with writes.
I figured I'd migrate it to a database after maybe 10k users or so.
Sadly no solution for non-rooted consoles.
Overhead in any project is understanding it and onboarding new people to it. Keeping on "mainline" path is key to lower friction here. All 3 languages have well supported ORM that supports SQLite.
A lot of data pipeline work can function pretty well with each step producing artifacts that the next can consume. If you need unique items, use uuids. If you want to check whether an item exists, use a hash of something (maybe a url). And each day goes in its own subdirectory. It works.
Sometimes, though, you end up reinventing database constructs. Joins, indexing, versioning. A little of that is ok, but it gets gnarly fast. If you mess up, it could mean corrupting your data in a way that’s irreversible. Or, corrupting your data in a way that’s reversible but really hard.
So a database has the benefit of enforcing consistency that a glob over json files doesn’t. It doesn’t mean every one-off script needs a blue-green Postgres DB, but there can be an inflection point.
In my (hypothetical, 'cause I never actually sat down and wrote that) case, I wanted the personal transactions in a month, and I realized I could just keep one single file per month, and read the whole thing at once (also 'cause the application would display the whole month at once).
Filesystems can be considered a key-value (or key-document) database. The funny thing about the example used in the link is that one could simply create a structure like `user/[id]/info.json` and directly access the user ID instead of running some file to find them -- again, just 'cause the examples used, search by name would be a pain, and one point where databases would handle things better.
Now that said, if there's value to the "database" being human readable/editable, json is still well worth a consideration. Dealing with even sqlite is a pain in the ass when you just need to tweak or read something, especially if you're not the dev.
How? With SQL is super easy to search, compare, and update data. That's what it’s built for.
However, if your all application state can be represented in a single json file of less than a dozen MB, yes, a database can be overkill.
So my opinion has thoroughly shifted to "start with a database, and if you _really_ don't need one it'll be obvious.
But you probably do.
I ended up just buying a VPS, putting openclaw on it, and letting it Postgres my app.
I feel like this article is outdated since the invention of OpenClaw/Claude Opus level AI Agents. The difficulty is no longer programming.
Then proceeds to (poorly) implement database on files.
Sure, Hash Map that take ~400mb in memory going to offer you fast lookups. Some workloads will never reach this size can be done as argument, but what are you losing by using SQLite?
What happens when services shutdowns mid write? Corruption that later results in (poorly) implemented WAL being added?
SQLite also showed something important - it was consistent in all benchmarks regardless of dataset size.
Nope. There are non-persistent in-memory databases too.
In fact, a database can be a plethora of things and the stuff they were building is just a subset of a subset (persistent, local relational database)
It's the opposite. A file system is a database. And databases can recursively store their data within another database.
But if you have data that is static or effectively static (data that is updated occasionally or batched), then serving via custom file handling can have its place.
If the records are fixed width and sorted on the key value, then it becomes trivial to do a binary search on the mmapped file. It's about as lightweight as could be asked for.
1. You're paying very little for 1000x the features.
2. The chances your application will keep doing "pure ID lookups" forever are zero. The moment you need to query the data in any other way or have more than one writer you're going to have to throw all this nonsense code into the trash.
3. Do you need the 1.7x speedup? No, of course you don't. It's just optimizing for the sake of optimizing.
I'd have just used sqlite to begin with and not benchmarked anything. No custom code, no need to waste any time, great performance, amazing flexibility — all with minimum effort.
If you want to do this for fun or for learning? Absolutely! I did my CS Masters thesis on SQL JOINS and tried building my own new JOIN indexing system (tl;dr: mine wasn't better). Learning is fun! Just don't recommend people build production systems like this.
Is this article trolling? It feels like trolling. I struggle to take an article seriously that conflates databases with database management systems.
A JSON file is a database. A CSV is a database. XML (shudder) is a database. PostgreSQL data files, I guess, are a database (and indexes and transaction logs).
They never actually posit a scenario in which rolling your own DBMS makes sense (the only pro is "hand rolled binary search is faster than SQLite"), and their "When you might need" a DBMS misses all the scenarios, the addition of which would cause the conclusion to round to "just start with SQLite".
It should basically be "if you have an entirely read-only system on a single server/container/whatever" then use JSON files. I won't even argue with that.
Nobody - and I mean nobody - is running a production system processing hundreds of thousands of requests per second off of a single JSON file. I mean, if req/sec is the only consideration, at that point just cache everything to flat HTML files! Node and Typescript and code at all is unnecessary complexity.
PostgreSQL (MySQL, et al) is a DBMS (DataBase Management System). It might sound pedantic but the "MS" part is the thing you're building in code:
concurrency, access controls, backups, transactions: recovery, rollback, committing, etc., ability to do aggregations, joins, indexing, arbitrary queries, etc. etc.
These are not just "nice to have" in the vast, vast majority of projects.
"The cases where you'll outgrow flat files:"
Please add "you just want to get shit done and never have to build your own database management system". Which should be just about everybody.
If your app is meaningfully successful - and I mean more than just like a vibe-coded prototype - it will break. It will break in both spectacular ways that wake you up at 2AM and it will break in subtle ways that you won't know about until you realize something terrible has happened and you lost your data.
Didn't we just have this discussion like yesterday (https://ultrathink.art/blog/sqlite-in-production-lessons)?
It feels like we're throwing away 50 years of collective knowledge, skills, and experience because it "is faster" (and in the same breath note that nobody is gonna hit these req/sec.)
I know, it's really, really hard to type `yarn add sqlite3` and then `SELECT * FROM foo WHERE bar='baz'`. You're right, it's so much easier writing your own binary search and indexing logic and reordering files and query language.
Not to mention now you need a AGENTS.md that says "We use our own home-grown database nonsense if you want to query the JSON file in a different way just generate more code." - NOT using standard components that LLMs know backwards-and-forwards? Gonna have a bad time. Enjoy burning your token budget on useless, counter-productive code.
This is madness.
Even it's postgres, it is still a file on disk. If there is need something like like partitioning the data, it is much more easier to write the code that partitions the data.
If there is a need to adding something with textinputs, checkboxes etc, database with their admin tools may be a good thing. If the data is something that imported exported etc, database may be a good thing too. But still I don't believe such cases, in my ten something years of software development career, something like that never happened.
…and it won’t get better anytime soon.
That alone is a terrible thing. Open formats are so much more user friendly
Don't use a sports-car to haul furniture or a garbage truck as an ambulance. For the use case and scale mentioned in the article it's obvious not to use a database.
Am I missing something? I guess many people are the using the tools they are familiar with and rarely question whether they are really applicable. Is that the message?
I think a more interesting question is whether you will need a single source of truth. If you don't you can scale on many small data sets without a database.
I will say this before I shut up with my rant: If you start with a design that scales you will have an easier to scale when it is time without re-engineering your stack. Whether you think you will need to scale depends on your projected growth and the nature of your problem (do you need a single source of truth, etc.)
Edits: Spelling