Discussion (21 Comments)
Columnar storage is very effectively compressed so one "page" actually contains a lot of data (Parquet rowgroups default to 100k records IIRC). Writing usually means replacing the whole table once a day or appending a large block, not many small updates. And reading usually would be full scans with smart skipping based on predicate pushdown, not following indexes around.
So the same two million row table that in a traditional db would be scattered across many pages might be four files on S3, each with data for one month or whatnot.
But also, in this space people are more tolerant of latency. The whole design is not "make operations over thousands of rows fast" but "make operations over billions of rows possible", with "not slow" only as a second priority.
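For illustration, a query like this (hypothetical bucket and schema) prunes whole files via the partition value and skips row groups inside the remaining files using their min/max statistics:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")  # needed for s3:// paths

# Hypothetical layout: one Parquet file per month under a hive-style path.
# The predicate is pushed down, so most of the data is never read.
con.sql("""
    SELECT count(*), avg(amount)
    FROM read_parquet('s3://my-bucket/sales/month=*/*.parquet',
                      hive_partitioning = true)
    WHERE month = '2024-06' AND amount > 100
""").show()
```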
So if you typically use a file-backed DuckDB database in one process and want to quickly modify something in that database using the DuckDB CLI (like you might connect SequelPro or DBeaver to make changes to a DB while your main application is 'using' it), then it complains that it's locked by another process and doesn't let you connect to it at all.
This is unlike SQLite, which supports and handles this in a thread-safe manner out of the box. I know it's DuckDB's explicit design decision[0], but it would be amazing if DuckDB could behave more like SQLite when it comes to this sort of thing. DuckDB has incredible quality-of-life improvements with many extra types and functions supported, not to mention all the SQL dialect enhancements allowing you to type much more concise SQL (they call it "Friendly SQL"), which executes super efficiently too.
[0] https://duckdb.org/docs/current/connect/concurrency
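For completeness, the documented escape hatch is read-only mode, though it only helps when no process holds the write lock. A minimal sketch, file and table names illustrative:

```python
import duckdb

# DuckDB allows either one read-write process or any number of read-only
# processes on a database file, but never both at once. So this works for
# shared inspection only while the main application is not connected
# read-write:
con = duckdb.connect('app.duckdb', read_only=True)
con.sql("SELECT count(*) FROM my_table").show()

# CLI equivalent: duckdb -readonly app.duckdb
```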
If you want to achieve concurrent writes, you can always embed DuckDB in a web app of some kind. This doesn't seem that different from any other DBMS; it's just that DuckDB is an embedded database.
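A minimal sketch of that pattern (names are illustrative; any web framework would sit on top of these helpers):

```python
import threading
import duckdb

# One process owns the database; all writes funnel through a single
# connection guarded by a lock, while readers get per-thread cursors.
_con = duckdb.connect("app.duckdb")
_con.execute("CREATE TABLE IF NOT EXISTS events(name VARCHAR, at TIMESTAMP)")
_write_lock = threading.Lock()

def record_event(name: str) -> None:
    with _write_lock:  # serialize writers inside the one owning process
        _con.execute("INSERT INTO events VALUES (?, now())", [name])

def count_events() -> int:
    # cursor() duplicates the connection so reads can run concurrently
    return _con.cursor().execute("SELECT count(*) FROM events").fetchone()[0]
```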
I updated your reference [0] with this information.
You can read and write a folder of Parquet files on your local drive, managed by DuckLake. It supports schema evolution and versioning too.
Basically SQLite for Parquet.
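A minimal sketch based on the DuckLake docs (paths are illustrative):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake; LOAD ducklake;")

# Table metadata goes into the small catalog database; the table data
# itself is written as Parquet files under DATA_PATH.
con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')")
con.sql("CREATE TABLE lake.events AS SELECT 42 AS id, now() AS seen_at")
con.sql("SELECT * FROM lake.events").show()
```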
We ended up building a SQLite + Vortex file alternative for our use case: https://spice.ai/blog/introducing-spice-cayenne-data-acceler...
Differential storage
Append-only layers with PostgreSQL metadata. DuckDB sees a normal file; OpenDuck persists data as immutable sealed layers addressable from object storage. Snapshots give you consistent reads. One serialized write path, many concurrent readers.
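A toy illustration of that reader/writer pattern (not OpenDuck's actual code; names invented here):

```python
import threading

class LayeredStore:
    """Writers append immutable 'layers'; a snapshot is just the list of
    layers at some instant, so readers never see a half-written state."""

    def __init__(self) -> None:
        self._layers: list[bytes] = []   # sealed, immutable once appended
        self._write_lock = threading.Lock()

    def append_layer(self, layer: bytes) -> None:
        with self._write_lock:           # one serialized write path
            self._layers.append(layer)

    def snapshot(self) -> list[bytes]:
        return list(self._layers)        # frozen view for a reader
```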
Hybrid (dual) execution
A single query can run partly on your machine and partly on a remote worker. The gateway splits the plan, labels each operator LOCAL or REMOTE, and inserts bridge operators at the boundaries. Only intermediate results cross the wire.
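A toy sketch of that splitting step (not OpenDuck's planner; types and names invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    site: str                       # 'LOCAL' or 'REMOTE'
    children: list['Op'] = field(default_factory=list)

def insert_bridges(op: Op) -> Op:
    """Walk the plan tree and wrap every edge where the site label changes
    in a BRIDGE operator, the point where intermediate results cross the
    wire."""
    op.children = [insert_bridges(c) for c in op.children]
    op.children = [
        Op('BRIDGE', op.site, [c]) if c.site != op.site else c
        for c in op.children
    ]
    return op
```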
OpenDuck takes a different approach: query federation through a gateway that splits execution across local and remote workers. My use case requires every node to serve reads independently with zero network latency, and to keep running if other nodes go down.
The PostgreSQL dependency for metadata feels heavy. Now you're operating two database systems instead of one. In my setup DuckDB stores both the Raft log and the application data, so there's a single storage engine to reason about.
Not saying my approach is universally better. If you need to query across datasets that don't fit on a single machine, OpenDuck's architecture makes more sense. But if you want replicated state with strong consistency, Raft + DuckDB works very well.
When I look at SQLite I see a clear message: a database in a file. I think DuckDB is that, too. But it's also an analytics engine like Polars, works with other DB engines, supports Parquet, comes with a UI, and has two separate warehouse ideas which both deviate from DuckDB's core ideas.
Yes, DuckLake and MotherDuck are separate entities, but they are still part of the ecosystem.
However I'd like to point out that that is exactly the reason why DuckDB relies so heavily on its extension mechanism, even for features that some may consider to be "essential" for an analytical system. Take for example the parquet, json, and httpfs extensions. Also features like the UI you mention are isolated from core DuckDB by living in an extension.
I'd argue that core DuckDB is still very much the same lightweight, portable, no-dependency system that it started out as (and which was very much inspired by how effective SQLite is by being so).
Maybe some interesting behind-the-scenes: to further solidify core DuckDB and guard it from the complexity of its ever growing extension ecosystem, one of the big items currently on our roadmap (see https://duckdb.org/roadmap) is to make significant improvements to DuckDB's stable C extension API.
disclaimer: I work at DuckDB Labs ;)
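For instance, extensions are installed and loaded explicitly (recent DuckDB versions also autoload trusted extensions on first use):

```python
import duckdb

con = duckdb.connect()
# Even "essential" features live outside the core binary:
con.sql("INSTALL httpfs; LOAD httpfs;")  # remote files, S3
con.sql("INSTALL json; LOAD json;")      # JSON reading and functions
con.sql("SELECT extension_name, loaded FROM duckdb_extensions()").show()
```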
But it's also stuff like `"SELECT * FROM my_df"`. It's super cool, but why is my database connecting to an in-memory pandas DataFrame? On the other hand, DuckDB can connect to remote Parquet files and interact with them without (explicitly) importing them.
In these examples, DuckDB feels more like an ephemeral SQL-esque Pandas/Polars alternative rather than a database.
Probably it's just me losing track of what a database is and we've evolved from "a monolithic and permanent thing that you store data on and read data from via queries".
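For reference, what makes `SELECT * FROM my_df` work is DuckDB's replacement scan: an unknown table name is resolved against Python variables in scope, so the DataFrame is queried in place. A minimal example:

```python
import duckdb
import pandas as pd

my_df = pd.DataFrame({"city": ["Paris", "Oslo"], "pop": [2100000, 700000]})

# No import step: DuckDB resolves the unknown table name 'my_df' against
# local Python variables (a "replacement scan") and reads the DataFrame
# in place.
duckdb.sql("SELECT city FROM my_df WHERE pop > 1000000").show()

# Remote Parquet is handled the same lazily-scanned way:
# duckdb.sql("SELECT * FROM 'https://example.com/data.parquet' LIMIT 5")
```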
Obviously not a production implementation.
Show HN-style posts have become completely worthless to me; everything now is just vibe-coded, cloud-chasing slop.
In my case, my systems produce "warnings" for small errors that I want to aggregate and review (drill down into) from time to time.
I was hesitating between using something like OpenTelemetry to send logs/metrics for those, or just adding a "warnings" table to my TimescaleDB and using some aggregates to drill down and possibly display some chunks to review...
But another possibility, to avoid running TimescaleDB/ClickHouse and just rely on S3, would be to upload those as Parquet files to a bucket through DuckDB, and then query them from time to time for stats.
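A sketch of that option (bucket, paths, and the `new_warnings` source are hypothetical):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
# S3 credentials would go through CREATE SECRET (omitted here).

# Append each batch of warnings as its own Parquet file...
con.sql("""
    COPY (SELECT * FROM new_warnings)
    TO 's3://my-bucket/warnings/batch-2024-06-01.parquet' (FORMAT parquet)
""")

# ...then aggregate across all batches with a glob at read time.
con.sql("""
    SELECT source, count(*) AS n
    FROM read_parquet('s3://my-bucket/warnings/*.parquet')
    GROUP BY source
    ORDER BY n DESC
""").show()
```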
Would you have a recommendation?