Discussion (19 Comments) on HackerNews
Disaggregated storage and disaggregated compute have been an open trend in DBMS development for the last half-decade. This is an obvious move with modern computing paradigms, and the academic literature has a standard name for it.
This feels like "JAMStack" from Netlify happening all over again.
I tweeted about this in 2022, as a general trend, and also from the RocksDB meetup emphasizing disaggregated storage:
- https://x.com/GavinRayDev/status/1607769112234823680
- https://x.com/GavinRayDev/status/1600666127025156096
"Basic literacy" -> "Prompt Engineering"
"P2P networking" -> "Web3"
"Service-Oriented Architecture" -> "Microservices"
Maybe I'm old-man-yelling-at-cloud.
Since the data is on S3 (or a lake), you can perform direct-to-S3 operations such as data loading, reading the data with engines other than Postgres, and more.
But if you say "Disaggregated Storage on S3" then you have the flexibility to change that to "Disaggregated Storage on FOOBAR" to avoid confusion.
I don't follow: read requests are not served from the WAL. They read the current state of the page from the buffer cache, where the page is updated after the change (FPI or not) is written to the WAL.
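To make the point above concrete, here is a minimal toy sketch (all names hypothetical, not Postgres internals) of the write-ahead invariant being described: a change is appended to the WAL first, then applied to the page in the buffer cache, and reads are served from the buffer cache, never by replaying the WAL.

```python
class BufferCache:
    """Toy page cache: page_id -> current page state."""
    def __init__(self):
        self.pages = {}

    def get(self, page_id):
        return self.pages.get(page_id)

    def put(self, page_id, page):
        self.pages[page_id] = page


class Database:
    def __init__(self):
        self.wal = []              # append-only log of (page_id, change)
        self.cache = BufferCache()

    def write(self, page_id, change):
        # 1. Write-ahead: the change is durable in the WAL first.
        self.wal.append((page_id, change))
        # 2. Only then is the cached page updated in place.
        page = self.cache.get(page_id) or []
        self.cache.put(page_id, page + [change])

    def read(self, page_id):
        # Reads see the current page state from the buffer cache;
        # the WAL is only replayed during crash recovery.
        return self.cache.get(page_id)


db = Database()
db.write("p1", "row-a")
db.write("p1", "row-b")
print(db.read("p1"))  # ['row-a', 'row-b']
```

The point of the ordering is durability: if the process crashes between steps 1 and 2, recovery replays the WAL to rebuild the page, but a normal read never touches the log.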
In the past we relied on Postgres compute to periodically send a full page, so reconstructing a page was always a bounded process. Once we turned that off (and got all those perf gains) we hit another problem: unbounded page reconstruction, which we had to solve separately.
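The bounded-vs-unbounded distinction can be sketched as follows (a toy model, not the actual implementation): rebuilding a page means starting from the most recent full-page image (FPI) and replaying the deltas after it. With periodic FPIs the replay chain has a fixed upper bound; without them, it grows with every write.

```python
def reconstruct(records):
    """Rebuild a page from a WAL stream, oldest record first.

    records: list of ('fpi', page_contents) or ('delta', change).
    Returns (page, replayed) where `replayed` counts deltas applied.
    """
    # Start from the most recent full-page image, if any.
    start, page = 0, []
    for i, (kind, payload) in enumerate(records):
        if kind == "fpi":
            start, page = i + 1, list(payload)

    # Replay only the deltas after that FPI. If FPIs arrive
    # periodically, `replayed` is bounded by the FPI interval;
    # with no FPIs it equals the entire history.
    replayed = 0
    for _, payload in records[start:]:
        page.append(payload)
        replayed += 1
    return page, replayed


# With an FPI present, only the two deltas after it are replayed:
recs = [("delta", "a"), ("fpi", ["a"]), ("delta", "b"), ("delta", "c")]
page, n = reconstruct(recs)
print(page, n)  # ['a', 'b', 'c'] 2
```

Dropping FPIs removes the fixed starting point, so the replay cost becomes proportional to the full delta history, which is the unbounded-reconstruction problem the comment describes solving separately.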
[0] https://ducklake.select/
However generally disaggregating storage makes HA simpler and allows for things like zero downtime patching: https://www.databricks.com/blog/zero-downtime-patching-lakeb...
Read replicas can be "shallow": you don't need to replicate all the data to create a replica. This allows creating them very quickly (sub-second).
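One way a "shallow" replica can work (a hedged sketch with hypothetical names, not a description of Lakebase's actual mechanism) is copy-on-write over shared storage: the replica references the primary's pages read-only and keeps only its own divergent pages in a local overlay, so creation copies no data at all.

```python
class ShallowReplica:
    """Copy-on-write replica over a shared, read-only page store."""

    def __init__(self, base_pages):
        self.base = base_pages   # shared storage, never copied
        self.overlay = {}        # only pages this replica has diverged on

    def read(self, page_id):
        # Prefer the local overlay; fall back to shared storage.
        return self.overlay.get(page_id, self.base.get(page_id))

    def write_local(self, page_id, page):
        # Divergence lands in the overlay; shared pages stay untouched.
        self.overlay[page_id] = page


primary_pages = {"p1": "v1", "p2": "v2"}        # pages on shared storage
replica = ShallowReplica(primary_pages)         # O(1): no data copied
print(replica.read("p1"))  # v1
```

Because creation is just taking a reference to existing storage, spin-up time is independent of database size, which is what makes sub-second replica creation plausible.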
All the extensions still work. We don't support Citus today, but mostly because customers aren't asking for it, rather than due to technical limitations. We support lots of extensions: https://docs.databricks.com/aws/en/oltp/projects/extensions
Operationally, how do you handle landing that large of a perf improvement? If my data store changed that much in a week it could break something.
This appears to only have an effect on data-lake-style installs, where storage is separate from compute. It's not going to have any effect on a small Postgres install backing a generic one-off app.
All of this to say that a ton of people are on some sort of managed cloud postgres where the compute is almost always separated from the storage even for the small instances.
Neon et al. will tell you they scale, and I'm sure they can, but the number of enterprises that actually exceed what can be put on a few large servers is pretty low. You gotta lock them in early so their orgs never develop the expertise to move off, on the off chance they get big.
Small and large instances benefit from this performance optimization.