Discussion (70 Comments)
> If you read any text that defines 4NF, the first new term you hear is “multivalued dependency”. [Kent 1983] also uses “multivalued facts”. I may be dumb but I only very recently realized that it means just “a list of unique values”. Here it would be even better to say that it’s a list of unique IDs.
This is an inaccurate characterization, and the rest of the post only makes sense when viewed through this strawman. The reason 4NF is explained in the "weird, roundabout way" is because it demonstrates [one of] the precise problem[s] the normal form sets out to solve: a combinatorial explosion of rows.
Suppose you have a single table of (product, supplier, warehouse) rows, where a product's suppliers and its warehouses vary independently of each other.

If you only ever add an additional supplier or an additional warehouse for a given product, that only adds one row. But if you add both to the same product, you now have 4 rows for a single product; if you add 5 suppliers and 3 warehouses to the same product, you now have 15 rows for a single product, and so on. This fact might be lost on someone creating a table with future expansion in mind without thinking it through, because they'd never hit the cross-product, so the design would seem reasonable.

The conclusion reached (modulo treating an array as an atomic value) is in fact in 4NF, but it isn't clear why 4NF is needed if you redefine multivalued dependency to mean a set.
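The combinatorics are easy to verify. Here is a minimal sketch using SQLite from Python; the table and column names (product/supplier/warehouse) are assumptions reconstructed from the discussion, not the original poster's schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE product_supplier_warehouse "
    "(product TEXT, supplier TEXT, warehouse TEXT)"
)

# With 5 suppliers and 3 warehouses for one product, keeping the single
# table consistent requires every supplier x warehouse combination.
suppliers = [f"S{i}" for i in range(1, 6)]
warehouses = [f"W{i}" for i in range(1, 4)]
con.executemany(
    "INSERT INTO product_supplier_warehouse VALUES ('P1', ?, ?)",
    [(s, w) for s in suppliers for w in warehouses],
)
rows = con.execute(
    "SELECT COUNT(*) FROM product_supplier_warehouse WHERE product = 'P1'"
).fetchone()[0]
print(rows)  # 15 rows for a single product

# The 4NF decomposition stores each independent fact once: 5 + 3 rows.
con.execute("CREATE TABLE product_supplier (product TEXT, supplier TEXT)")
con.execute("CREATE TABLE product_warehouse (product TEXT, warehouse TEXT)")
con.executemany("INSERT INTO product_supplier VALUES ('P1', ?)",
                [(s,) for s in suppliers])
con.executemany("INSERT INTO product_warehouse VALUES ('P1', ?)",
                [(w,) for w in warehouses])
```

The single-table design multiplies (5 × 3 = 15 rows) where the decomposed design adds (5 + 3 = 8 rows), which is exactly the anomaly 4NF addresses.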
Someone, somewhere writing down a list and that list being blessed with the imprimatur of Academic Approval (TM) doesn't mean it is actually useful... sometimes it just means that it made it easy to write multiple choice test questions. (e.g., "What does Layer 2 of the OSI network model represent? A: ... B: ... C: ... D: ..." to which the most appropriate real-world answer is "Who cares?")
Breaking 1NF is essentially always incorrect. You're fundamentally limiting your system, and making it so that you will struggle to perform certain queries. Only break 1NF when you're absolutely 100% certain that nobody anywhere will ever need to do anything even slightly complex with the data you're looking at. And then, probably still apply 1NF anyway. Everyone who ever has to use your system is going to hate you when they find this table, because you didn't think of the situation that they're interested in. "Why does this query use 12 CTEs and random functions I've never heard of, and take 5 minutes to return 20,000 rows?" "You broke 1NF."
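A small sketch of the kind of query pain this comment describes, using SQLite from Python; the table and the comma-separated `tags` column are made-up illustrations, not from the thread:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Breaking 1NF: several values packed into one comma-separated column.
con.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, tags TEXT)")
con.executemany(
    "INSERT INTO article VALUES (?, ?)",
    [(1, "sql,normalization"), (2, "sqlite"), (3, "postgresql,sql")],
)

# "Which articles have the tag 'sql'?" now needs string surgery. A naive
# LIKE '%sql%' would wrongly match 'sqlite' and 'postgresql'; wrapping the
# column in delimiters is the usual workaround.
hits = con.execute(
    "SELECT id FROM article WHERE ',' || tags || ',' LIKE '%,sql,%'"
).fetchall()
print([r[0] for r in hits])  # [1, 3]
```

With a normalized one-row-per-tag table, the same question is a plain equality predicate that can use an index; the string-matching version cannot.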
2NF is usually incorrect to break. Like it's going to be pretty obnoxious to renormalize your data using query logic, but it won't come up nearly as frequently. If it's really never going to come up that often in practical terms, then okay.
3NF and BCNF are nice to maintain, but the number of circumstances where they're just not practical or necessary starts to feel pretty common. Further, the complexity of the query to undo the denormalization will not be as obnoxious as it is for 1NF or 2NF. But if you can do it, you probably should normalize to here.
4NF and higher continue along the same lines, but increasingly get to what feel like pretty arbitrary requirements, or situations where the cost you're paying in indexes starts to become higher than the relational-algebra benefits. Your database's disk-usage-by-table report is going to be dominated by junction tables, foreign key constraints, and indexes, and all you're really buying with that disk space is academic satisfaction.
FK constraints add a negligible amount of space, if any. The indexes they require do, certainly, but presumably you're already doing joins on those FKs, so they should already be indexed.
Junction tables are how you represent M:N relationships. If you don't have them, you're either storing multiple values in an array (which, depending on your POV, may or may not violate 1NF), or you have a denormalized wide table with multiple attributes, some of which are almost certainly NULL.
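For concreteness, a minimal junction-table sketch in SQLite via Python; the student/course/enrollment names are assumptions chosen for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
PRAGMA foreign_keys = ON;
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT);
-- The junction table: one row per (student, course) pair, nothing else.
CREATE TABLE enrollment (
    student_id INTEGER NOT NULL REFERENCES student(id),
    course_id  INTEGER NOT NULL REFERENCES course(id),
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO course  VALUES (10, 'Databases');
INSERT INTO enrollment VALUES (1, 10), (2, 10);
""")

# Every query through the M:N relationship costs a join on the junction.
n = con.execute("""
    SELECT COUNT(*)
    FROM student s
    JOIN enrollment e ON e.student_id = s.id
    WHERE e.course_id = 10
""").fetchone()[0]
print(n)  # 2
```

The composite primary key doubles as the uniqueness constraint on the pair, which is the "correctness above all else" point: the schema itself forbids duplicate enrollments.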
Also, these all serve to prevent various forms of data anomalies. Databases must be correct above all else; if they're fast but wrong, they're useless.
Yeah, the problem is that when you get to 4NF+, you're often looking at creating a new table, joined through a junction table, for a single multi-valued field that may hold only a single value a plurality or majority of the time. So you need the base table, the junction table with at least two columns, and the actual data table.
So, you've added two tables, two foreign key constraints, two primary key indexes, potentially more non-clustered indexes... and any query means you need two joins. And data validation is hard because you need to use an anti-join to find missing data.
Or you can go with a 1:N relationship. Now you have only one more table, at the cost of potentially duplicating values between entities. But if we're talking about, say, telephone numbers? Sure, different entities might share the same phone number. Do you need a junction table so you don't duplicate a phone number? You're certainly not saving disk space or improving performance by doing that unless there are regularly dozens of individual records associated with a single phone number.
And if the field is 1:1... or even 90% or 95% 1:1... do you really need a separate table just so you don't store a NULL in a column? You're not going to be eliminating nulls from your queries. They'll be full of LEFT JOINs everywhere; three-valued logic isn't going anywhere.
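The LEFT JOIN + NULL situation described above, sketched in SQLite via Python (person/phone names are made up for illustration): moving a mostly-1:1 attribute into its own table still leaves NULLs in the results.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE phone  (person_id INTEGER REFERENCES person(id), number TEXT);
INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO phone  VALUES (1, '555-0100');   -- Bob has no phone row
""")

# An inner join would silently drop Bob, so every query needs LEFT JOIN,
# and the NULL reappears in the result anyway.
rows = con.execute("""
    SELECT p.name, ph.number
    FROM person p
    LEFT JOIN phone ph ON ph.person_id = p.id
    ORDER BY p.id
""").fetchall()
print(rows)  # [('Ann', '555-0100'), ('Bob', None)]
```

The NULL hasn't been eliminated, only relocated from a column in the base table to the output of every query that touches the attribute.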
> Databases must be correct above all else; if they're fast but wrong, they're useless.
Yeah, and if they're "correct" but you can't get it to return data in a timely manner, they're also useless. A database that's a black hole is not an improvement. If it takes 20 joins just to return basic information, you're going to run into performance problems as well as usability problems. If 18 of those joins are to describe fidelity that you don't even need?
One problem is that normal forms are underspecified even by the academy.
E.g., Millist W. Vincent "A corrected 5NF definition for relational database design" (1997) (!) shows that the traditional definition of 5NF was deficient. 5NF was introduced in 1979 (I was one year old then).
2NF and 3NF should basically be merged into BCNF, if I understand correctly, and treated like a general case (as per Darwen).
Also, the numeric sequence is not very useful because there are at least four non-numeric forms (https://andreipall.github.io/sql/database-normalization/).
Also, personally I think that 6NF should be foundational, but that's a separate matter.
Well, we are roughly the same age then. Ours is a cynical generation.
"One problem is that normal forms are underspecified even by the academy."
The cynic in me would say they were doing their job by the example I gave, which is just to provide easy test answers, after which there wasn't much reason to iterate on them. I imagine waving around normal forms was a good gig for consultants in the 1980s, but I bet even then the real practitioners had a skeptical, arm's-length relationship with them.
Even when it's purely performance-related, it usually gets a shrug, and "it's good enough." Cool, you're wrecking the B+tree, maybe don't do that. It's as if I said, "I'm using an array to store these millions of items that I later need to de-duplicate," and when someone suggests maybe using a set, I dismiss it.
To stay on the main topic, same for the "normalization forms". Do what your database needs.
The concepts are just attractive nuisances. They are more likely to hurt someone than to help them.
Certainly a lot more concise than the article or the works the article references.
And this is basically the main point of my critique of 4NF and 5NF. Both are traditionally presented via an unexplained table that is supposed to be normalized. But it's not clear where this original structure comes from. Why are its own authors not aware of the (arguably, quite simple) concept of normalization?
It's like saying that in order to implement an algorithm you have to remove bugs from its original implementation; but where does that implementation come from?
The other side of this coin is that lots of real-world designs have denormalized representations that are often reasonably well engineered.
Because of that, if you, as a novice, look at a typical production schema while carrying this "thou shalt normalize" instruction, you'll be confused.
This is my big teaching pet peeve.
I find the bafflement expressed in this article, as well as in the one linked, extremely attractive. It made both a joy to read.
Were I to hazard a guess: might it be a consequence of the lack of disk space in those early decades, resulting in developers being cautious about defining new tables, and failing to realise that the duplication in their tragic designs would waste even more space?
> The other side of this coin is that lots of real-world design have a lot of denormalized representations that are often reasonably-well engineered.
Agreed, but as the OP comment stated they usually started out normalised and then pushed out denormalised representations for nice contiguous reads.
As a victim of maintaining a stack on top of an EAV schema once upon a time, I have great appreciation for contiguous reads.
I would maybe throw in date as a key too. Bad idea?
I tried to explain the real cause of overcounting in my "Modern Guide to SQL JOINs":
https://kb.databasedesignbook.com/posts/sql-joins/#understan...
Since I have a bad memory, I asked the AI to make me a mnemonic:
* Every
* Table
* Needs
* Full-keys (in its joins)
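The overcounting the mnemonic warns about is easy to demonstrate. A sketch in SQLite via Python, with made-up orders/shipment tables: joining a table on something less than its full key fans out rows and inflates aggregates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders   (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
CREATE TABLE shipment (order_id INTEGER, box TEXT);
INSERT INTO orders   VALUES (1, 'Ann', 100.0);
INSERT INTO shipment VALUES (1, 'box-a'), (1, 'box-b');
""")

# Joining before aggregating duplicates the order's total once per
# shipment row: the join key (order_id) is not shipment's full key.
wrong = con.execute("""
    SELECT SUM(o.total)
    FROM orders o
    JOIN shipment s ON s.order_id = o.id
""").fetchone()[0]
print(wrong)  # 200.0 -- the single 100.0 order counted twice

# Aggregating on the table whose key matches the grain gives the truth.
right = con.execute("SELECT SUM(total) FROM orders").fetchone()[0]
print(right)  # 100.0
```

The fix is always the same: aggregate at the grain of the table's full key (or pre-aggregate shipment to one row per order_id) before joining.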
Why do they hate us and not provide any illustrative real-life example without algebraic notation? Is that even possible?
I just want to see a CREATE TABLE statement, and some illustrative SELECT statements. The standard examples always give just the dataset, but dataset examples are often ambiguous.
> (in its joins)
Do you understand what "its" joins are? What even is "it" here?
I'm super frustrated. This paper is 14 years old.
I'll try reading it again.
https://www.oreilly.com/videos/c-j-dates-database/9781449336...
https://www.amazon.ca/Database-Design-Relational-Theory-Norm...
I think the main problem with how the 4NF and 5NF formal definitions were taught is that common sense (which is mostly "sufficient" to understand 1NF-3NF) starts to slip away, and you start needing the mathematical background that E. F. Codd (and others) had. Trying to avoid that is how those weird examples came up.
I largely agree with your practical approach, but try and keep the data consumers excited about the process; sell the "new use cases for the same data!" angle :)
Each process should take data from a golden source, not from a pre-aggregated or overly normalized non-authoritative source.
I find that JSON blobs up to about 1 megabyte are very reasonable in most scenarios. You are looking at maybe a millisecond of latency overhead in exchange for much denser I/O for complex objects. If the system is very write-intensive, I would cap the blobs around 10-100kb.
Considering the data transfer alone for 1 MB / 1 msec requires 8 Gbps, I have doubts. But for fun, I created a small table in Postgres 18 with an INT PK, and a few thousand JSONB blobs of various sizes, up to 1 MiB. Median timing was 4.7 msec for a simple point select, compared to 0.1 msec (blobs of 3 KiB), and 0.8 msec (blobs of 64 KiB). This was on a MBP M4 Pro, using Python with psycopg, so latency is quite low.
The TOAST/de-TOAST overhead is going to kill you for any blobs > 2 KiB (by default, adjustable). And for larger blobs, especially in cloud solutions where the disk is almost always attached over a network, the sheer number of pages you have to fetch (a 1 MiB blob will nominally consume 128 pages, modulo compression, row overhead, etc.) will add significant latency. All of this will also add pressure to actually useful pages that may be cached, so queries to more reasonable tables will be impacted as well.
RDBMS should not be used to store blobs; it's not a filesystem.