The principles of database design, or, the Truth is out there

(ebellani.github.io)

jandrewrogers ◴[19 May 25 05:53 UTC] No.44026812[source]▶

This takes an overly simple view of what domains can look like. There are data models that necessarily violate these principles, and they aren’t all that rare.

Some examples:

> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity

In some domains there is no natural key because the identity is literally an inference problem and relations are probabilistic. The objective of the data model is to aggregate enough records to discover and attribute natural keys with some level of confidence. Entity resolution data models are a common class with this property.
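To make the entity resolution point concrete, here is a minimal sketch of identity as an inference problem. Everything in it is an assumption for illustration: the record fields (`name`, `email`), the averaged string-similarity score, and the `0.8` threshold are all hypothetical, and real systems use far more careful matching. The point is only that the entity id is an output of the model, not an input.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: dict, b: dict) -> float:
    """Crude match score in [0, 1]: mean of name and email string similarity."""
    name = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    email = SequenceMatcher(None, a["email"].lower(), b["email"].lower()).ratio()
    return (name + email) / 2


def resolve(records: list[dict], threshold: float = 0.8) -> list[int]:
    """Return an inferred entity id per record by merging likely matches (union-find)."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)  # merge the two clusters
    return [find(i) for i in range(len(records))]


# Made-up records: no field is a trustworthy key on its own.
records = [
    {"name": "Jon Smith", "email": "jon.smith@example.com"},
    {"name": "John Smith", "email": "jon.smith@example.com"},
    {"name": "Maria Garcia", "email": "mgarcia@example.org"},
]
entity_ids = resolve(records)
```

The first two records end up with the same inferred id and the third gets its own. Changing the threshold changes the "keys", which is exactly why they cannot be declared up front.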

> All information in the database is represented explicitly and in exactly one way

Some data models have famously dual natures. Cartographic data models, for example, must be represented both as graph models (for routing and reachability relationships) and as geometric models (for spatial relationships). The “one true representation” has been a perennial argument in mapping for my entire life and both sides are demonstrably correct.
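A toy sketch of that dual nature, assuming hypothetical road segments described by two node labels and a polyline. The segment data and both helper functions are invented for illustration. The same facts answer spatial questions through the geometry and topological questions through a graph derived from it, and neither view can stand in for the other.

```python
from collections import defaultdict, deque
from math import hypot

# Hypothetical road segments: (from_node, to_node, polyline of (x, y) points).
segments = [
    ("A", "B", [(0, 0), (3, 4)]),
    ("B", "C", [(3, 4), (3, 10)]),
    ("D", "E", [(20, 20), (21, 20)]),
]


def length(polyline):
    """Geometric view: total Euclidean length of a polyline."""
    return sum(
        hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(polyline, polyline[1:])
    )


# Graph view: an undirected adjacency map built from the same segments.
graph = defaultdict(set)
for src, dst, _ in segments:
    graph[src].add(dst)
    graph[dst].add(src)


def reachable(start):
    """Graph view: every node reachable from `start` (breadth-first search)."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


first_length = length(segments[0][2])
from_a = reachable("A")
```

Here the adjacency map restates information already implied by the segment endpoints, so the data is represented in two ways at once.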

> Every base relation should be in its highest normal form (3, 5 or 6th normal form).

This is one of those things that sounds attractive only because it ignores a prerequisite: there can be no ambiguity about domain boundaries or semantics, and that never holds in practice. I bought into this idea too when I was a young and naive data modeler. Trying to stamp out these ambiguities adds an unbounded number of data model epicycles, which cost a lot in complexity and performance. At some point, strict normalization is not worth what it costs.

In almost all cases, it is far more important that the data model be efficient to work with than that it be the abstract platonic ideal of a domain model. All of these principles have to work on real hardware in real operational environments with all of the messy limitations that implies.

replies(2): >>44028007 #>>44036521 #

1. Akronymus ◴[19 May 25 09:40 UTC] No.44028007[source]▶

>>44026812 #

I strive to keep most things in at least the 3rd NF. Except stuff like addresses or names, those I intentionally don't push to be normalized, as they make up a single datum anyways, and normalizing creates more problems than it solves IME.

replies(2): >>44028661 #>>44030322 #

2. friendzis ◴[19 May 25 11:31 UTC] No.44028661[source]▶

>>44028007 (TP) #

Addresses and names are nice, well-known examples of cross-domain data. It's not that attempts at normalizing these structural datums create problems per se; rather, there is no single true normalization, so a wrong normalization starts causing problems.

replies(1): >>44028735 #

3. Akronymus ◴[19 May 25 11:39 UTC] No.44028735[source]▶

>>44028661 #

Yeah, normalizing inherently introduces constraints on the data. For example, normalizing to first and last name implies that nobody has middle names, and that everybody has a first and a last name in the first place.

Also, first and last names depend on the culture. Oh, and people can have more than one name (as in distinct names, rather than multi-part names). Some cultures use different names with different social circles.

Easier to just let them put their preferred name into a freeform text field.
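A small sketch of that trade-off, with invented sample names. `naive_split` and `Person` are hypothetical helpers, not anything from a real library. Forcing a first/last shape silently drops or duplicates parts of a name, while a freeform field stores what the person typed.

```python
from dataclasses import dataclass


def naive_split(full_name: str) -> tuple[str, str]:
    """Force a name into (first, last); anything in between is silently dropped."""
    parts = full_name.split()
    return parts[0], parts[-1]


@dataclass(frozen=True)
class Person:
    # Stored as entered, apart from trimming surrounding whitespace.
    preferred_name: str

    def __post_init__(self):
        cleaned = self.preferred_name.strip()
        if not cleaned:
            raise ValueError("preferred_name must not be empty")
        # Frozen dataclass: bypass the normal setter to store the trimmed value.
        object.__setattr__(self, "preferred_name", cleaned)


lossy = naive_split("Mary Ann van der Berg")   # middle parts are lost
mononym = naive_split("Sukarno")               # one name is forced into two slots
person = Person("  Mary Ann van der Berg ")    # freeform field keeps everything
```

The only constraint the freeform version imposes is "not empty", which is about as much as can be assumed across cultures.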

replies(1): >>44052103 #

4. sgarland ◴[19 May 25 14:33 UTC] No.44030322[source]▶

>>44028007 (TP) #

> Except stuff like addresses or names, those I intentionally don't push to be normalized

IMO, it depends. While the idea of normalizing names is amusing, I probably wouldn't ever push for that either. For addresses, though, I would absolutely normalize everything beyond the street address. If nothing else, it enables you to quite cheaply (from a storage / memory perspective) add a lot of analytics on users that might be useful, but would be expensive to store with every record - things like city population, postal code median income, etc.

And in cases where you need to have your own company's address displayed per-user (disclosures for financial products, for example), it's absolutely a good idea. A full address, especially a business one that might have a Suite, Floor, etc., can easily be 60-80 chars, which adds up over hundreds of millions of rows.
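One way to picture "normalize everything beyond the street address", as a sketch using Python's built-in `sqlite3`. The table names, column names, and all values are made up for illustration. The street address stays on the user row, while city-level and postal-code-level facts live once in a lookup table and are joined in when needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE postal_area (              -- hypothetical lookup table
    postal_code     TEXT PRIMARY KEY,
    city            TEXT NOT NULL,
    city_population INTEGER,            -- analytics stored once per area
    median_income   INTEGER
);
CREATE TABLE app_user (                 -- hypothetical user table
    id             INTEGER PRIMARY KEY,
    street_address TEXT NOT NULL,       -- kept as a single freeform datum
    postal_code    TEXT NOT NULL REFERENCES postal_area(postal_code)
);
""")

# Invented sample data.
conn.execute(
    "INSERT INTO postal_area VALUES (?, ?, ?, ?)",
    ("00001", "Exampleville", 50000, 42000),
)
conn.executemany(
    "INSERT INTO app_user (street_address, postal_code) VALUES (?, ?)",
    [("1 Main St", "00001"), ("22 Oak Ave, Suite 300", "00001")],
)

# Area-level attributes are joined in rather than repeated on every user row.
rows = conn.execute("""
    SELECT u.street_address, p.city, p.median_income
    FROM app_user AS u
    JOIN postal_area AS p USING (postal_code)
    ORDER BY u.id
""").fetchall()
```

Each user row carries only a short postal code, so adding another area-level statistic costs one column on the small table, not one per user.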

replies(1): >>44030379 #

5. Akronymus ◴[19 May 25 14:37 UTC] No.44030379[source]▶

>>44030322 #

> I would absolutely normalize everything beyond the street address. If nothing else,

I'd personally rather store those datums in additional fields/tables, extracted from the address.

6. HelloNurse ◴[21 May 25 14:51 UTC] No.44052103{3}[source]▶

>>44028735 #

Very importantly, the value can be adapted to its purpose. A person's name can be very different depending on whether it is to be used for credits in a publication, for a formal wedding invitation and guest list, or for shipping to a specific address.
