The principles of database design, or, the Truth is out there

(ebellani.github.io)

121 points b-man | 3 comments | 19 May 25 02:58 UTC | HN request time: 0s | source

Show context

adamcharnock ◴[19 May 25 04:34 UTC] No.44026525[source]▶

> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.

I fairly strongly disagree with this. Database identifiers have to serve a lot of purposes, and natural key almost certainly isn’t ideal. Off the top my head, IDs can be used for:

- Joins, lookups, indexes. Here data type can matter regarding performance and resource use.

- Idempotency. Allowing a client to generate IDs can be a big help here (ie UUIDs)

- Sharing. You may want to share a URL to something that requires the key, but not expose domain data (a URL to a user’s profile image shouldn’t expose their national ID).

There is not one solution that handles all of these well. But using natural keys is one of the least good options.

Also, we all know that stakeholders will absolutely swear that there will never be two people with the same national ID. Oh, except unless someone died, then we may reuse their ID. Oh, and sometimes this remote territory has duplicate IDs with the mainland. Oh, and for people born during that revolution 50 years ago, we just kinda had to make stuff up for them.

So ideally I’d put a unique index on the national ID column. But realistically, it would be no unique constraint and instead form validation + a warning on anytime someone opened a screen for a user with a non-unique ID.

Then maybe a BIGINT for database ID, and a UUID4/7 for exposing to the world.

EDIT: Actually, the article is proposing a new principle. And so perhaps this could indeed be a viable one. And my comment above would describe situations where it is valid to break the principle. But I also suspect that this is so rarely a good idea that it shouldn’t be the default choice.

replies(12): >>44026711 #>>44026778 #>>44026822 #>>44026881 #>>44027090 #>>44027969 #>>44028025 #>>44029492 #>>44030454 #>>44030564 #>>44030578 #>>44031152 #

1. mamcx ◴[19 May 25 15:54 UTC] No.44031152[source]▶

>>44026525 #

We can debate the first usage (small type for storage/cpu optimization) but the other 2 are actually good examples of natural keys!

Note the quotes:

> Every software project needs to represent the reality of the business he is embedded in...

> Such database needs to be designed to properly reflect reality. This can’t be automated, since the semantics of the situation need to be encoded in a way that can be processed by a computer.

Having the business need to share an ID, not give internal information, make some data opaque, use a 'arbitary' id to share as 'textual pointer' to the actual data, etc are valid natural keys.

What is wrong is just add a surrogate key just because, when the fact is that a natural key (to the domain, the business, the requirements) is the actual answer.

I discover this fact when doing sync. The moment I get rid of UUIDs and similar and instead keep data-as-natural things become MORE clear:

ulid: 01H6GZ8YQJ code: A1 name: A

ulid: 01H6GZ8YQK code: A1 name: A

ulid: 01H6GZ8YQJL code: A1 name: A

// All same, obscured by wrong key

code: A1 name: A

(ie: The actual data and what I need to model a sync is NOT the same. Is similar to how in 'git' your code is one thing that is clearly distinct to the internal git structures that encode the repo)

replies(1): >>44051897 #

2. HelloNurse ◴[21 May 25 14:34 UTC] No.44051897[source]▶

>>44031152 (TP) #

I'd expect both code and name to be subject to deliberate changes (e.g. name "A" is extended to "Alpha" after the name column is extended from 1 to 5 characters) or "technical" edits (e.g. due to out of order data entry, code "A1" should become "A2" in order to insert the proper A1 record). Anything the user sees is a commitment.

replies(1): >>44054900 #

3. mamcx ◴[21 May 25 18:55 UTC] No.44054900[source]▶

>>44051897 #

Sure. Data is not static. But is foolish to pretend is.

Adding surrogate keys not changed that, only add other column that COULD change.

↑