
121 points b-man | 2 comments
adamcharnock No.44026525
> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.

I fairly strongly disagree with this. Database identifiers have to serve a lot of purposes, and a natural key almost certainly isn't ideal. Off the top of my head, IDs can be used for:

- Joins, lookups, indexes. Here data type can matter regarding performance and resource use.

- Idempotency. Allowing a client to generate IDs can be a big help here (ie UUIDs)

- Sharing. You may want to share a URL to something that requires the key, but not expose domain data (a URL to a user’s profile image shouldn’t expose their national ID).
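The idempotency point can be sketched concretely. A minimal illustration using SQLite via Python's `sqlite3` (table and names are made up for the example): the client generates the UUID, so replaying the same request cannot create a duplicate row.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, amount INTEGER)")

def create_payment(client_id: str, amount: int) -> None:
    # Replaying the same client-generated UUID is a no-op, so retries are safe.
    conn.execute(
        "INSERT OR IGNORE INTO payments (id, amount) VALUES (?, ?)",
        (client_id, amount),
    )

pid = str(uuid.uuid4())        # generated client-side
create_payment(pid, 100)
create_payment(pid, 100)       # network retry: no duplicate row
count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(count)                   # 1
```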

There is not one solution that handles all of these well, but natural keys are one of the worst options.

Also, we all know that stakeholders will absolutely swear that there will never be two people with the same national ID. Oh, except when someone died, then we may reuse their ID. Oh, and sometimes this remote territory has duplicate IDs with the mainland. Oh, and for people born during that revolution 50 years ago, we just kinda had to make stuff up for them.

So ideally I'd put a unique index on the national ID column. But realistically it would be no unique constraint, and instead form validation plus a warning any time someone opened a screen for a user with a non-unique ID.

Then maybe a BIGINT for database ID, and a UUID4/7 for exposing to the world.
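That layout can be sketched in a few lines. A minimal sketch using SQLite via Python's `sqlite3` (column names are illustrative; SQLite's `INTEGER PRIMARY KEY` stands in for the BIGINT): internal surrogate key, exposed UUID, and a deliberately non-unique index on the national ID.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id          INTEGER PRIMARY KEY,   -- internal surrogate key (BIGINT-like)
        public_id   TEXT NOT NULL UNIQUE,  -- UUID exposed to the world
        national_id TEXT NOT NULL          -- domain identifier, NOT unique
    );
    -- Non-unique index: fast lookups, but duplicates are allowed.
    CREATE INDEX idx_users_national_id ON users (national_id);
""")

def insert_user(national_id: str) -> str:
    public_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO users (public_id, national_id) VALUES (?, ?)",
        (public_id, national_id),
    )
    return public_id

# Duplicate national IDs insert fine; the application warns instead of failing.
insert_user("AB-123")
insert_user("AB-123")
dupes = conn.execute(
    "SELECT national_id FROM users GROUP BY national_id HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [('AB-123',)]
```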

EDIT: Actually, the article is proposing a new principle. And so perhaps this could indeed be a viable one. And my comment above would describe situations where it is valid to break the principle. But I also suspect that this is so rarely a good idea that it shouldn’t be the default choice.

replies(12): >>44026711 #>>44026778 #>>44026822 #>>44026881 #>>44027090 #>>44027969 #>>44028025 #>>44029492 #>>44030454 #>>44030564 #>>44030578 #>>44031152 #
b-man No.44029492
> - Joins, lookups, indexes. Here data type can matter regarding performance and resource use.

I struggle to see a practical example.

> - Idempotency. Allowing a client to generate IDs can be a big help here (ie UUIDs)

Natural keys solve this.

> - Sharing. You may want to share a URL to something that requires the key, but not expose domain data (a URL to a user’s profile image shouldn’t expose their national ID).

Then you have another piece of data, which you relate to the natural key. Something like `exposed-name`.
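One reading of that idea, sketched with plain dicts (the names `exposed_name`, `register`, and the token scheme are all illustrative, not from the comment): the natural key stays the primary identifier, and an opaque token is related to it purely for external sharing.

```python
import secrets

profiles = {}       # natural key (national ID) -> profile data
exposed_name = {}   # opaque token -> natural key

def register(national_id: str, name: str) -> str:
    profiles[national_id] = {"name": name}
    token = secrets.token_urlsafe(8)   # reveals nothing about the national ID
    exposed_name[token] = national_id
    return token

token = register("AB-123", "Ada")
# URLs carry the token; the national ID itself never leaves the system.
assert profiles[exposed_name[token]]["name"] == "Ada"
```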

> There is not one solution that handles all of these well

Natural keys solve these issues.

> Also, we all know that stakeholders will absolutely swear that there will never be two people with the same national ID. Oh, except when someone died, then we may reuse their ID. Oh, and sometimes this remote territory has duplicate IDs with the mainland. Oh, and for people born during that revolution 50 years ago, we just kinda had to make stuff up for them.

If this happens, the designer made an error in the design, and should extend it to accommodate the facts that escaped them at design time.

> Actually, the article is proposing a new principle

I'm putting it in words, but such knowledge has been common in the database community for ages, afaict.

replies(4): >>44030260 #>>44030318 #>>44031004 #>>44035811 #
1. hobs No.44030318
> I struggle to see a practical example.

Memory, CPU, and eventually even storage: those are the main practical examples of where a key composed of something very small saves you space and thus time.

Say we compare a BIGINT key vs a VARCHAR(30). Depending on your key you might be talking about terabytes of additional data just to store it (1T rows @ BIGINT = 8 TB; 1T rows at 30 chars = 30 TB...). The data is also going to constantly shuffle (random inserts).
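The arithmetic behind those numbers, as a back-of-envelope check (ignoring row headers, padding, and length prefixes):

```python
ROWS = 10**12              # one trillion rows, as in the example above

BIGINT_BYTES = 8           # fixed-width 8-byte integer key
VARCHAR_BYTES = 30         # VARCHAR(30) fully used, 1 byte per char

bigint_tb = ROWS * BIGINT_BYTES / 10**12    # TB for the key column alone
varchar_tb = ROWS * VARCHAR_BYTES / 10**12
# Every secondary index and foreign key repeats the key, multiplying the gap.
print(bigint_tb, varchar_tb)   # 8.0 30.0
```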

If you define the PK as the natural key with no separate column, then you do comparisons on all the natural-key columns themselves; so instead of one 4- or 8-byte column comparison you get to do what, five character-column comparisons?

Having worked extensively in ETL: when a developer tells me "there's no duplication in this real-world process", what they mean is "there's no duplication in my mental model of this real-world process".

replies(1): >>44036480 #
2. b-man No.44036480
> Memory, CPU, and eventually even storage: those are the main practical examples of where a key composed of something very small saves you space and thus time.

> Say we compare a BIGINT key vs a VARCHAR(30). Depending on your key you might be talking about terabytes of additional data just to store it (1T rows @ BIGINT = 8 TB; 1T rows at 30 chars = 30 TB...). The data is also going to constantly shuffle (random inserts).

>> Joins, lookups, indexes

I don't see how what you brought up has anything to do with these.

But the main point is being missed here because of a physical vs logical conflation anyhow.