The principles of database design, or, the Truth is out there

(ebellani.github.io)

121 points b-man | 1 comments | 19 May 25 02:58 UTC | HN request time: 0.353s | source

Show context

adamcharnock ◴[19 May 25 04:34 UTC] No.44026525[source]▶

> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.

I fairly strongly disagree with this. Database identifiers have to serve a lot of purposes, and natural key almost certainly isn’t ideal. Off the top my head, IDs can be used for:

- Joins, lookups, indexes. Here data type can matter regarding performance and resource use.

- Idempotency. Allowing a client to generate IDs can be a big help here (ie UUIDs)

- Sharing. You may want to share a URL to something that requires the key, but not expose domain data (a URL to a user’s profile image shouldn’t expose their national ID).

There is not one solution that handles all of these well. But using natural keys is one of the least good options.

Also, we all know that stakeholders will absolutely swear that there will never be two people with the same national ID. Oh, except unless someone died, then we may reuse their ID. Oh, and sometimes this remote territory has duplicate IDs with the mainland. Oh, and for people born during that revolution 50 years ago, we just kinda had to make stuff up for them.

So ideally I’d put a unique index on the national ID column. But realistically, it would be no unique constraint and instead form validation + a warning on anytime someone opened a screen for a user with a non-unique ID.

Then maybe a BIGINT for database ID, and a UUID4/7 for exposing to the world.

EDIT: Actually, the article is proposing a new principle. And so perhaps this could indeed be a viable one. And my comment above would describe situations where it is valid to break the principle. But I also suspect that this is so rarely a good idea that it shouldn’t be the default choice.

replies(12): >>44026711 #>>44026778 #>>44026822 #>>44026881 #>>44027090 #>>44027969 #>>44028025 #>>44029492 #>>44030454 #>>44030564 #>>44030578 #>>44031152 #

jiggawatts ◴[19 May 25 05:46 UTC] No.44026778[source]▶

>>44026525 #

> will absolutely swear that there will never be two people with the same national ID...

I suddenly got flash-backs.

There are duplicate ISBN numbers for books, despite the system being carefully designed to avoid this.

There are ISBN numbers that have invalid checksums, but are valid ISBNs with the invalid number in the barcode and everything. Either the calculation was incorrectly done, or it was simply a mis-print.

The same book can have hundreds of ISBNs.

There is no sane way to determine if two such ISBNs are truly the same (page numbers and everything), or a reprint that has renumbered pages or even subtly different content with corrected typos, missing or added illustrations, etc...

Our federal government publishes a master database of "job id" numbers for each profession one could have. This is critical for legislation related to skilled migrants, collective workplace agreements, etc...

The states decided to add one digit to these numbers to further subdivide them. They did it differently, of course, and some didn't subdivide at all. Some of them have typos with "O" in place of "0" in a few places. Some states dropped the leading zeroes, and then added a suffix digit, which is fun.

On and on and on...

The real world is messy.

Any ID you don't generate yourself is fraught with risk. Even then there are issues such as what happens if the database is rolled back to a backup and then IDs are generated again for the missed data!

replies(1): >>44031396 #

1. lscharen ◴[19 May 25 16:15 UTC] No.44031396[source]▶

>>44026778 #

  The states decided to add one digit to these numbers to further subdivide
  them. They did it differently, of course, and some didn't subdivide at all.
  Some of them have typos with "O" in place of "0" in a few places. Some
  states dropped the leading zeroes, and then added a suffix digit, which is fun.

Any identifier that is comprised of digits but is not a number will have a hilariously large amount of mistakes and alterations like you describe.

In my own work I see this all the time with FIPS codes and parcel identifiers -- mostly because someone has round-tripped their data through Excel which will autocast the identifiers to numeric types.

Federal GEOIDs are particularly tough because the number of digits defines the GEOID type and there are valid types for 10, 11 and 12-digit numbers, so dropping a leading zero wreaks havoc on any automated processing.

There's a lot of ways to create the garbage in GIGO.

↑