The principles of database design, or, the Truth is out there

(ebellani.github.io)

121 points b-man | 1 comments | 19 May 25 02:58 UTC | HN request time: 0s | source

Show context

adamcharnock ◴[19 May 25 04:34 UTC] No.44026525[source]▶

> A relation should be identified by a natural key that reflects the entity’s essential, domain-defined identity — not by arbitrary or surrogate values.

I fairly strongly disagree with this. Database identifiers have to serve a lot of purposes, and natural key almost certainly isn’t ideal. Off the top my head, IDs can be used for:

- Joins, lookups, indexes. Here data type can matter regarding performance and resource use.

- Idempotency. Allowing a client to generate IDs can be a big help here (ie UUIDs)

- Sharing. You may want to share a URL to something that requires the key, but not expose domain data (a URL to a user’s profile image shouldn’t expose their national ID).

There is not one solution that handles all of these well. But using natural keys is one of the least good options.

Also, we all know that stakeholders will absolutely swear that there will never be two people with the same national ID. Oh, except unless someone died, then we may reuse their ID. Oh, and sometimes this remote territory has duplicate IDs with the mainland. Oh, and for people born during that revolution 50 years ago, we just kinda had to make stuff up for them.

So ideally I’d put a unique index on the national ID column. But realistically, it would be no unique constraint and instead form validation + a warning on anytime someone opened a screen for a user with a non-unique ID.

Then maybe a BIGINT for database ID, and a UUID4/7 for exposing to the world.

EDIT: Actually, the article is proposing a new principle. And so perhaps this could indeed be a viable one. And my comment above would describe situations where it is valid to break the principle. But I also suspect that this is so rarely a good idea that it shouldn’t be the default choice.

replies(12): >>44026711 #>>44026778 #>>44026822 #>>44026881 #>>44027090 #>>44027969 #>>44028025 #>>44029492 #>>44030454 #>>44030564 #>>44030578 #>>44031152 #

Jarwain ◴[19 May 25 05:28 UTC] No.44026711[source]▶

>>44026525 #

Why have both a database ID and UUIDv7, versus just a UUIDv7?

replies(2): >>44026865 #>>44026928 #

sroussey ◴[19 May 25 06:04 UTC] No.44026865[source]▶

>>44026711 #

There is a security principle to not expose real identifiers to the outside world. It makes a crack in your system easier to open.

replies(1): >>44027174 #

Jarwain ◴[19 May 25 07:05 UTC] No.44027174[source]▶

>>44026865 #

Idk that reeks of security through obscurity to me. Your authorization/permission scheme has got to be fubar'd if you're relying on obscuring IDs to prevent someone from accessing a resource they shouldn't.

I'm sure I'm missing something obvious, but I'm not sure what other threat vectors there are, assuming it's in conjunction with other layers of security like a solid authorization/access control scheme.

I guess I'm not the biggest fan for a few reasons. I'd rather try and design a system such that it's secure even in the case of database leak/compromise or some other form of high transparency. I don't want to encourage a culture of getting used to obscurity and potentially depending on it instead of making sure that transparency can't be abused.

Also, it just feels wasteful. If you have two distinct IDs for a given resource, what are you building your foreign keys against? If you build it against the hidden one, and want to query the foreign table based on user input, you've gotta either do a join or do a query to Get the hidden key just to do another query. It just feels wasteful.

EDIT: apparently RFC 4122 even says not to assume UUIDs are hard to guess and shouldn't be used for security capabilities. So if it shouldn't be depended on for security, why add all this complexity to keep it secure?

replies(1): >>44027253 #

jandrewrogers ◴[19 May 25 07:22 UTC] No.44027253[source]▶

>>44027174 #

The point you may be missing is that the key itself contains information about records in the database that you don’t have access to. There are many famous examples in literature (e.g. the German Tank Problem [0]) of useful attacks on known unique serials to infer hidden information without access. In a database context, the keys for the records you are authorized to see can tell you much about the parts of the database to which you have no access.

Strong randomization of record serials mitigates these attacks.

[0] https://en.wikipedia.org/wiki/German_tank_problem

replies(3): >>44027310 #>>44030215 #>>44051982 #

1. sgarland ◴[19 May 25 14:21 UTC] No.44030215[source]▶

>>44027253 #

The only thing UUIDv7 exposes is its creation time, which isn't tremendously useful or secret information.

↑