For the others, it’s best to read up on Wikipedia[0]. I believe they all have their unique use-cases and tradeoffs.
E.g. including information about which node of the system generated an ID.
[0]: https://en.m.wikipedia.org/wiki/Universally_unique_identifie...
Sometimes it leads to improvements in the field, via rejection of the accumulated legacy crud, or simply by affording a new perspective. Most other times it's well-intentioned but low-effort noise.
I, personally, do it myself. This is how I learn.
The reason for this is simple: the documentation doesn't promise this property. Moreover, even if it did, the RFC for UUIDv7 doesn't promise this property. If you decide to depend on it, you're setting yourself up for a bad time when PostgreSQL decides to change their implementation strategy, or you move to a different database.
Further, the stated motivations for this, to slightly simplify testing code, are massively under-motivating. Saving a single line of code can hardly be said to be worth it, but even if it were, this is a problem far better solved by simply writing a function that will both generate the objects and sort them.
As a profession, I strongly feel we need to do a better job orienting ourselves to the reality that our code has a tendency to live for a long time, and we need to optimize not for "how quickly can I type it", but "what will this code cost over its lifetime".
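For what it's worth, the "generate and sort" helper can be a few lines; a hypothetical sketch in Python (the names are invented for illustration):

    from typing import Any, Callable

    # Hypothetical test helper: build n objects with whatever fixture factory the
    # test suite already has, and hand them back sorted by an explicit key,
    # instead of relying on the database's ID generation order.
    def make_sorted(n: int, factory: Callable[[], Any], key: Callable[[Any], Any]) -> list:
        return sorted((factory() for _ in range(n)), key=key)

    # e.g. records = make_sorted(5, make_test_record, key=lambda r: r.created_at)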
>It makes a repeated UUID between processes more likely, but there’s still 62 bits of randomness left to make use of, so collisions remain vastly unlikely.
Does it? Even though the number of random bits has decreased, the time interval to create such a duplicate has also decreased, namely to an interval of one nanosecond.
The blogpost is interesting and I appreciated learning the details of how the UUIDv7 implementation works.
ULID does specifically require generated IDs to be monotonically increasing as opposed to what the RFC for UUIDv7 states, which is a big deal IMHO.
Huh?
If the docs were to guarantee it, they guarantee it. Why are you looking for everything to be part of RFC UUIDv7?
Failure of logic.
From https://www.rfc-editor.org/rfc/rfc9562.html#name-uuid-versio...:
"UUIDv7 values are created by allocating a Unix timestamp in milliseconds in the most significant 48 bits and filling the remaining 74 bits, excluding the required version and variant bits, with random bits for each new UUIDv7 generated to provide uniqueness as per Section 6.9. Alternatively, implementations MAY fill the 74 bits, jointly, with a combination of the following subfields, in this order from the most significant bits to the least, to guarantee additional monotonicity within a millisecond:
1. An OPTIONAL sub-millisecond timestamp fraction (12 bits at maximum) as per Section 6.2 (Method 3).
2. An OPTIONAL carefully seeded counter as per Section 6.2 (Method 1 or 2).
3. Random data for each new UUIDv7 generated for any remaining space."
The referenced "Method 3" is: "Replace Leftmost Random Bits with Increased Clock Precision (Method 3):
For UUIDv7, which has millisecond timestamp precision, it is possible to use additional clock precision available on the system to substitute for up to 12 random bits immediately following the timestamp. This can provide values that are time ordered with sub-millisecond precision, using however many bits are appropriate in the implementation environment. With this method, the additional time precision bits MUST follow the timestamp as the next available bit in the rand_a field for UUIDv7."
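To make Method 3 concrete, here is a rough Python sketch of the technique the quote describes, filling rand_a with a sub-millisecond fraction. This is an illustration of the RFC text, not the Postgres implementation itself:

    import secrets
    import time
    import uuid

    def uuid7_method3() -> uuid.UUID:
        """Sketch of RFC 9562 "Method 3": a sub-millisecond fraction in rand_a."""
        ns = time.time_ns()
        unix_ts_ms = ns // 1_000_000                     # 48-bit millisecond timestamp
        sub_ms = ((ns % 1_000_000) * 4096) // 1_000_000  # 12-bit fraction, ~244 ns steps
        value = (unix_ts_ms & ((1 << 48) - 1)) << 80     # timestamp -> bits 127..80
        value |= 0x7 << 76                               # version 7 -> bits 79..76
        value |= sub_ms << 64                            # rand_a    -> bits 75..64
        value |= 0b10 << 62                              # variant   -> bits 63..62
        value |= secrets.randbits(62)                    # rand_b    -> bits 61..0
        return uuid.UUID(int=value)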
For example, imagine you have a router that sends network packets out at the start of each microsecond, synced to wall time.
Or the OS scheduler always wakes processes up on a millisecond timer tick or some polling loop.
Now, when those packets are received by a Postgres server and processed, the time to do that is probably fairly consistent - meaning that most records probably get created at roughly the same X nanoseconds past the microsecond.
We've been using an implementation of it in Go for many years in production without issues.
The "RFC for UUIDv7", RFC 9562, explicitly mentions monotonicity in §6.2 ("Monotonicity and Counters"):
Monotonicity (each subsequent value being greater than the last) is
the backbone of time-based sortable UUIDs. Normally, time-based UUIDs
from this document will be monotonic due to an embedded timestamp;
however, implementations can guarantee additional monotonicity via
the concepts covered in this section.
* https://datatracker.ietf.org/doc/html/rfc9562#name-monotonic...
In the UUIDv7 definition (§5.7) it explicitly mentions the technique that Postgres employs for rand_a:
rand_a:
12 bits of pseudorandom data to provide uniqueness as per
Section 6.9 and/or optional constructs to guarantee additional
monotonicity as per Section 6.2. Occupies bits 52 through 63
(octets 6-7).
* https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-vers...
Note: "optional constructs to guarantee additional monotonicity". Pg makes use of that option.
As per a sibling comment, it is not breaking the spec. The comment in the Pg code even cites the spec that says what to do (and is quoted in the post):
* Generate UUID version 7 per RFC 9562, with the given timestamp.
*
* UUID version 7 consists of a Unix timestamp in milliseconds (48
* bits) and 74 random bits, excluding the required version and
* variant bits. To ensure monotonicity in scenarios of high-
* frequency UUID generation, we employ the method "Replace
* Leftmost Random Bits with Increased Clock Precision (Method 3)",
* described in the RFC. […]
>optional constructs
So it is explicitly mentioned in the RFC as optional, and Pg doesn't state that they guarantee that option. The point still stands: depending on optional behavior is a recipe for failure when the option is no longer taken.
"extra_" or "distinct_" would be a more accurate prefix for UUIDv7.
UUIDv7 is actually quite a flexible standard due to these two underspecified fields. I'm glad Postgres took advantage of that!
[0]: https://mail.python.org/pipermail/python-dev/2017-December/1...
It's now explicitly documented to be true, and you can officially rely on it. From https://docs.python.org/3/library/stdtypes.html#dict:
> Changed in version 3.7: Dictionary order is guaranteed to be insertion order.
That link documents the Python language's semantics, not the behavior of any particular interpreter.
PEPs do not provide a spec for Python: they don't cover the initial base language from before the PEP process started, and not all subsequent language changes were made through PEPs. The closest thing Python has to a cross-implementation standard is the Python Language Reference for a particular version, treating as excluded anything explicitly noted as a CPython implementation detail. Dictionaries being insertion-ordered went from a CPython implementation detail in 3.6 to a guaranteed language feature in 3.7+.
Even if the behavior went away, UUIDs unlike serials can always be safely generated directly by the application just as well as they can be generated by the database.
Going straight for that would arguably be the "better" path, and allows mocking PRNG to get sequential IDs.
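A minimal sketch of that pattern, assuming a hypothetical application-level ID source that tests can swap out (the class names are made up):

    import itertools
    import uuid

    class IdSource:
        """Production: IDs generated in the application, not the database."""
        def new(self) -> uuid.UUID:
            return uuid.uuid4()

    class SequentialIdSource(IdSource):
        """Test double: deterministic, strictly increasing UUIDs, so assertions
        about ordering need no sorting tricks or database round-trips."""
        def __init__(self) -> None:
            self._counter = itertools.count(1)

        def new(self) -> uuid.UUID:
            return uuid.UUID(int=next(self._counter))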
If you spend time making code bulletproof so it can run for like 100 years, you will have wasted a lot of effort for nothing when someone comes along and wipes it clean and replaces it with new code in 2 years. Requirements change, code changes, it’s the nature of business.
Remember, any fool can build a bridge that stands; it takes an engineer to make a bridge that barely stands.
Implementations that need monotonicity beyond the resolution of a timestamp-- like when you allocate 30 UUIDs at one instant in a batch-- can optionally use those additional bits for that purpose.
> Implementations SHOULD employ the following methods for single-node UUID implementations that require batch UUID creation or are otherwise concerned about monotonicity with high-frequency UUID generation.
(And it goes on to recommend the obvious things you'd do: use a counter in those bits when assigning a batch; use more bits of time precision; etc.)
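A rough sketch of the counter flavor in Python (invented function name; the timestamp is fixed once for the whole batch, and a real implementation would also handle counter rollover):

    import secrets
    import time
    import uuid

    def uuid7_batch(count: int) -> list:
        """Counter-in-rand_a sketch for batch allocation (RFC 9562 Section 6.2 flavor)."""
        ms = time.time_ns() // 1_000_000
        start = secrets.randbits(11)                 # seeded counter, top bit clear for headroom
        batch = []
        for i in range(count):
            value = (ms & ((1 << 48) - 1)) << 80     # shared millisecond timestamp
            value |= 0x7 << 76                       # version 7
            value |= ((start + i) & 0xFFF) << 64     # 12-bit counter in rand_a
            value |= 0b10 << 62                      # variant
            value |= secrets.randbits(62)            # rand_b stays random
            batch.append(uuid.UUID(int=value))
        return batch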
The comment in PostgreSQL before the implementation makes it clear that they chose the third option for this in the RFC:
* variant bits. To ensure monotonicity in scenarios of high-
* frequency UUID generation, we employ the method "Replace
* Leftmost Random Bits with Increased Clock Precision (Method 3)",
* described in the RFC. ...
Sometimes the best you can do is recognize who you're working with today, know how they work, and be prepared for those people to be different in the future (or of a different mind) and for things to change regardless to expressed guarantees.
....unless we're talking about the laws of physics... ...that's different...
Sure, and here I am in a third company doing a cloud migration and changing our default DB from MySQL to SQL Server. The pain is real: the 2-year roadmap is now 5 years longer. All because some dude negotiated a discount on cloud services. And we still develop integrations that talk to systems written for DOS.
The use of rand_a for extra monotonicity is optional. The monotonicity itself is not optional.
§5.7 states:
Alternatively, implementations MAY fill the 74 bits,
jointly, with a combination of the following subfields,
in this order from the most significant bits to the least,
to guarantee additional monotonicity within a millisecond:
Guaranteeing additional monotonicity means that there is already a 'base' level of monotonicity, and there are provisions for even more ("additional") levels of it. This 'base level' is why §6.2 states:
Monotonicity (each subsequent value being greater than the last) is
the backbone of time-based sortable UUIDs. Normally, time-based UUIDs
from this document will be monotonic due to an embedded timestamp;
however, implementations can guarantee additional monotonicity via
the concepts covered in this section.
"Backbone of time-based sortable UUIDs"; "additional monotonicity". Additional: adding to what's already there.According to [1] due to the birthday paradox, the probability of a collision in any given nanosecond would be 3E−17 which of course sounds pretty low
But there are 3.154e+16 nanoseconds in a year - and if you get out your high-precision calculator, it'll tell you there's a 61.41% chance of a collision in a year.
Of course you might very well say "Who needs 16 UUIDs per nanosecond anyway?"
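If you want to reproduce the arithmetic, here it is with the rounded 3E−17 per-nanosecond figure quoted above (which assumes roughly 16 IDs per nanosecond and 62 random bits):

    import math

    p_per_ns = 3e-17                  # roughly (16 * 15 / 2) / 2**62
    ns_per_year = 3.154e16
    # 1 - (1 - p)**n, computed via exp so the tiny p isn't lost to float rounding
    p_year = 1 - math.exp(-p_per_ns * ns_per_year)
    print(f"{p_year:.2%}")            # roughly 61%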
What could be useful here is if Postgres provided a way to determine the latest frozen UUID. This could be a few ms behind the last committed UUID, but should guarantee that no new rows will land before the frozen UUID. Then we can use a single cursor to track what we've previously seen.
And they might be quite non-uniform. If the scheduler tick and the nanosecond clock are synchronous, you could end up with a few thousand popular values instead of a billion.
It's not a real concern today, and probably won't be a real concern in 10 years, but it's not so far removed from possibility that no one ever has to think about it.
I have used this guarantee for events generated on clients. It really simplifies a lot of reasoning.
The biggest advantage is that it is hex. Haven't yet met a database system that doesn't have functions for substr and from_hex etc, meaning you can extract the time part using vanilla sql.
ULID and others that use custom variants of base32 or base62 or whatever are just about impossible to wrangle with normal tooling.
Your future selves will thank you for being able to manipulate it in whatever database you use in the future to analyse old logs or import whatever data you generate today.
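For illustration (in Python rather than SQL, and with a made-up UUIDv7 value): the first 12 hex characters are the 48-bit Unix timestamp in milliseconds, so extracting it really is just a substring plus a hex-to-integer conversion.

    from datetime import datetime, timezone

    u = "0192d3c0-5e1a-7cc3-9f4e-0123456789ab"   # hypothetical UUIDv7
    ms = int(u.replace("-", "")[:12], 16)        # substr + from_hex, in SQL terms
    print(datetime.fromtimestamp(ms / 1000, tz=timezone.utc))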
Let's say you need an opaque unique handle, and a timestamp, and a monotonically increasing row ID. Common enough. Do they have to be the same thing? Should they be the same thing? Because to me that sounds like three things: an autoincrementing primary key, a UUIDv4, and a nanosecond timestamp.
Is it always ok that the 'opaque' unique ID isn't opaque at all, that it's carrying around a timestamp? Will that allow correlating things which maybe you didn't want hostiles to correlate? Are you 100% sure that you'll never want, or need, to re-timestamp data without changing its global ID?
Maybe you do need these things unnormalized and conflated. Do you though? At least ask the question.
I want to share a django library I wrote a little while back which allows for prefixed identity fields, in the same style as Stripe's ID fields (obj_XXXXXXXXX):
https://github.com/jleclanche/django-prefixed-identity-field...
This gives a PrefixedIdentityField(prefix="obj_"), which is backed by uuid7 and base58. In the database, the IDs are stored as UUIDs, which makes them an efficient field -- they are transformed into prefixed IDs when coming out of the database, which makes them perfect for APIs.
(I know, no documentation .. if someone wants to use this, feel free to file issues to ask questions, I'd love to help)
Also, to compare: the ULID spec's technique for monotonicity is to take a single random value and then start incrementing its lowest bits, trading random entropy for direct "nearness", one after another. The rand_a approach, by contrast, effectively uses the most significant bits, but keeps more random entropy.
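A simplified sketch of the ULID-style behavior for contrast (an approximation, not the ULID reference implementation):

    import secrets
    import time

    _last_ms = -1
    _last_rand = 0

    def ulid_style_random_part():
        """Within one millisecond, reuse the previous 80-bit random value plus one,
        trading entropy for strict ordering; otherwise draw fresh randomness."""
        global _last_ms, _last_rand
        ms = time.time_ns() // 1_000_000
        if ms == _last_ms:
            _last_rand += 1          # increment the least significant bits
        else:
            _last_ms, _last_rand = ms, secrets.randbits(80)
        return ms, _last_rand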
I did work on a project using ULIDs in SQL Server. They were stored in uniqueidentifier fields with a complex byte swap from ULID to fake-UUID to get better storage/indexing out of SQL Server [1]. There was an attempt to use SQL functions to display/search the ULID form directly in the database, but it was never as bug-free as the C# byte-order code, so doing it directly in the DB was definitely not recommended; if a "report" was missing, it was supposed to become part of the application (which was already almost nothing but a bloated "reporting" tool) or a related "configuration" application. That felt more like a feature than a bug, because it kept some meddling and drama out of the DB. I also see the arguments for why, in some other types of applications, this makes debugging a lot harder; those arguments make sense, and it is definitely a trade-off to consider.
[1] The rabbit hole into SQL Server's ancient weird UUID sort order: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...
My implementation supports graceful degradation between nanosecond-scale resolution, microsecond, and millisecond, by using 12 bits for each and filling up the leftmost bits of rand_a and rand_b. Not all environments provide high-resolution system clocks with no drift, so it is important to maintain monotonicity when generating IDs with a low-res timestamp as input. You still want the bits that would've held the nanosecond value to be monotonic.
Neither of the existing uuid_utils and uuid7 python libs that can generate UUID7s support this monotonicity property.
Am planning on using this for ArchiveBox append-only "snapshot" records, which are intrinsically linked to time, so it's a good use-case imo.
There's another great resource here that I think is one of the best explainers of UUIDv7: https://antonz.org/uuidv7/
Whatever you do, don't implement the cursed 36-bit whole-second based time UUIDv7 variant that you occasionally see on StackOverflow / blog posts, stick to 48!
Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.
When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."
Please don't fulminate. Please don't sneer, including at the rest of the community.
Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
“Normally, I am at home because I do not have a reason to go out; however, sometimes I am at home because I am sleeping.”
Notice how this statement does not actually mean that I am always at home.
Or to put it another way: OP is suggesting you don't depend on it being properly monotonic, because the default is that it is only partially monotonic.
It's explicitly partially monotonic.
Or as other people would call it, "not monotonic".
People are talking past each other based on their use of the word "monotonic".
A millisecond divided by 4096 is not a nanosecond. It's about 250 nanoseconds.
Collection<UUID> generate(final int count);
I also have an interface that I can back with an RNG that generates auto-incrementing values and sorts for testing, so I have the experience of ints, but for production, my non-timestamp component is random.
"Might generate two IDs in the same millisecond" is not a very exotic occurrence. It makes a big difference whether the rest is guaranteed or not.
> And PostgreSQL used one of these recommended approaches and documented it.
Well that's the center of the issue, right? OP's interpretation was that PostgreSQL did not document such, and so it shouldn't be relied upon. If it is a documented promise, then things are fine.
But is it actually in the documentation? A source code comment saying it uses a certain method isn't a promise that it will always use that method.
If you need to retrieve rows by time/order, you use the timestamp. If you need a row by ID, you use the ID.
The use-cases where you actually need to know which row was inserted first seem exceedingly rare (mostly financial / auditing applications), and even then can probably be handled with separate transactions (as you touch on).
And the correct answer is…we don't know. We have a commit that landed and the explanation of the commit; we don't have the documentation for the corresponding release of Postgres, because…it doesn't exist yet. Because monotonicity is an important feature for UUIDv7, it would be very odd if Postgres used an implementation that took the extra effort to use a nanosecond-level time value as the high-order portion of the random part of the UUID, instead of the minimum millisecond-level time, and then not document that. Still, any assumption about what will be the documented, reliable-going-forward advertised feature is speculative until the documentation exists and is finalized.
OTOH, its perfectly fine to talk about what the implementation allows now, because that kind of thing is important to the decision about what should be documented and committed to going forward.
So, no, the “time part” of the postgres implementation is, in part, one of the options discussed in the spec, not merely the “time part” required in the spec.
This comment thread is about the guaranteed level of monotonicity. Yes, those bits exist. But you can't depend on them from something that only promises "UUIDv7". You need an additional promise that it's configured that way and actually using those bits to maintain monotonicity.
From the RFC:
Monotonicity (each subsequent value being greater than the last) is
the backbone of time-based sortable UUIDs. Normally, time-based UUIDs
from this document will be monotonic due to an embedded timestamp
"time-based UUIDs from this document will be monotonic". "will be monotonic".I'm not sure how much more explicit this can be made.
The intent of UUIDv7 is monotonicity. If an implementation is not monotonic then it's a bug in the implementation.
I doubt that the quality of the hash function is the real issue. The problem with MD5 and SHA1 is that it's easy (for MD5) and technically possible (for SHA1) to generate collisions. That makes them broken for enforcing message integrity. But a UUID is not an integrity check. Both MD5 and SHA1 are still very good as non-cryptographic hash functions. While a hash-based UUID provides obfuscation, it isn't really a security mechanism.
Even the existence of UUIDv5 feels like a knee-jerk reaction from when MD5 was "bad" but SHA1 was still "good". No hash function will protect you against de-obfuscation of low-entropy inputs. I can feed your social security number through SHA3-512 but it's not going to make it any less guessable than if I fed it through MD5.
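To make the low-entropy point concrete: a 9-digit SSN-like input has only 10^9 possible values, so recovering it from a UUIDv5 (or any other hash of it) is just a loop. The namespace and formatting below are assumptions for the example.

    import uuid
    from typing import Optional

    NS = uuid.NAMESPACE_URL                      # assumed namespace for the example

    def recover_ssn(target: uuid.UUID) -> Optional[str]:
        """Brute-force a 9-digit SSN-like value from its UUIDv5: at most 10^9
        candidates, slow in pure Python but well within reach of one machine."""
        for n in range(1_000_000_000):
            digits = f"{n:09d}"
            candidate = f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"
            if uuid.uuid5(NS, candidate) == target:
                return candidate
        return None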
Moreover, a UUID only has 122 bits of usable space. Even if we defined a new SHA2- or SHA3-based UUID version, it's still going to have to truncate the hash output to less than half of its full size. This significantly alters the security properties of the hash function, though I'm not sure if much cryptanalysis has been done on the shorter forms to see if they're more practically breakable yet.
There is one area where the collision resistance of the hash function could be a concern, though. If all of the inputs to the hash are under the control of a potential attacker, then maliciously constructed data could produce the same UUID. I still wouldn't think this would be a major issue, since most databases will fail to insert a duplicate key, but it might allow for various denial of service attacks. This still feels like quite a niche risk, though, and very circumstance-dependent.
If anyone is interested, here is the package: https://github.com/cmackenzie1/go-uuid. It also includes a CLI similar to that of `uuidgen`, but supports version 7.
I also worked on applications that used ULIDs in string form only, in NoSQL documents, string-based cache keys, and string indexes, just fine. I didn't try `char(26)` columns in a DB to see how well, for instance, SQL Server clustered and indexed them, but I've seen SQL Server do just fine with someone's wild idea of clustering a `varchar(MAX)` field, and I'm sure it can probably handle that just fine on the technical side.
It's nice that you can easily convert a ULID to a 128-bit key, but you certainly don't have to. (Also, people really do like the ugly dashed hex form of UUIDs sometimes, and I've seen people store those directly as strings in places where you'd expect they would just store the 128-bit value; it goes both ways here, I suppose.)
Similarly, UUIDv4 is also prohibited in many contexts because people using weak entropy sources has been a recurring problem in real systems. It isn’t a theoretical issue, it has actually happened repeatedly. Decentralized generation of UUIDv4 is not trusted because humans struggle to implement it correctly, causing collisions where none are expected.
There are also contexts where probabilistic collision resistance is disallowed because collision probabilities, while low, are high enough to be theoretically plausible. Most people aren’t working on systems this large yet.
Ironically, there are many reasonable ways to construct secure 128-bit identity values, but the standards don't define one. Some flavor of deterministic generation plus encryption is not uncommon, but it is also non-standard.
That said, many companies unavoidably have a mix of standard and non-standard UUIDs internally. To mitigate collisions, they have to transform those UUIDs into something else UUID-like, at which point it is pretty much guaranteed to be non-standard. Not ideal but that is the world we live in.