For the others, it’s best to read up on Wikipedia[0]. I believe they all have their unique use-cases and tradeoffs.
E.g. including information about which node of the system generated an ID.
[0]: https://en.m.wikipedia.org/wiki/Universally_unique_identifie...
Sometimes it leads to improvements in the field, via rejection of the accumulated legacy crud, or simply by affording a new perspective. Most other times it's well-intentioned but low-effort noise.
I, personally, do it myself. This is how I learn.
The reason for this is simple: the documentation doesn't promise this property. Moreover, even if it did, the RFC for UUIDv7 doesn't promise this property. If you decide to depend on it, you're setting yourself up for a bad time when PostgreSQL decides to change their implementation strategy, or you move to a different database.
Further, the stated motivations for this, to slightly simplify testing code, are massively under-motivating. Saving a single line of code can hardly be said to be worth it, but even if it were, this is a problem far better solved by simply writing a function that will both generate the objects and sort them.
As a profession, I strongly feel we need to do a better job orienting ourselves to the reality that our code has a tendency to live for a long time, and we need to optimize not for "how quickly can I type it", but "what will this code cost over its lifetime".
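For what it's worth, the "generate and sort" helper can be a few lines; a hypothetical sketch in Python (the names are invented for illustration):

    from typing import Any, Callable

    # Hypothetical test helper: build n objects with whatever fixture factory the
    # test suite already has, and hand them back sorted by an explicit key,
    # instead of relying on the database's ID generation order.
    def make_sorted(n: int, factory: Callable[[], Any], key: Callable[[Any], Any]) -> list:
        return sorted((factory() for _ in range(n)), key=key)

    # e.g. records = make_sorted(5, make_test_record, key=lambda r: r.created_at)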
>It makes a repeated UUID between processes more likely, but there’s still 62 bits of randomness left to make use of, so collisions remain vastly unlikely.
Does it? Even though the number of random bits has decreased, the time interval to create such a duplicate has also decreased, namely to an interval of one nanosecond.
The blogpost is interesting and I appreciated learning the details of how the UUIDv7 implementation works.
ULID does specifically require generated IDs to be monotonically increasing as opposed to what the RFC for UUIDv7 states, which is a big deal IMHO.
Huh?
If the docs were to guarantee it, they guarantee it. Why are you looking for everything to be part of RFC UUIDv7?
Failure of logic.
From https://www.rfc-editor.org/rfc/rfc9562.html#name-uuid-versio...:
"UUIDv7 values are created by allocating a Unix timestamp in milliseconds in the most significant 48 bits and filling the remaining 74 bits, excluding the required version and variant bits, with random bits for each new UUIDv7 generated to provide uniqueness as per Section 6.9. Alternatively, implementations MAY fill the 74 bits, jointly, with a combination of the following subfields, in this order from the most significant bits to the least, to guarantee additional monotonicity within a millisecond:
1. An OPTIONAL sub-millisecond timestamp fraction (12 bits at maximum) as per Section 6.2 (Method 3).
2. An OPTIONAL carefully seeded counter as per Section 6.2 (Method 1 or 2).
3. Random data for each new UUIDv7 generated for any remaining space."
The referenced "Method 3" is: "Replace Leftmost Random Bits with Increased Clock Precision (Method 3):
For UUIDv7, which has millisecond timestamp precision, it is possible to use additional clock precision available on the system to substitute for up to 12 random bits immediately following the timestamp. This can provide values that are time ordered with sub-millisecond precision, using however many bits are appropriate in the implementation environment. With this method, the additional time precision bits MUST follow the timestamp as the next available bit in the rand_a field for UUIDv7."
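To make Method 3 concrete, here is a rough Python sketch of the technique the quote describes, filling rand_a with a sub-millisecond fraction. This is an illustration of the RFC text, not the Postgres implementation itself:

    import secrets
    import time
    import uuid

    def uuid7_method3() -> uuid.UUID:
        """Sketch of RFC 9562 "Method 3": a sub-millisecond fraction in rand_a."""
        ns = time.time_ns()
        unix_ts_ms = ns // 1_000_000                     # 48-bit millisecond timestamp
        sub_ms = ((ns % 1_000_000) * 4096) // 1_000_000  # 12-bit fraction, ~244 ns steps
        value = (unix_ts_ms & ((1 << 48) - 1)) << 80     # timestamp -> bits 127..80
        value |= 0x7 << 76                               # version 7 -> bits 79..76
        value |= sub_ms << 64                            # rand_a    -> bits 75..64
        value |= 0b10 << 62                              # variant   -> bits 63..62
        value |= secrets.randbits(62)                    # rand_b    -> bits 61..0
        return uuid.UUID(int=value)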
For example, imagine you have a router that sends network packets out at the start of each microsecond, synced to wall time.
Or the OS scheduler always wakes processes up on a millisecond timer tick or some polling loop.
Now, when those packets are received by a Postgres server and processed, the time to do that is probably fairly consistent - meaning that most records probably get created at roughly the same X nanoseconds past the microsecond.
We've been using an implementation of it in Go for many years in production without issues.
The "RFC for UUIDv7", RFC 9562, explicitly mentions monotonicity in §6.2 ("Monotonicity and Counters"):
Monotonicity (each subsequent value being greater than the last) is
the backbone of time-based sortable UUIDs. Normally, time-based UUIDs
from this document will be monotonic due to an embedded timestamp;
however, implementations can guarantee additional monotonicity via
the concepts covered in this section.
* https://datatracker.ietf.org/doc/html/rfc9562#name-monotonic...
In the UUIDv7 definition (§5.7) it explicitly mentions the technique that Postgres employs for rand_a:
rand_a:
12 bits of pseudorandom data to provide uniqueness as per
Section 6.9 and/or optional constructs to guarantee additional
monotonicity as per Section 6.2. Occupies bits 52 through 63
(octets 6-7).
* https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-vers...
Note: "optional constructs to guarantee additional monotonicity". Pg makes use of that option.
As per a sibling comment, it is not breaking the spec. The comment in the Pg code even cites the spec that says what to do (and is quoted in the post):
* Generate UUID version 7 per RFC 9562, with the given timestamp.
*
* UUID version 7 consists of a Unix timestamp in milliseconds (48
* bits) and 74 random bits, excluding the required version and
* variant bits. To ensure monotonicity in scenarios of high-
* frequency UUID generation, we employ the method "Replace
* Leftmost Random Bits with Increased Clock Precision (Method 3)",
* described in the RFC. […]
>optional constructs
So it is explicitly mentioned in the RFC as optional, and Pg doesn't state that they guarantee that option. The point still stands: depending on optional behavior is a recipe for failure when the option is no longer taken.
"extra_" or "distinct_" would be a more accurate prefix for UUIDv7.
UUIDv7 is actually quite a flexible standard due to these two underspecified fields. I'm glad Postgres took advantage of that!
[0]: https://mail.python.org/pipermail/python-dev/2017-December/1...
It's now explicitly documented to be true, and you can officially rely on it. From https://docs.python.org/3/library/stdtypes.html#dict:
> Changed in version 3.7: Dictionary order is guaranteed to be insertion order.
That link documents the Python language's semantics, not the behavior of any particular interpreter.
PEPs do not provide a spec for Python: they don't cover the initial base language from before the PEP process started, and not all subsequent language changes were made through PEPs. The closest thing Python has to a cross-implementation standard is the Python Language Reference for a particular version, treating as excluded anything explicitly noted as a CPython implementation detail. Dictionaries being insertion-ordered went from a CPython implementation detail in 3.6 to a guaranteed language feature in 3.7+.
Even if the behavior went away, UUIDs unlike serials can always be safely generated directly by the application just as well as they can be generated by the database.
Going straight for that would arguably be the "better" path, and allows mocking PRNG to get sequential IDs.
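A minimal sketch of that pattern, assuming a hypothetical application-level ID source that tests can swap out (the class names are made up):

    import itertools
    import uuid

    class IdSource:
        """Production: IDs generated in the application, not the database."""
        def new(self) -> uuid.UUID:
            return uuid.uuid4()

    class SequentialIdSource(IdSource):
        """Test double: deterministic, strictly increasing UUIDs, so assertions
        about ordering need no sorting tricks or database round-trips."""
        def __init__(self) -> None:
            self._counter = itertools.count(1)

        def new(self) -> uuid.UUID:
            return uuid.UUID(int=next(self._counter))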
If you spend time making code bulletproof so it can run for like 100 years, you will have wasted a lot of effort for nothing when someone comes along and wipes it clean and replaces it with new code in 2 years. Requirements change, code changes, it’s the nature of business.
Remember, any fool can build a bridge that stands; it takes an engineer to make a bridge that barely stands.
Implementations that need monotonicity beyond the resolution of a timestamp-- like when you allocate 30 UUIDs at one instant in a batch-- can optionally use those additional bits for that purpose.
> Implementations SHOULD employ the following methods for single-node UUID implementations that require batch UUID creation or are otherwise concerned about monotonicity with high-frequency UUID generation.
(And it goes on to recommend the obvious things you'd do: use a counter in those bits when assigning a batch; use more bits of time precision; etc.)
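A rough sketch of the counter flavor in Python (invented function name; the timestamp is fixed once for the whole batch, and a real implementation would also handle counter rollover):

    import secrets
    import time
    import uuid

    def uuid7_batch(count: int) -> list:
        """Counter-in-rand_a sketch for batch allocation (RFC 9562 Section 6.2 flavor)."""
        ms = time.time_ns() // 1_000_000
        start = secrets.randbits(11)                 # seeded counter, top bit clear for headroom
        batch = []
        for i in range(count):
            value = (ms & ((1 << 48) - 1)) << 80     # shared millisecond timestamp
            value |= 0x7 << 76                       # version 7
            value |= ((start + i) & 0xFFF) << 64     # 12-bit counter in rand_a
            value |= 0b10 << 62                      # variant
            value |= secrets.randbits(62)            # rand_b stays random
            batch.append(uuid.UUID(int=value))
        return batch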
The comment in PostgreSQL before the implementation makes it clear that they chose the third option for this in the RFC:
* variant bits. To ensure monotonicity in scenarios of high-
* frequency UUID generation, we employ the method "Replace
* Leftmost Random Bits with Increased Clock Precision (Method 3)",
* described in the RFC. ...
Sometimes the best you can do is recognize who you're working with today, know how they work, and be prepared for those people to be different in the future (or of a different mind) and for things to change regardless to expressed guarantees.
....unless we're talking about the laws of physics... ...that's different...
Sure, and here I am in a third company doing a cloud migration and changing our default DB from MySQL to SQL Server. The pain is real: the 2-year roadmap is now 5 years longer. All because some dude negotiated a discount on cloud services. And we still develop integrations that talk to systems written for DOS.
The use of rand_a for extra monotonicity is optional. The monotonicity itself is not optional.
§5.7 states:
Alternatively, implementations MAY fill the 74 bits,
jointly, with a combination of the following subfields,
in this order from the most significant bits to the least,
to guarantee additional monotonicity within a millisecond:
Guaranteeing additional monotonicity means that there is already a 'base' level of monotonicity, and there are provisions for even more ("additional") levels of it. This 'base level' is why §6.2 states:
Monotonicity (each subsequent value being greater than the last) is
the backbone of time-based sortable UUIDs. Normally, time-based UUIDs
from this document will be monotonic due to an embedded timestamp;
however, implementations can guarantee additional monotonicity via
the concepts covered in this section.
"Backbone of time-based sortable UUIDs"; "additional monotonicity". Additional: adding to what's already there.According to [1] due to the birthday paradox, the probability of a collision in any given nanosecond would be 3E−17 which of course sounds pretty low
But there are 3.154e+16 nanoseconds in a year - and if you get out your high-precision calculator, it'll tell you there's a 61.41% chance of a collision in a year.
Of course you might very well say "Who needs 16 UUIDs per nanosecond anyway?"
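If you want to reproduce the arithmetic, here it is with the rounded 3E−17 per-nanosecond figure quoted above (which assumes roughly 16 IDs per nanosecond and 62 random bits):

    import math

    p_per_ns = 3e-17                  # roughly (16 * 15 / 2) / 2**62
    ns_per_year = 3.154e16
    # 1 - (1 - p)**n, computed via exp so the tiny p isn't lost to float rounding
    p_year = 1 - math.exp(-p_per_ns * ns_per_year)
    print(f"{p_year:.2%}")            # roughly 61%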
What could be useful here is if Postgres provided a way to determine the latest frozen UUID. This could be a few ms behind the last committed UUID, but should guarantee that no new rows will land before the frozen UUID. Then we can use a single cursor to track what we've previously seen.
And they might be quite non-uniform. If the scheduler tick and the nanosecond clock are synchronous, you could end up with a few thousand popular values instead of a billion.
It's not a real concern today, and probably won't be a real concern in 10 years, but it's not so far removed from possibility that no one ever has to think about it.
I have used this guarantee for events generated on clients. It really simplifies a lot of reasoning.
The biggest advantage is that it is hex. Haven't yet met a database system that doesn't have functions for substr and from_hex etc, meaning you can extract the time part using vanilla sql.
ULID and others that use custom variants of base32 or base62 or whatever are just about impossible to wrangle with normal tooling.
Your future selves will thank you for being able to manipulate it in whatever database you use in the future to analyse old logs or import whatever data you generate today.
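For illustration (in Python rather than SQL, and with a made-up UUIDv7 value): the first 12 hex characters are the 48-bit Unix timestamp in milliseconds, so extracting it really is just a substring plus a hex-to-integer conversion.

    from datetime import datetime, timezone

    u = "0192d3c0-5e1a-7cc3-9f4e-0123456789ab"   # hypothetical UUIDv7
    ms = int(u.replace("-", "")[:12], 16)        # substr + from_hex, in SQL terms
    print(datetime.fromtimestamp(ms / 1000, tz=timezone.utc))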
Let's say you need an opaque unique handle, and a timestamp, and a monotonically increasing row ID. Common enough. Do they have to be the same thing? Should they be the same thing? Because to me that sounds like three things: an autoincrementing primary key, a UUIDv4, and a nanosecond timestamp.
Is it always ok that the 'opaque' unique ID isn't opaque at all, that it's carrying around a timestamp? Will that allow correlating things which maybe you didn't want hostiles to correlate? Are you 100% sure that you'll never want, or need, to re-timestamp data without changing its global ID?
Maybe you do need these things unnormalized and conflated. Do you though? At least ask the question.
I want to share a django library I wrote a little while back which allows for prefixed identity fields, in the same style as Stripe's ID fields (obj_XXXXXXXXX):
https://github.com/jleclanche/django-prefixed-identity-field...
This gives a PrefixedIdentityField(prefix="obj_"), which is backed by uuid7 and base58. In the database, the IDs are stored as UUIDs, which makes them an efficient field -- they are transformed into prefixed IDs when coming out of the database, which makes them perfect for APIs.
(I know, no documentation .. if someone wants to use this, feel free to file issues to ask questions, I'd love to help)
Also, to compare: the ULID spec's technique for monotonicity is to take a single random value and then start incrementing its lowest bits, trading random entropy for direct "nearness", one after another. The rand_a approach, by contrast, effectively uses the most significant bits, but keeps more random entropy.
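A simplified sketch of the ULID-style behavior for contrast (an approximation, not the ULID reference implementation):

    import secrets
    import time

    _last_ms = -1
    _last_rand = 0

    def ulid_style_random_part():
        """Within one millisecond, reuse the previous 80-bit random value plus one,
        trading entropy for strict ordering; otherwise draw fresh randomness."""
        global _last_ms, _last_rand
        ms = time.time_ns() // 1_000_000
        if ms == _last_ms:
            _last_rand += 1          # increment the least significant bits
        else:
            _last_ms, _last_rand = ms, secrets.randbits(80)
        return ms, _last_rand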
I did work on a project using ULIDs in SQL Server. They were stored in uniqueidentifier fields with a complex byte swap from ULID to fake-UUID to get better storage/indexing out of SQL Server [1]. There was an attempt to use SQL functions to display/search the ULID form directly in the database, but it was never as bug-free as the C# byte-order code, so doing it directly in the DB was definitely not recommended; if a "report" was missing, it was supposed to become part of the application (which was already almost nothing but a bloated "reporting" tool) or a related "configuration" application. That felt more like a feature than a bug, because it kept some meddling and drama out of the DB. I also see the arguments for why, in some other types of applications, this makes debugging a lot harder; those arguments make sense, and it is definitely a trade-off to consider.
[1] The rabbit hole into SQL Server's ancient weird UUID sort order: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...
My implementation supports graceful degradation between nanosecond-scale resolution, microsecond, and millisecond, by using 12 bits for each and filling up the leftmost bits of rand_a and rand_b. Not all environments provide high-resolution system clocks with no drift, so it is important to maintain monotonicity when generating IDs with a low-res timestamp as input. You still want the bits that would've held the nanosecond value to be monotonic.
Neither of the existing uuid_utils and uuid7 python libs that can generate UUID7s support this monotonicity property.
Am planning on using this for ArchiveBox append-only "snapshot" records, which are intrinsically linked to time, so it's a good use-case imo.
There's another great resource here that I think is one of the best explainers of UUIDv7: https://antonz.org/uuidv7/
Whatever you do, don't implement the cursed 36-bit whole-second based time UUIDv7 variant that you occasionally see on StackOverflow / blog posts, stick to 48!
Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.
When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."
Please don't fulminate. Please don't sneer, including at the rest of the community.
Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
“Normally, I am at home because I do not have a reason to go out; however, sometimes I am at home because I am sleeping.”
Notice how this statement does not actually mean that I am always at home.
Or to put it another way: OP is suggesting you don't depend on it being properly monotonic, because the default is that it is only partially monotonic.
It's explicitly partially monotonic.
Or as other people would call it, "not monotonic".
People are talking past each other based on their use of the word "monotonic".
A millisecond divided by 4096 is not a nanosecond. It's about 250 nanoseconds.
Collection<UUID> generate(final int count);
I also have an interface that I can back with an RNG that generates auto-incrementing values and sorts for testing, so I have the experience of ints, but for production, my non-timestamp component is random.
"Might generate two IDs in the same millisecond" is not a very exotic occurrence. It makes a big difference whether the rest is guaranteed or not.
> And PostgreSQL used one of these recommended approaches and documented it.
Well that's the center of the issue, right? OP's interpretation was that PostgreSQL did not document such, and so it shouldn't be relied upon. If it is a documented promise, then things are fine.
But is it actually in the documentation? A source code comment saying it uses a certain method isn't a promise that it will always use that method.
If you need to retrieve rows by time/order, you use the timestamp. If you need a row by ID, you use the ID.
The use-cases where you actually need to know which row was inserted first seem exceedingly rare (mostly financial / auditing applications), and even then can probably be handled with separate transactions (as you touch on).
And the correct answer is…we don't know. We have a commit that landed and the explanation of the commit; we don't have the documentation for the corresponding release of Postgres, because…it doesn't exist yet. Because monotonicity is an important feature for UUIDv7, it would be very odd if Postgres used an implementation that took the extra effort to use a nanosecond-level time value as the high-order portion of the random part of the UUID, instead of the minimum millisecond-level time, and then not document that. Still, any assumption about what will be the documented, reliable-going-forward advertised feature is speculative until the documentation exists and is finalized.
OTOH, its perfectly fine to talk about what the implementation allows now, because that kind of thing is important to the decision about what should be documented and committed to going forward.
So, no, the “time part” of the postgres implementation is, in part, one of the options discussed in the spec, not merely the “time part” required in the spec.
This comment thread is about the guaranteed level of monotonicity. Yes, those bits exist. But you can't depend on them from something that only promises "UUIDv7". You need an additional promise that it's configured that way and actually using those bits to maintain monotonicity.
From the RFC:
Monotonicity (each subsequent value being greater than the last) is
the backbone of time-based sortable UUIDs. Normally, time-based UUIDs
from this document will be monotonic due to an embedded timestamp
"time-based UUIDs from this document will be monotonic". "will be monotonic".I'm not sure how much more explicit this can be made.
The intent of UUIDv7 is monotonicity. If an implementation is not monotonic then it's a bug in the implementation.
I doubt that the quality of the hash function is the real issue. The problem with MD5 and SHA1 is that it's easy (for MD5) and technically possible (for SHA1) to generate collisions. That makes them broken for enforcing message integrity. But a UUID is not an integrity check. Both MD5 and SHA1 are still very good as non-cryptographic hash functions. While a hash-based UUID provides obfuscation, it isn't really a security mechanism.
Even the existence of UUIDv5 feels like a knee-jerk reaction from when MD5 was "bad" but SHA1 was still "good". No hash function will protect you against de-obfuscation of low-entropy inputs. I can feed your social security number through SHA3-512 but it's not going to make it any less guessable than if I fed it through MD5.
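To make the low-entropy point concrete: a 9-digit SSN-like input has only 10^9 possible values, so recovering it from a UUIDv5 (or any other hash of it) is just a loop. The namespace and formatting below are assumptions for the example.

    import uuid
    from typing import Optional

    NS = uuid.NAMESPACE_URL                      # assumed namespace for the example

    def recover_ssn(target: uuid.UUID) -> Optional[str]:
        """Brute-force a 9-digit SSN-like value from its UUIDv5: at most 10^9
        candidates, slow in pure Python but well within reach of one machine."""
        for n in range(1_000_000_000):
            digits = f"{n:09d}"
            candidate = f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"
            if uuid.uuid5(NS, candidate) == target:
                return candidate
        return None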
Moreover, a UUID only has 122 bits of usable space. Even if we defined a new SHA2- or SHA3-based UUID version, it's still going to have to truncate the hash output to less than half of its full size. This significantly alters the security properties of the hash function, though I'm not sure if much cryptanalysis has been done on the shorter forms to see if they're more practically breakable yet.
There is one area where the collision resistance of the hash function could be a concern, though. If all of the inputs to the hash are under the control of a potential attacker, then maliciously constructed data could produce the same UUID. I still wouldn't think this would be a major issue, since most databases will fail to insert a duplicate key, but it might allow for various denial of service attacks. This still feels like quite a niche risk, though, and very circumstance-dependent.
If anyone is interested, here is the package: https://github.com/cmackenzie1/go-uuid. It also includes a CLI similar to that of `uuidgen`, but supports version 7.
I also worked on applications that used ULIDs in string form only, in NoSQL documents, string-based cache keys, and string indexes, just fine. I didn't try `char(26)` columns in a DB to see how well, for instance, SQL Server clustered and indexed them, but I've seen SQL Server do just fine with someone's wild idea of clustering a `varchar(MAX)` field, and I'm sure it can probably handle that just fine on the technical side.
It's nice that you can easily convert a ULID to a 128-bit key, but you certainly don't have to. (Also, people really do like the ugly dashed hex form of UUIDs sometimes, and I've seen people store those directly as strings in places where you'd expect they would just store the 128-bit value; it goes both ways here, I suppose.)
Similarly, UUIDv4 is also prohibited in many contexts because people using weak entropy sources has been a recurring problem in real systems. It isn’t a theoretical issue, it has actually happened repeatedly. Decentralized generation of UUIDv4 is not trusted because humans struggle to implement it correctly, causing collisions where none are expected.
There are also contexts where probabilistic collision resistance is disallowed because collision probabilities, while low, are high enough to be theoretically plausible. Most people aren’t working on systems this large yet.
Ironically, there are many reasonable ways to construct secure 128-bit identity values, but the standards don't define one. Some flavor of deterministic generation plus encryption is not uncommon, but it is also non-standard.
That said, many companies unavoidably have a mix of standard and non-standard UUIDs internally. To mitigate collisions, they have to transform those UUIDs into something else UUID-like, at which point it is pretty much guaranteed to be non-standard. Not ideal but that is the world we live in.