
264 points davidgomes | 17 comments
paulryanrogers ◴[] No.41875055[source]
Upgrades are hard. There was no replication in the before times. The original block-level replication didn't work among different major versions. Slony was a painful workaround based on triggers that amplified writes.

Newer PostgreSQL versions are better. Yet still not quite as robust or easy as MySQL.

At a certain scale even MySQL upgrades can be painful. At least when you cannot spare more than a few minutes of downtime.

replies(7): >>41875126 #>>41876174 #>>41876232 #>>41876375 #>>41877029 #>>41877268 #>>41877959 #
api ◴[] No.41875126[source]
I've always wondered why Postgres is so insanely popular. I mean it has some nice things like very powerful support for a very comprehensive subset of SQL functionality, but most apps don't need all that.

It really feels like early 1990s vintage Unix software. It's clunky and arcane and it's hard to feel confident doing anything complex with it.

replies(9): >>41875168 #>>41875240 #>>41875306 #>>41875587 #>>41876346 #>>41876770 #>>41877119 #>>41877951 #>>41878394 #
tpmoney ◴[] No.41875306[source]
> I've always wondered why Postgres is so insanely popular.

In no particular order, my preference for postgres is driven by:

  * Date / time functions that don't suck
  * UTF-8 is really UTF-8
  * 99% of a backup can be done live with nothing more than rsyncing the data directory and the WAL files
  * Really comprehensive documentation
  * LTREE and fuzzy string match extensions
  * Familiarity from using it for years
MySQL/Maria I'm sure is fine, but it's one of those things where it's just different enough, and I haven't encountered a compelling use case for changing my preference.
replies(3): >>41875336 #>>41876296 #>>41879736 #
fhdsgbbcaA ◴[] No.41875336[source]
UTF-8 is what made me switch. It’s insane that MySQL has something called UTF-8 that isn't really UTF-8, but does have a type UTF8MB4 that actually is correct. This means that if you use UTF-8 in MySQL, you can’t use emoji, for example.
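
(For illustration, a rough sketch of the difference in MySQL, with made-up table names; in strict mode the first insert is rejected outright:)

  -- MySQL's legacy "utf8" is an alias for utf8mb3 (max 3 bytes per character),
  -- so 4-byte characters such as emoji can't be stored.
  CREATE TABLE posts_mb3 (body VARCHAR(255) CHARACTER SET utf8mb3);
  INSERT INTO posts_mb3 (body) VALUES ('🎉');   -- "Incorrect string value" error

  -- utf8mb4 is real UTF-8 and handles the same value fine.
  CREATE TABLE posts_mb4 (body VARCHAR(255) CHARACTER SET utf8mb4);
  INSERT INTO posts_mb4 (body) VALUES ('🎉');   -- works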
replies(3): >>41875430 #>>41876285 #>>41878032 #
1. bastawhiz ◴[] No.41875430[source]
And the fact that adding real utf-8 support limited (limits?) the length of strings that can be indexed
replies(1): >>41875764 #
2. evanelias ◴[] No.41875764[source]
Postgres limits btree keys to 2704 bytes, which is actually slightly smaller than MySQL's limit of 3072 bytes, assuming the default InnoDB storage engine.

That said, when using utf8mb4 in an index key, MySQL uses the "worst case" of each character being 4 bytes. So it effectively limits the max key size to 3072/4 = 768 characters, when a column is using the utf8mb4 character set.

For practical purposes, this doesn't cause much pain, as it's generally inadvisable to use complete long-ish strings as a key. And there are various workarounds, like using prefixes or hashes as the key, or using binary strings as keys to get the full 3072 bytes (if you don't need collation behaviors).
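
A rough sketch of how the limit surfaces and of the prefix workaround, with the column size picked purely for illustration:

  -- With utf8mb4, InnoDB budgets 4 bytes per character for the key, so a full
  -- index on a 1000-character column would be 4000 bytes, over the 3072-byte limit:
  CREATE TABLE pages (url VARCHAR(1000) CHARACTER SET utf8mb4);
  ALTER TABLE pages ADD KEY idx_url_full (url);        -- error 1071: key too long (4000 > 3072)
  ALTER TABLE pages ADD KEY idx_url_prefix (url(255)); -- ok: 255 * 4 = 1020 bytes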

replies(1): >>41875901 #
3. bastawhiz ◴[] No.41875901[source]
> So it effectively limits the max key size to 3072/4 = 768 characters, when a column is using the utf8mb4 character set.

This is exactly what I mean. 768 characters for an index is woefully bad. And for no obviously great reason: you can just index the encoded UTF-8 text.

This was literally the reason why a former company (which will remain nameless) refused to add Unicode support. It's not even an imagined problem.

replies(2): >>41876001 #>>41876347 #
4. evanelias ◴[] No.41876001{3}[source]
Why did you need to index fairly long strings in their entirety in a way that preserves collation behaviors?

And why is a 768 character limit woefully bad, but a 2704 character limit is totally fine?

replies(2): >>41876259 #>>41876450 #
5. bastawhiz ◴[] No.41876259{4}[source]
A URL, for instance, can't be safely stored in 768 characters, but it can be stored safely in 2704. If you then wanted to sort those URLs so that all URLs for each domain and path within that domain are adjacent, you need an index. Especially if you want to paginate over them with a cursor. Doing that without an index on the raw value is a royal pain in the ass.

Hell, even just being able to sort user-submitted strings up to a kilobyte. Why up to a kilobyte? Some users have strings that are kind of long. If I have to define a second column that's the truncated prefix, that's just a silly waste of space because MySQL decided to use utf-32 under the hood.
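
(The cursor pagination I mean looks roughly like this, assuming a plain index on the column; the schema is made up:)

  -- Assumes an index on url so both the sort and the seek can use it.
  SELECT url
  FROM crawled_pages
  WHERE url > ?        -- cursor: the last url returned on the previous page
  ORDER BY url
  LIMIT 100;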

replies(3): >>41876474 #>>41876664 #>>41876823 #
6. sgarland ◴[] No.41876347{3}[source]
You should not be indexing 768 characters in any circumstance I can imagine. Go ahead and try it. Spin up two tables, fill them with a few million rows, and slap an index on them. Give one a reasonable prefix limit, and let the other go wild. Make sure you ANALYZE each, then run queries in a loop and check the times.

Spoiler: I literally did this a couple of days ago. The index size bloat means that any possible savings you might have gained from collisions are obliterated from page fetches. I tested with a measly 128 characters vs. a prefix of 16, and that was enough for the average query time to be equal, with the smaller index winning for the minimum.
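
Roughly, the shape of the test, with made-up names:

  -- Same data in both tables; the only difference is how much of the column is indexed.
  CREATE TABLE t_full   (s VARCHAR(128) CHARACTER SET utf8mb4, KEY idx_s (s));
  CREATE TABLE t_prefix (s VARCHAR(128) CHARACTER SET utf8mb4, KEY idx_s (s(16)));

  -- ... load the same few million rows into each ...

  ANALYZE TABLE t_full;
  ANALYZE TABLE t_prefix;

  -- Then time the same lookups in a loop against each:
  SELECT * FROM t_full   WHERE s = 'some value';
  SELECT * FROM t_prefix WHERE s = 'some value';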

7. fweimer ◴[] No.41876450{4}[source]
You might just load someone else's data, and the index is desirable in general for speeding up analytic queries. It's possible to work around that, of course. But depending on what you do, it can make writing efficient queries against the data more difficult. That's just a distraction because most of the time, those long columns won't matter anyway.
8. evanelias ◴[] No.41876474{5}[source]
> A URL, for instance

VARBINARY is typically ok for that I'd think? Then you can utilize the full 3072 byte limit for the key, since there's no character set applied.
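
Something like this, sized just as an example:

  -- No character set or collation, so the key is counted in raw bytes
  -- and the full 3072-byte InnoDB limit is available.
  CREATE TABLE crawled_urls (
    url VARBINARY(3000) NOT NULL,
    UNIQUE KEY uk_url (url)
  );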

> even just being able to sort user-submitted strings up to a kilobyte

As a software engineer, I completely agree. But as a DBA, I am obligated to make a "tsk" sound and scowl disapprovingly!

9. crazygringo ◴[] No.41876664{5}[source]
To be honest, indexes aren't designed for that. They're meant for fast lookup of short identifiers. Things like people's names and product ID's. Not long URL's. It's not performant.

If you need to keep a million long URL's in a defined sort order, my first recommendation would be, don't -- see if there's another way to achieve your end result. But if you absolutely have to, then create a new integer column to be your sort key, and use a little bit of extra code to give it values that produce the same sort order.

Creating short numerical primary keys for long strings is a common database technique.
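
A minimal sketch of that shape (names are hypothetical, and the application is responsible for assigning sort_key values):

  CREATE TABLE urls (
    id       BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    url      TEXT NOT NULL,
    sort_key BIGINT NOT NULL,     -- maintained by the app to mirror URL sort order
    KEY idx_sort (sort_key)
  );

  -- Reads come back in URL order without indexing the long string itself:
  SELECT url FROM urls ORDER BY sort_key LIMIT 100;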

replies(1): >>41880646 #
10. The_Colonel ◴[] No.41876823{5}[source]
> it can be stored safely in 2704

No, it can't. A URL doesn't have any length limit, regardless of the fact that different software will impose different limits.

replies(1): >>41880612 #
11. bastawhiz ◴[] No.41880612{6}[source]
Browser address bars have a limit of 2048, so if that's your use case, yes it's safe.
replies(1): >>41885721 #
12. bastawhiz ◴[] No.41880646{6}[source]
> indexes aren't designed for that. They're meant for fast lookup of short identifiers. Things like people's names and product ID's. Not long URL's. It's not performant.

This is objectively false. If this was true, indexes wouldn't serve range queries. You couldn't index on dates. You couldn't sort numbers.

> But if you absolutely have to, then create a new integer column to be your sort key, and use a little bit of extra code to give it values that produce the same sort order.

This fails when you need to insert new values into the table. Then you not only need to figure out the new integer value (how, if you can't efficiently compare sorted string values???), you need to update all the integers to make room.

replies(1): >>41881416 #
13. crazygringo ◴[] No.41881416{7}[source]
Sorry, I was considering short things like dates and numbers as identifiers. I realize that's not quite right -- what I should have said was that indexes are designed for short things period (short identifiers being one of those things). Thanks.

> This fails when you need to insert new values into the table.

Yes, that's part of the extra code you need to keep the values accurately sorted. There are a lot of different particular code solutions that might work -- whether allowing for collisions and re-ordering every night with a cron job, or putting large gaps between numbers, or using floats.

But my main point stands, which is that standard relational databases are not designed to be able to maintain a sorted index of long URL's out of the box. Indexes aren't meant for that and they won't work, and this is by design. You're going to have to roll your own code for that.

Fortunately I've never come across a case in the wild where maintaining a globally sorted list of long items was required (though I'm not saying they never exist). E.g. if you're building a spider that needs to match against URL's, you'd index a short hash of the URL as a non-unique index. Or if you wanted to display sorted URL's for a site, you'd index by domain name only, and then sort the remainder of the URL at query time.
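
Sketches of those two shapes, with hypothetical names:

  -- Spider-style exact matching: index a short hash, keep the full URL unindexed.
  CREATE TABLE seen_urls (
    url_hash BINARY(16) NOT NULL,   -- e.g. UNHEX(MD5(url)), computed by the app
    url      TEXT NOT NULL,
    KEY idx_hash (url_hash)         -- non-unique; collisions resolved by comparing url
  );

  -- Per-site listing: index the domain only, sort the rest at query time.
  CREATE TABLE site_urls (
    domain VARCHAR(255) NOT NULL,
    path   TEXT NOT NULL,
    KEY idx_domain (domain)
  );
  SELECT path FROM site_urls WHERE domain = 'example.com' ORDER BY path;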

replies(1): >>41883759 #
14. bastawhiz ◴[] No.41883759{8}[source]
> But my main point stands, which is that standard relational databases are not designed to be able to maintain a sorted index of long URL's out of the box.

You keep saying that, but Postgres does a great job with no issues without any extra work. MySQL is alone in being suboptimal. "It's not designed for that" isn't a good answer, if it works great. Show me how the underlying data structures fail or perform poorly if it's really not something you should do.

replies(2): >>41883900 #>>41884084 #
15. crazygringo ◴[] No.41883900{9}[source]
No, Postgres doesn't. 2730 bytes is not long enough to hold all URL's encountered in the wild. But also, your performance will suffer if you use that whole length. You generally don't want to be doing that.

The difference between MySQL and Postgres here is negligible. It doesn't matter exactly where you define the limit of a short field, except it should probably be able to hold a maximum length filename which is 255 characters, plus some room to spare. Both MySQL and Postgres do this fine.

16. evanelias ◴[] No.41884084{9}[source]
> MySQL is alone in being suboptimal.

It's only suboptimal if you choose the wrong column type for the task at hand. For storing URLs, you almost certainly don't want collation behaviors, such as accent insensitivity or case insensitivity. So VARBINARY is a better choice here anyway.

And as several other commenters have mentioned, at large scale, indexing a bunch of long URLs in b-trees is indeed a bad practice performance-wise in any relational database. You won't be able to fit many entries per page, so read performance will be slow, especially for range scans.

In that situation it's almost always better to use a non-unique index over a prefix (if you need sorting and range scans) or a hash (if you don't), and disambiguate collisions by having the full value in an unindexed column. And/or split the URL up between the domain name and path in separate columns. If needed, normalize the domain names into a separate table so that the URL table can refer to them by numeric ID. etc. All depends on the specific use-case.
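
For concreteness, one possible shape of the split/normalized variant (all names and sizes are illustrative):

  CREATE TABLE domains (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    UNIQUE KEY uk_name (name)
  );

  CREATE TABLE urls (
    id        BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    domain_id INT UNSIGNED NOT NULL,
    path      VARBINARY(2048) NOT NULL,             -- full value lives here
    KEY idx_domain_path (domain_id, path(255)),     -- short prefix keeps the key small
    CONSTRAINT fk_urls_domain FOREIGN KEY (domain_id) REFERENCES domains (id)
  );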

17. The_Colonel ◴[] No.41885721{7}[source]
Safari has a limit of 80,000 characters, Firefox 65K.

There are plenty of needs to store URLs which will never go through a browser.

You can only claim that "some URL use cases" can be stored in 2048 characters.