It is an extension though so downside there is it not being included in most Postgres installs. I’ve used it at work and it felt like a superpower getting the benefits of a vacuum full without all the usual drama.
Short version is that VACUUM is needed to clean up dead tuples and reclaim disk space. For most cases with smaller amounts of data, auto-vacuum works totally fine. But I've had issues with tables with 100m+ rows that are frequently updated where auto-vacuum falls behind and stops working completely. These necessitated a full data dump + restore (because we didn't want to double our storage capacity to do a full vacuum). We fixed this by sharding the table and tweaking auto-vacuum to run more frequently, but this isn't stuff you have to worry about in MySQL.
Honestly if you're a small shop without database/postgres experts and MySQL performance is adequate for you, I wouldn't switch. Newer versions of MySQL have fixed the egregious issues, like silent data truncation on INSERT by default, and it's easier to maintain, in my experience.
PostgreSQL tables are known as heaps, which consist of slotted pages where new data is written to the first page with sufficient free space. Since it's not a b-tree and you can't resolve a row with just a primary key without a table scan, Postgres uses the physical location of the row called a tuple ID (TID, or item pointer) that contains the page and position (slot) of the row within that page. So the TID (10, 3) tells Postgres the row is in block 10 slot 3 which can be fetched directly from the page buffer or disk without having to do a tree traversal.
When PostgreSQL updates a row, it doesn’t modify the original data directly. Instead, it:
1) Writes a new version of the row to a new page
2) Marks the old row as outdated by updating its tuple header and relevant page metadata
3) Updates the visibility map to indicate that the page contains outdated rows
4) Adjusts indexes to point to the new TID of the updated row
This means that indexes need to be updated even if the column value didn't change.Old rows continue to accumulate in the heap until the VACUUM process permanently deletes them, but this process can impact normal operations and cause issues.
Overall this means Postgres does more disk I/O for the same work as MySQL. The upside is Postgres doesn't have to worry about page splits, so things like bulk inserts can be much more efficient.
We've been using https://github.com/dataegret/pgcompacttable to clean up bloat without impacting stability/performance as much as pg_repack does.
Too bad it's sorta annoying to do on plain old pg.
“Stops working completely” should not be a thing, it could be vacuuming slower than the update frequency (although that’d be surprising) but I don’t know of any reason it’d just stop?
That being said I’ve also had issues with autovac (on aurora to be fair, couldn’t say if it was aurora-specific) like it running constantly without vacuuming anything, like there was an old transaction idling (there wasn’t)
Aside: I wish Postgres forced to make explicit the lock taken. Make me write “TAKE LOCK ACCESS EXCLUSIVE VACUUM FULL my_table”, and fail if the lock I take is too weak. Implicit locks are such a massive footgun that have caused countless incidents across the world, it’s just bad design.
In any case, this got resolved but caused a huge operational headache, and isn't something that would have been a problem with MySQL. I feel like that's the main reason VACUUM gets hated on; all of the problems with it are solvable, but you only find those problems by running into them, and when you run into them on your production database it ends up somewhere between "pain in the ass" and "total nightmare" to resolve.
It continues to baffle me why AWS picks some truly terrible defaults for parameter groups. I understand most of them come from the RDBMS defaults, but AWS has the luxury of knowing precisely how many CPUs and RAM any given instance has. On any decently-sized instance, it should allocate far more memory for maintenance_work_mem, for example.
Not in the heap, but if you have any index on the table (I know, don’t do that for bulk loads, but many don’t / it isn’t feasible sometimes) then you’re still dealing with a B+tree (probably).
Also, MySQL still gets the nod for pure bulk load speed via MySQLShell’s Parallel Import Utility [0]. You can of course replicate this in Postgres by manually splitting the input file and running multiple \COPY commands, but having a tool do it all in one is lovely.
[0]: https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-uti...
Absolutely, though they're generally orders of magnitude smaller than the table file unless you're INCLUDE'ing lots of columns.
There's pg_bulkload which supports parallel writers as well as deferred index updates until the loading process is complete. Not sure how it compares to what MySQL offers out of the box, but I definitely agree that the MySQL tooling ecosystem in general has a leg up.
In normal circumstances it's not needed. The unused space is reused for further inserts and if your database maintains steady rate of inserts/deletes, this won't be an issue.
However I experienced situation when database starts to grow uncontrollably and it was happening for weeks (when actual workload didn't change). I don't know what causes that behaviour. The solution was to run `vacuum full`.