
366 points virtualwhys | 7 comments
halayli ◴[] No.41899794[source]
This topic cannot be discussed without talking about disks. SSDs write a 4k page at a time, meaning that if you're going to update 1 bit, the disk will read 4k, update the bit, and write back a 4k page in a new slot. So the penalty for copying varies depending on the disk type.
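To put rough numbers on that, here is a minimal sketch. It assumes a 4096-byte physical page (an assumption; real devices vary) and ignores erase blocks and garbage collection. `write_amplification` is a hypothetical helper name, not part of any library.

```python
import math


def write_amplification(changed_bytes: float, page_size: int = 4096) -> float:
    """Ratio of bytes physically written to bytes logically changed.

    Assumes every write is rounded up to whole pages of `page_size` bytes.
    The 4096 default is an assumed value, not a property of every SSD.
    """
    if changed_bytes <= 0:
        raise ValueError("changed_bytes must be positive")
    pages = math.ceil(changed_bytes / page_size)
    return pages * page_size / changed_bytes


# One bit is 1/8 of a byte, yet a whole page is rewritten.
one_bit = write_amplification(1 / 8)
# Changing exactly one full page costs nothing extra under this model.
full_page = write_amplification(4096)
```

Under these assumptions the one-bit update is the worst case, and the ratio falls toward 1 as the change approaches a whole number of pages.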
replies(2): >>41900275 #>>41904085 #
1. srcreigh ◴[] No.41900275[source]
Postgres pages are 8kb so the point is moot.
replies(2): >>41901535 #>>41901808 #
2. olavgg ◴[] No.41901535[source]
The default is 8kb, but it can be recompiled for 4kb-32kb. I actually prefer 32kb because with ZSTD compression it will most likely only use 8kb after being compressed. The average compression ratio with ZSTD is usually between 4x and 6x, but depending on how compressible your data is, you may get a lot less. Note that changing the block size requires initializing a new data directory for your Postgres database.
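The arithmetic behind that estimate, as a quick sketch. The 4x and 6x ratios are the figures quoted above rather than measurements, and `compressed_size_kb` is a hypothetical helper.

```python
def compressed_size_kb(block_kb: float, ratio: float) -> float:
    """On-disk size of one block, assuming a uniform compression ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return block_kb / ratio


# A 32 kB block at the two quoted ratios.
at_4x = compressed_size_kb(32, 4)
at_6x = compressed_size_kb(32, 6)
print(f"32 kB block: {at_4x:.2f} kB at 4x, {at_6x:.2f} kB at 6x")
```

At 4x a 32 kB block lands on 8 kB, and at 6x it comes in a bit over 5 kB, so the "about 8kb" figure corresponds to the low end of the quoted range.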
3. halayli ◴[] No.41901808[source]
I am referring to physical pages on an SSD. The 8k pg page maps to 2 pages on a typical SSD. Your comment proves my initial point, which is that write amplification cannot be discussed without talking about disk types and their behavior.
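A small sketch of that mapping, assuming 4096-byte physical pages (again an assumption; the real size depends on the device). `pages_touched` is a hypothetical helper that counts how many physical pages a write of a given length overlaps, starting at a given byte offset.

```python
def pages_touched(offset: int, length: int, page_size: int = 4096) -> int:
    """Number of physical pages overlapped by a write of `length` bytes
    starting at byte `offset`, for an assumed `page_size`."""
    if length <= 0:
        return 0
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


aligned = pages_touched(0, 8192)       # 8k write starting on a page boundary
misaligned = pages_touched(512, 8192)  # same write shifted by 512 bytes
```

An aligned 8k write covers exactly 2 pages under this model, while the same write shifted off a page boundary spills into a third.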
replies(2): >>41902116 #>>41903957 #
4. emptiestplace ◴[] No.41902116[source]
Huh? It seems you've forgotten that you were just saying that a single bit change would result in a 4096 byte write.
replies(1): >>41906755 #
5. mschuster91 ◴[] No.41903957[source]
> The 8k pg page maps to 2 pages in a typical SSD disk.

You might end up with even more than that due to filesystem metadata (inode records, checksums), metadata of an underlying RAID mechanism or, when going over some sort of network, things like Ethernet frame sizes/MTU.

In an ideal world, there would be a clear interface that a program could use to determine, for any given combination of storage media, HW RAID, transport layer (local attach vs. stuff like iSCSI or NFS), SW RAID (e.g. mdraid), filesystem and filesystem features, what the most sensible minimum changeable unit is, so as to avoid unnecessary write amplification.

6. Tostino ◴[] No.41906755{3}[source]
> a single bit change would result in a 4096 byte write

On (most) SSD hardware, regardless of what software you are using to do the writes.

At least that's how I read their comment.

replies(1): >>41907050 #
7. emptiestplace ◴[] No.41907050{4}[source]
Right, and if pg writes 8192 bytes every time, this is no longer relevant.