> You have to take a contiguous block of 2M, wipe the entire thing, rewrite whatever part of it was useful,
This is called garbage collection. It can happen at any time in the background, but it becomes more frequent, and may move into the write path, as the drive fills. 2M is just an example size; the actual erase-block size varies by drive model and is rarely disclosed.
> Modern high end SSDs have various ways of dealing with this like a RAM cache, a SLC cache and extra reserved space to always have some spare room, but there are still limits.
There are multiple types of SLC cache as well. Client drives may have a few gigabytes of dedicated SLC that can absorb a short burst of writes. Client drives may also have pseudo-SLC (pSLC), marketed under names like TurboWrite (Samsung). With pSLC, when roughly 30% of the drive's NAND is erased, the controller uses that space in SLC mode. Since each cell then stores one bit instead of three, a 1 TB TLC drive will use about 300 GB of its space as a roughly 100 GB SLC cache.
How performance degrades as the drive fills varies widely between drive models. Some drives show significant read and write degradation long before they are half full. Others maintain fairly consistent read performance (within perhaps 90% of the advertised figure) regardless of how full they are. For instance, the original version of the Samsung 980 Pro holds close-to-spec read speeds no matter how full it is, but its write performance drops from about 5200 MB/s to somewhere around 1300 MB/s the moment it passes 70% allocated.
Datacenter and enterprise class drives tend to have lower peak performance than client drives, but their performance is much more consistent regardless of how full they are.
If you are buying a client NVMe drive for speed, buy one larger than you need and set aside at least 30% of it as unpartitioned (or unused-partition) space. This keeps the OS from ever writing to that 30% of the drive, leaving plenty of room for pSLC and similar optimizations. It also extends the life of the drive, since garbage collection has to rewrite valid data less often, which lowers the write amplification factor.[1]
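As a rough sketch of what that looks like on Linux (the device name and sizes here are hypothetical; adjust for your drive), you can partition only ~70% of the drive and simply leave the remainder unallocated:

```
# Hypothetical 1 TB drive at /dev/nvme0n1 -- adjust names and sizes.
# If the drive has been used before, discard it first so the controller
# knows the spare area is actually free (THIS ERASES THE WHOLE DRIVE):
blkdiscard /dev/nvme0n1

# Partition roughly 70% of the drive; leave the last ~30% unallocated.
parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart primary ext4 1MiB 70%
mkfs.ext4 /dev/nvme0n1p1
```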
1. See Over Provisioning at https://semiconductor.samsung.com/consumer-storage/magician/
> This is what TRIM (confusingly also known as 'discard' in some contexts)
But wait, there are more terms for the same concept: TRIM (ATA), UNMAP (SCSI), and Deallocate (NVMe) are the interface-specific commands Linux uses to perform a discard.
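If you want to check whether a given device actually advertises discard support before worrying about which command gets used, lsblk can show you (the device name here is just a placeholder):

```
# Non-zero DISC-GRAN and DISC-MAX columns mean the device accepts discards
lsblk --discard /dev/nvme0n1
```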
> Run `fstrim /mountpoint`
Or, if there is no filesystem on the device, `blkdiscard /dev/nvmeXn1[pY]` (which discards everything on that device or partition).
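For ongoing use, the usual approach is periodic trimming rather than running it by hand; a minimal sketch, assuming a systemd-based distro:

```
# One-off: trim every mounted filesystem that supports discard, verbosely
fstrim -av

# Ongoing: enable the weekly timer that most distros ship
systemctl enable --now fstrim.timer
```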