(In theory at least,) the kernel should take care of aggregating the writes, and the blocks will be plenty big by the time they reach the target drive, all thanks to the very same page cache GP is talking about - unless you pass "oflag=direct" to dd.
That being said, probably don't use too small of a block size - this will eat up CPU in system call overhead and slow down the copy regardless of target media type.
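To make that concrete, here's a rough sketch (the device and image names are placeholders): with a buffered copy the page cache does the batching and bs mostly controls syscall overhead, while with oflag=direct the cache is bypassed and the chosen bs is what actually hits the drive.

    # buffered: the page cache aggregates writes; bs mainly affects syscall overhead
    dd if=image.img of=/dev/sdX bs=1M status=progress conv=fsync

    # direct I/O: page cache bypassed, so the chosen bs is what the drive actually sees
    dd if=image.img of=/dev/sdX bs=4M oflag=direct status=progress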
I don't know how the internals of "cp" and the related machinery interact with the target drive; however, if you don't provide bs=1024kB, dd writes in extremely small units (1 byte at a time IIRC), which overwhelms the flash controller and creates high CPU load at the same time.
I've always used dd since it provides more direct control over the transfer stream and how it's transported. I sometimes call dd "direct-drive" because of these capabilities of the tool.
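For instance, a couple of the knobs I mean (device and file names here are only placeholders):

    # rescue-style read: don't stop on read errors, pad unreadable blocks, show progress
    dd if=/dev/sdX of=disk.img bs=64K conv=noerror,sync status=progress

    # copy a slice of a device: skip input blocks, limit the amount copied
    dd if=/dev/sdX of=slice.img bs=1M skip=100 count=50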
"By specification, its default 512 block size has had to remain unchanged for decades. Today, this tiny size makes it CPU bound by default. A script that doesn’t specify a block size is very inefficient, and any script that picks the current optimal value may slowly become obsolete — or start obsolete if it’s copied from "
While I remembered the default wrong (because I never used the defaults, and I was too lazy to look it up while writing the comment), it's possible for a script to get the correct block size every time.
There are ways to get the block size of a device. Multiply it by 2 to 4 (or more), open the device directly, and you can keep it busy.
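For example (just a sketch; /dev/sdX is a placeholder and the x4 multiplier is only the rule of thumb above), the kernel exposes both the logical and physical sizes:

    # logical and physical sector sizes as the kernel reports them
    lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdX
    blockdev --getss --getpbsz /dev/sdX
    cat /sys/block/sdX/queue/{logical,physical}_block_size

    # pick a multiple of the physical size and use it as dd's block size
    bs=$(( $(blockdev --getpbsz /dev/sdX) * 4 ))
    dd if=image.img of=/dev/sdX bs=$bs status=progress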
The blog post is oblivious to the nuances of the issue and to the usefulness of "dd" in general.
Please forgive the nit-picking, I'm not attacking this (excellent) article, or your entirely sensible inclination to dig up some "physical" number, but...
With modern SSDs, "sector/block size" is rapidly approaching the vagueness of the cylinder/head/sector addressing scheme used a couple of decades ago on venerable spinning/magnetic disks.
That is - it is definitely a thing somewhere deep down, but software running on the host CPU that tries to address it won't necessarily end up addressing the same thing the user had in mind.
If you want a concrete example, look no further than the "SLC mode" cache, where a drive has a number of identical flash chips, but some of them (or even a dynamically allocated fraction of a chip) are run at a lower bits-per-cell count for higher speed/endurance. However, the erase and write block sizes of a chip are expressed in cells, not bits. What that means is that the cache and the main storage of the very same SSD have different block sizes (in bits/bytes).
> Please forgive the nit-picking, I'm not attacking this (excellent) article, or your entirely sensible inclination to dig up some "physical" number, but...
I don't think it's nitpicking. We're having a discussion here. We're technical people, and we tend to point out different aspects/perspectives of a problem and offer our opinions. That's something I love when it's done in a civilized manner.
Regarding the remaining part of your comment (I didn't quote it so this doesn't look crowded), I kindly disagree.
The beauty of SSDs is that they have a controller which fits the definition of black magic, and all the flash is abstracted behind it, but not completely. Hard drives are almost in the same realm, too.
Running a simple "smartctl -a /dev/sdX" returns a line like the following:

Sector Sizes: 512 bytes logical, 4096 bytes physical
This means I can bang it with 512-byte packets and it'll handle them fine, but the physical cell (or sector) size is different: 4kB. I have another SSD, again from the same manufacturer, which reports:
Sector Size: 512 bytes logical/physical
So I can dd to it and it'll handle it just fine, but the first one needs a bs=4kB to minimize write amplification and maximize speed.
This is exactly the same with USB flash drives. Higher-end drives will provide full SMART data (since they're bona-fide SSDs), but lower-end ones are not that talkative. Nevertheless, a common-denominator block size (like 1024kB, because drives can also be composed of huge, 512kB cells) allows any drive to divide the chunk into its physical flash sector sizes optimally and push the data down.
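As a sketch of that common-denominator approach (device and image names are placeholders; conv=fsync just makes dd flush before exiting, which matters for removable drives):

    # lower-end sticks rarely expose SMART, but the kernel still reports sector sizes
    lsblk -o NAME,TRAN,LOG-SEC,PHY-SEC /dev/sdX

    # when in doubt, 1 MiB is a multiple of any plausible flash sector/page size
    dd if=image.img of=/dev/sdX bs=1M conv=fsync status=progress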
In the case of SLC/xLC hybrid drives, the controller does the math to minimize write amplification, but again, having a perfect multiple of the reported physical sector size makes the controller's work much easier and makes things way smoother. Either the reported physical size is for the SLC part, which is what you're hitting in most cases, or the controller is already handling the multi-level logistics inside the flash array (while still thinking in terms of block sizes, since that's how it works on the bus side regardless of what happens inside).