Do you happen to know why the standard doesn’t use that trick? It always impress...

OskarS · on Jan 31, 2020

I think it's to avoid a branch in `size()`. If you use this trick, you have to check the flag if it's a small in-situ string or a "regular" heap-allocated string, because they they calculate size differently. In the standard library strings, the size is always just stored in the same position in memory, so `size()` has essentially no overhead when inlined, it's just a memory load.

Branches on their own aren't particularly expensive (especially not an easily predictable one like this), but they do screw with several compiler optimizations (auto-vectorization, for one). Given that, it seems like a reasonable decision, and the only way you'd know which one is faster is just testing it.

I'd personally think it would be interesting to test artificially increasing the sizeof() of strings from 24 bytes (the minimum needed to store a pointer, size and capacity) to something like 64 bytes, just to get longer strings using the SSO. The trade-off in the stack-space of the string seems like it would totally be worth it, and with 56 character long "small strings", a huge number of strings could fit in there (virtually all names, for instance). You probably couldn't do it for std::string (would wreck havoc with ABIs), but my hunch is that performance and memory usage might both benefit from the reduced memory allocations.

ot · on Jan 31, 2020

Yes there is some concern that it would affect code size and/or make some optimizations impossible, but on the latter I have yet to see hard data that that is the case. We're talking about adding about 1-2 cycles of latency.

Note that the branch can be implemented in a way that compilers will lower into a CMOV, making the code size issue almost moot. I implemented that in fbstring back in the day:

https://github.com/facebook/folly/commit/be4c6d6b3e21914df8a...

This is the entirety of size():

    0f b6 57 17             movzbl 0x17(%rdi),%edx
    b8 17 00 00 00          mov    $0x17,%eax
    48 29 d0                sub    %rdx,%rax
    48 0f 48 47 08          cmovs  0x8(%rdi),%rax

A bit more than a MOV, but not that much.

> I'd personally think it would be interesting to test artificially increasing the sizeof() of strings from 24 bytes (the minimum needed to store a pointer, size and capacity) to something like 64 bytes, just to get longer strings using the SSO

And libstdc++ is actually 32 bytes :( It's easy to template on the inline capacity, you can look at llvm::SmallVector or folly::small_vector, but vocabulary types should have the smallest possible footprint. Vast majority of instances are either empty, or very small (think keys in a map).

acqq · on Jan 31, 2020

> I'd personally think it would be interesting to test artificially increasing the sizeof() of strings from 24 bytes (the minimum needed to store a pointer, size and capacity) to something like 64 bytes

The "optimal" implementation would actually have different allocation sizes when used mostly for processing and when used mostly for storage: for processing one would prefer bigger defaults, for storage, one needs as little as possible initial overhead.

That could be achieved by raising the knowledge of that distinction and introducing actually (at least) two kind of strings. And even more than that can be gained by recognizing that whenever the strings are involved in some more complex structures, some additional restrictions for totality of all associated strings could also be exploited: allocating the strings separately is an overhead by itself.

If I remember correctly, Turbo Pascal strings had the default size of 256 bytes and that's how much was on the stack for each, and that was fast even decades ago. That was not efficient for having a lot of them in the structures, if a lot of them could be smaller, but that was what a programmer had to care separately - still the typical processing of strings and working with the limited number of them was nicely covered by that default.

Someone · on Jan 31, 2020

Mac OS similarly, over time, grew various short Pascal string types to be used in its APIs. https://opensource.apple.com/source/CarbonHeaders/CarbonHead...:

    typedef unsigned char Str255[256];
    typedef unsigned char Str63[64];
    typedef unsigned char Str32[33];
    typedef unsigned char Str31[32];
    typedef unsigned char Str27[28];
    typedef unsigned char Str15[16];

The 2^n-1 sizes are cache-line don’t need explanation, I think. 27 was the maximum length of a volume name in MFS, the original Macintosh file system (which was a strange mix of backwards (no directories) and forwards (255 character file names) thinking).

Str32 was used in AppleTalk (and probably a design error or bug; Str31 is a more ‘natural’ type)

gpderetta · on Jan 31, 2020

You shouldn't need a branch though. The size would always be the last element of the buffer. It doubles as the null terminator if the size is exactly 23. The only additional operation is adding the static offset to the stored value to get the actual size.

It is probably just a small win and not everybody wanted to pay for the extra complexity (SSO is already fairly complex).

Re your suggestion on large SSO, years ago, when doing document clustering, using large fixed size strings (64 chars if I remember correctly, longer strings were just truncated) was a huge win for me (even coping them around wasn't an issue). Not sure if it is appropriate for a general purpose string though.

edit: never mind, I see what you mean about having to check the size.

OskarS · on Jan 31, 2020

Yeah, exactly :) when in SSO mode, it stores the size as a delta occupying a single byte, but when in heap-allocated mode, it stores the size as a an 8-byte size_t, and you require a branch to check which case is the right one. You can see the branch here in the source [0], and godbolt confirms that for std::string it's just a memory load [1]. A shame Folly isn't available on Compiler Explorer, would be interesting to compare.

[0]: https://github.com/facebook/folly/blob/master/folly/FBString...

[1]: https://godbolt.org/z/U-LvGm

scottlamb · on Jan 31, 2020

Just a guess: the null byte is necessarily at the end of the 24 bytes rather than at the beginning. It might be in a different (typically 64-byte) cache line. Maybe one you don't need to read in from main RAM otherwise if you're handling a shorter string. That might make more of a performance difference than whether 23-byte strings (plus NUL byte) are short or long, depending on your application.

bregma · on Jan 31, 2020

> Do you happen to know why the standard doesn’t use that trick?

Many of the early implementations of `std::string` were COW (copy-on-write), which is not entirely compatible with SSO strings. In practice, COW strings proved to be troublesome and slow, and the SSO strings won out. The standard was changed to effectively mandate SSO strings and forbid COW strings.

bodyfour · on Jan 31, 2020

In the early days of std::string multi-threading was much rarer, so reference counting was cheaper. As CPU count grows the relative cost of needing an atomic refcount on every string also grew.

The COW strings also were a constant source of surprising behavior. As with any COW object you have to make sure that the copy happens before and write. Simple enough, but the STL-style interface does not support this well. All you had to do is call operator[] or begin() on a non-const std::string and it would force a copy.

This meant that something that to the programmer clearly looked like a read-only operation like:

    if (str[2] == 'x') { ...}

...could end up copying the string. Worse, it also meant that any previously taken references are now invalidated. This lead to horrible bugs where you would hold a pointer at the previous copy of the string (which you no longer own a refcount on) and it would almost always work... but every so often another thread would come by and reuse the memory behind your back.

So COW std::string was a long festering mess in C++. Their removal in C++11 was completely warranted.

choeger · on Jan 31, 2020

Afaik it costs you some performance for long string operations.