Reminds me of a talk a few years ago about Facebook's internal implementation of std::string. They do the same short string optimization, but Facebook manages to outperform this implementation by storing 24 bytes (vs 23) in "short string mode".
IIRC, Facebook achieves this by using the last byte as the flag byte. To signify short string mode, this flag is set 0. This allows it to also serve as the null terminator. Tricky!
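A minimal sketch of that trick (hypothetical, not Facebook's actual code): keep a 24-byte buffer whose last byte stores the *remaining* in-situ capacity rather than the size, so a full 23-character string leaves a 0 there, which is exactly the NUL terminator:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sketch of the trick described above (not Facebook's actual
// code): the LAST byte stores the remaining in-situ capacity. When the
// string is exactly 23 characters long, the remaining capacity is 0, so
// that byte doubles as the NUL terminator.
struct ShortString24 {
    char buf[24];  // up to 23 characters; last byte = spare capacity

    void set(const char* s) {
        std::size_t n = std::strlen(s);
        assert(n <= 23);                      // sketch handles only short mode
        std::memcpy(buf, s, n);
        buf[n] = '\0';
        buf[23] = static_cast<char>(23 - n);  // spare capacity; 0 when full
    }
    std::size_t size() const {
        return 23 - static_cast<unsigned char>(buf[23]);
    }
    const char* c_str() const { return buf; }
};
```

With a 23-character string the spare-capacity byte is 0, which is exactly the terminator `c_str()` needs, so no separate housekeeping byte is wasted.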
One important update: the primary rationale for us, at the time, to change std::string's implementation to `fbstring` was to get SSO (small string optimization, for those unfamiliar with the initialism). We got huge perf wins from having SSO on std::string.
This was, at the time, a violation of the standard: by its wording, SSO was not standard-compliant.
Once the standard was changed (as of C++11) so that std::strings could have SSO, we removed our patches on the standard library and now use "vanilla" std::string (from libstdc++). We still have `folly::fbstring` for people who want that implementation (which is smaller, and has more in-situ capacity, because of various cleverness detailed in that talk), but it's no longer our `std::string`.
I think it's to avoid a branch in `size()`. If you use this trick, you have to check the flag to see whether it's a small in-situ string or a "regular" heap-allocated string, because they calculate size differently. In the standard library strings, the size is always stored at the same position in memory, so `size()` has essentially no overhead when inlined; it's just a memory load.
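To illustrate the difference, here are two hypothetical layouts (sketches only; the member names and the `kHeapFlag` sentinel are assumptions, not any real library's internals). In the first, the size lives at a fixed offset and `size()` is one load; in the second, `size()` must first test the flag byte:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Fixed-position layout (libstdc++/libc++ style): size() is a single load.
struct FixedSizeLayout {
    char*       data_;
    std::size_t size_;      // always valid, short or long mode
    std::size_t capacity_;
    std::size_t size() const { return size_; }
};

// Flag-in-the-last-byte layout: size() must branch on the mode first.
struct FlagByteLayout {
    char bytes[24];  // last byte: spare capacity in short mode,
                     // or (in this sketch) a sentinel marking heap mode
    static constexpr unsigned char kHeapFlag = 0xFF;  // assumed sentinel

    bool is_short() const {
        return static_cast<unsigned char>(bytes[23]) != kHeapFlag;
    }
    std::size_t size() const {
        if (is_short())  // the branch in question
            return 23 - static_cast<unsigned char>(bytes[23]);
        std::size_t n;
        std::memcpy(&n, bytes + 8, sizeof n);  // heap mode: stored size word
        return n;
    }
};
```

Spare-capacity values can only be 0 through 23 in short mode, so any out-of-range sentinel in that byte can unambiguously mark heap mode.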
Branches on their own aren't particularly expensive (especially not an easily predictable one like this), but they do screw with several compiler optimizations (auto-vectorization, for one). Given that, it seems like a reasonable decision, and the only way you'd know which one is faster is just testing it.
I'd personally think it would be interesting to test artificially increasing the sizeof() of strings from 24 bytes (the minimum needed to store a pointer, size and capacity) to something like 64 bytes, just to get longer strings using the SSO. The trade-off in the stack-space of the string seems like it would totally be worth it, and with 56 character long "small strings", a huge number of strings could fit in there (virtually all names, for instance). You probably couldn't do it for std::string (it would wreak havoc with ABIs), but my hunch is that performance and memory usage might both benefit from the reduced memory allocations.
Yes there is some concern that it would affect code size and/or make some optimizations impossible, but on the latter I have yet to see hard data that that is the case. We're talking about adding about 1-2 cycles of latency.
Note that the branch can be implemented in a way that compilers will lower into a CMOV, making the code size issue almost moot. I implemented that in fbstring back in the day:
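A sketch of the idea (illustrative names, not fbstring's actual internals): compute both candidate sizes unconditionally, then select between two already-computed values with a ternary, which is the pattern compilers typically lower to a CMOV rather than a jump:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical string header; the field names are assumptions for this sketch.
struct StringHeader {
    std::size_t heap_size;   // valid only in heap mode
    unsigned char spare;     // in-situ spare capacity, valid in short mode
    bool is_short;
};

std::size_t size_branchless(const StringHeader& s) {
    // Both values are computed up front; the select between them has no
    // side effects, so the compiler is free to emit a conditional move.
    std::size_t short_size = 23 - static_cast<std::size_t>(s.spare);
    std::size_t long_size  = s.heap_size;
    return s.is_short ? short_size : long_size;
}
```

The win is not that the test disappears, but that a CMOV avoids a branch-predictor entry and keeps the code straight-line, which helps code size and some downstream optimizations.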
> I'd personally think it would be interesting to test artificially increasing the sizeof() of strings from 24 bytes (the minimum needed to store a pointer, size and capacity) to something like 64 bytes, just to get longer strings using the SSO
And libstdc++ is actually 32 bytes :( It's easy to template on the inline capacity (you can look at llvm::SmallVector or folly::small_vector), but vocabulary types should have the smallest possible footprint. The vast majority of instances are either empty or very small (think keys in a map).
> I'd personally think it would be interesting to test artificially increasing the sizeof() of strings from 24 bytes (the minimum needed to store a pointer, size and capacity) to something like 64 bytes
The "optimal" implementation would actually have different allocation sizes when used mostly for processing and when used mostly for storage: for processing one would prefer bigger defaults, for storage, one needs as little as possible initial overhead.
That could be achieved by surfacing that distinction and introducing (at least) two kinds of strings. And even more can be gained by recognizing that whenever strings are involved in more complex structures, additional restrictions across all the associated strings could also be exploited: allocating the strings separately is an overhead in itself.
If I remember correctly, Turbo Pascal strings had a default size of 256 bytes, and that's how much stack space each one took, and that was fast even decades ago. It was not efficient when you had a lot of them in structures and many of them could have been smaller, but that was something the programmer had to handle separately. Still, the typical processing of a limited number of strings was nicely covered by that default.
The 2^n-1 sizes are cache-line friendly and don't need explanation, I think. 27 was the maximum length of a volume name in MFS, the original Macintosh file system (which was a strange mix of backwards (no directories) and forwards (255-character file names) thinking).
Str32 was used in AppleTalk (and probably a design error or bug; Str31 is a more ‘natural’ type)
You shouldn't need a branch though. The size (stored as its delta from 23) would always be in the last byte of the buffer. It doubles as the null terminator if the size is exactly 23. The only additional operation is subtracting the stored value from the static offset to get the actual size.
It is probably just a small win and not everybody wanted to pay for the extra complexity (SSO is already fairly complex).
Re your suggestion on large SSO: years ago, when doing document clustering, using large fixed-size strings (64 chars if I remember correctly; longer strings were just truncated) was a huge win for me (even copying them around wasn't an issue). Not sure if it is appropriate for a general purpose string though.
edit: never mind, I see what you mean about having to check the size.
Yeah, exactly :) when in SSO mode, it stores the size as a delta occupying a single byte, but when in heap-allocated mode, it stores the size as an 8-byte size_t, and you need a branch to check which case you're in. You can see the branch here in the source [0], and godbolt confirms that for std::string it's just a memory load [1]. A shame Folly isn't available on Compiler Explorer, would be interesting to compare.
Just a guess: the null byte is necessarily at the end of the 24 bytes rather than at the beginning. It might be in a different (typically 64-byte) cache line. Maybe one you don't need to read in from main RAM otherwise if you're handling a shorter string. That might make more of a performance difference than whether 23-byte strings (plus NUL byte) are short or long, depending on your application.
> Do you happen to know why the standard doesn’t use that trick?
Many of the early implementations of `std::string` were COW (copy-on-write), which is not entirely compatible with SSO strings. In practice, COW strings proved to be troublesome and slow, and the SSO strings won out. The standard was changed to effectively mandate SSO strings and forbid COW strings.
In the early days of std::string, multi-threading was much rarer, so reference counting was cheaper. As CPU counts grew, the relative cost of needing an atomic refcount on every string grew as well.
The COW strings were also a constant source of surprising behavior. As with any COW object, you have to make sure that the copy happens before any write. Simple enough, but the STL-style interface does not support this well. All you had to do was call operator[] or begin() on a non-const std::string and it would force a copy.
This meant that something that to the programmer clearly looked like a read-only operation like:
if (str[2] == 'x') { ...}
...could end up copying the string. Worse, it also meant that any previously taken references are now invalidated. This led to horrible bugs where you would hold a pointer at the previous copy of the string (which you no longer own a refcount on) and it would almost always work... but every so often another thread would come by and reuse the memory behind your back.
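A toy copy-on-write string (hypothetical, not any real implementation) makes the pitfall concrete: the non-const operator[] cannot tell a read from a write, so it must unshare, and any pointers previously taken into the buffer go stale:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy COW string illustrating the pitfall described above. Names and
// structure are assumptions for this sketch only.
class CowString {
    std::shared_ptr<std::string> rep_;
public:
    explicit CowString(const char* s)
        : rep_(std::make_shared<std::string>(s)) {}
    CowString(const CowString&) = default;  // "copy" just bumps a refcount

    // Non-const access: must assume a write is coming, so unshare first.
    // (A real COW string also has to make this check thread-safe, which is
    // where the atomic-refcount cost comes in.)
    char& operator[](std::size_t i) {
        if (rep_.use_count() > 1)  // shared: make a private deep copy
            rep_ = std::make_shared<std::string>(*rep_);
        return (*rep_)[i];
    }
    const char* data() const { return rep_->data(); }
};
```

Even a "read" through the non-const operator[] detaches the string, so a pointer saved from `data()` beforehand now points into a buffer the string no longer owns a reference to.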
So COW std::string was a long festering mess in C++. Their removal in C++11 was completely warranted.
It’s funny that every company gets to the point where they have to have their own std::string. I wonder if Facebook also traces their need back to the time when GNU std::string was copy-on-write and therefore completely useless. Google, I think, can trace their std::string to those dark days. Luckily c++11 effectively outlawed COW string implementations, so nobody need be subject to that horror show again.
The language is fundamentally too weak to express the kind of reference and borrowing relationships demanded by CoW.
Suppose I have a string and I take a reference to a character:
std::string a = f(); // shared
char& c = a.front();
Should the string be copied or not? One might argue that no, taking a reference is different from "writing", so no copying; but one might also argue that the reference can be used to change the string later, and the string doesn't know when this would happen, so it has to pessimistically make a copy.
You get bugs if you go with the former, and bad performance if you go with the latter.
Considering jq's libjq's jv[1] C API is CoW and written in C, I'm a bit disinclined to believe that C++ is too weak to express a reasonable CoW API for strings. Instead, I suspect that std::string was too weak a design for that.
The language is too weak to express it safely. CoW on mutable strings in C and C++ necessarily implies safety tradeoffs (or drastically performance-degrading ones). By contrast, `jv` types are immutable, so they can presumably implement CoW without compromising memory safety or correctness.
But `jv` is written in C. It requires that the programmer adhere to a convention, and that is probably what you mean by "too weak" -- and I would agree with that if that is what you mean.
All APIs have conventions though, and jv's is quite simple. It's very difficult to get it wrong and not crash immediately, though it's easy enough to get it wrong and leak. Having (unfortunately) had to use C a lot, I must say that the jv API is among the nicest I've had the pleasure of using. Sure, it's not Rust, but neither is C++, and yet that's the topic of this thread. But I'd use a jv-like API in C++, and I think that would be peachy.
Unfortunately the STL has an overload for the non-const version, which causes this problem. It's hard to imagine the cases where you'd need a mutable reference to a single character. A char fits into the same register the reference itself would occupy, so even as an immutable reference it doesn't make much sense.
That's not an answer. The STL has to have a non-const version that returns some kind of reference. Because C++ strings are mutable and even something like
s[3] = 'h';
involves forming a mutable reference and then assigning its pointee.
The way I think about it, to make CoW strings really usable in C++, you basically have to make strings immutable, like many newer languages like Python. That ship has sailed by the time standardization for C++ happened.
Atomic operations are needed on multi-core, and anything that leaks a non-const reference to raw memory (like casual read-only use of operator[], .begin(), or .end()) forces a copy on shared objects and prevents the object from being shared in the future.
In addition to what other people have said, the move semantics introduced in C++11 made a lot of the benefit of CoW moot: the point of CoW is to have a cheap copy operation, but most of those cheap copy operations can be replaced by even cheaper moves, often automatically by the compiler.
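A small illustration of that point (names are mine, not from the thread): where COW made copies cheap by sharing, a C++11 move makes the same hand-off cheap by transferring ownership, with no refcount at all:

```cpp
#include <string>
#include <utility>
#include <vector>

// Where COW would have shared the buffer on copy, a move simply transfers
// it: no characters are copied and no reference count is touched.
std::vector<std::string> make_lines() {
    std::vector<std::string> lines;
    std::string buf(1000, 'x');       // long enough to be heap-allocated
    lines.push_back(std::move(buf));  // transfers the buffer, no deep copy
    return lines;                     // the return moves (or elides) again
}
```

The caller ends up owning the heap buffer that `buf` allocated, and `buf` is left in a valid but unspecified state; none of the 1000 characters were copied along the way.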
A copy-on-write string is pretty similar to a shared_ptr to char. Like a shared_ptr, a shared string has to check whether it is the last referent when it goes out of scope. This introduces a point of contention on multi-core systems. A string like the one described in this article might not have to do anything in its destructor, whereas a shared string will need to atomically decrement a reference count. That’s a huge difference in cost.
The benefit of a shared string is you might save some space. That's fine, but I think most programmers today will choose an explicitly shared type when they need it, and a fast string for most cases.
From what I've heard - COW means each string data may be pointed to by multiple objects, potentially being used by different threads. So every string operation requires thread synchronization, which can be quite expensive.
It's a tradeoff. In libc++'s version you still have string size stored in the top 7 bits so you just need a bitshift to get size. It sounds like fb's implementation would require looping until null terminator to get the size.
If you use the LSB of the last byte for a flag, you're using the 56th bit of one of the {data,size,cap} words. I'd use the MSB in this case, since you can shift that off easier.
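To make the two options concrete (a sketch of the bit layout being discussed, not any particular library's exact encoding): with the flag in the LSB you shift the size down, with the flag in the MSB you mask it off:

```cpp
// Pack a 7-bit small-string size into one byte alongside an "is long" flag.
// Two possible encodings, per the discussion above.

// Flag in the LSB: the size occupies the upper 7 bits, so shift it off.
unsigned size_from_lsb_flag(unsigned char b) {
    return b >> 1;
}

// Flag in the MSB: mask it away to recover the size.
unsigned size_from_msb_flag(unsigned char b) {
    return b & 0x7F;
}
```

Either way it is one cheap ALU operation; the difference the comment above points at is which bit of the adjacent size/capacity word you sacrifice, and how naturally the remaining bits line up.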
To save those who, like me, were going to comment 'that would violate the standard because std::string::size is required to be O(1) complexity' a Google - the standard recommends but doesn't require that.
Facebook's implementation would still be constant time for short strings, because there's a constant which bounds the runtime.
Though I hear that the definition of big-O notation has shifted a bit in Silicon Valley these days so maybe that answer would get me in trouble in an interview.
It's true that big-O notation only concerns behavior with large N, but it's a bit disingenuous to say that the loop executes a constant number of times -- by that argument, you could say that if you implemented size() by strlen() it's O(1) because the string must be less than 2^64 bytes long on a 64-bit machine.
So I can see why someone would claim that implementing size() via strlen() "only" for small strings shouldn't be considered O(1), because strlen() is O(n) and within that class of strings the runtime is increasing as the length increases.
That is clever. Using the last byte as the flag byte also lets one cast the std::string memory address to a null-terminated C string without intermediate buffers (not that one should).
Umm.. in any implementation you can already "cast" a std::string to a null-terminated string without a buffer (cf std::string::data() and std::string::c_str()) -- C++11 essentially mandates that. c_str must return "a pointer p such that p + i == &operator[](i) for each i in [0,size()]."
The history of those demonstrates some of the bat-shit insanity of the evolution of C++. Originally data() didn't have to be null-terminated, but then they finally realized that making c_str() actually work without this invariant was, in general, nearly impossible. So since C++11, data() and c_str() are equivalent.
Edit: by impossible I really mean something usable with sane complexity requirements -- that wouldn't make people just immediately discard std::string for something better. This is the same C++ that brought you auto_ptr, and other garbage. C++11 corrected many things, this included.
It was possible with copy on write strings (e.g. libstdc++'s strings until C++11), but C++11 added complexity requirements that outlawed COW strings, because they're only really useful in very specific circumstances, at the cost of making the common case slow.
Imo this is probably in part because you can mix the two more easily? I imagine you could link against C++03 code compiled with a compiler understanding the new ABI?
There was no need to plug in the NUL until somebody called c_str(). That possibility was lost in C++11. Now, if somebody plugs something else there (which is done, UB notwithstanding), c_str() produces the wrong answer.
NUL-terminated strings were one of the dumber things C++ inherited from C, and code that depends on NUL termination is not a thing to be proud of.
How do you "plug in" the NUL? If there's no space for it, then every time you want a C string, you need to do an allocation. And why would you put something in its place? As you said, if you modify the c_str that was always UB, so you pretty much lost all guarantees at that point. That was just dumb.
Basically the NUL-terminator amounts to a 1-byte of wasted space -- which is almost always completely in the noise -- if you're using std::string for tiny strings you're paying at least 24 bytes anyway. There are just very few cases where that one extra byte matters.
I think the overwhelming opinion is that this 1-byte trade-off is better than the overhead of allocation and copying when you want to pass that string around. NUL-terminated strings aren't important in C++ because of C (well, not directly), but because they are essentially the ABI of countless existing libraries and the major operating systems.
> code that depends on NUL termination is not a thing to be proud of.
Whatever. Code that depends on NUL termination is ubiquitous.
>How do you "plug in" the NUL? If there's no space for it, then every time you want a C string, you need to do an allocation.
You have not thought it through. There was always space for a NUL, but nobody needs it to have a NUL in it until somebody calls c_str(). Anyway, that was true until somebody jiggered the Standard, just before 1998, to require a NUL visible to s[s.size()].
>C Code that depends on NUL termination is ubiquitous.
> Anyway, that was true until somebody jiggered the Standard, just before 1998, to require a NUL visible to s[s.size()].
This is not technically true. You could plug in your hated NUL byte in the implementation of operator[].
> C Code that depends on NUL termination is ubiquitous.
No, that's not what I said.
It really doesn't matter what language underlying the API of all the code that expects NUL terminated strings is written in (a lot of it is C++ of course too). Windows, MacOS, and all POSIX-ish systems have a large API that consumes NUL terminated strings. NUL terminated strings are ubiquitous in computing at this point. Sure blame C from 40 years ago and burn a dmr effigy, I don't care, but that battle was long lost.
NUL terminated strings may be terrible, but the C++ accommodation for them is not; it's a well-thought-out trade-off. My understanding is that neither Go nor Rust makes this trade-off. Go's FFI and syscall overhead has always been something of a performance disaster anyway. C++ has always had a greater demand to "play nice" with legacy code and systems than either of those.
The overhead of just having the NUL-byte is almost always a non-factor. If it really is, then use a byte vector.
Buggering op[] would be the slowest possible way to get the NUL there. Some of us would have liked for string operations not to be as absolutely slow as can be achieved. Evidently that is a minority preference.
If you think anybody is complaining about the extra space for the NUL, what you are barking up is not even a tree.
Are you saying that there should not be an extra 0 byte at the end until the string fills up and c_str() is called? Inconsistent allocations seem like a recipe for difficult to debug performance problems.