Reminds me of a talk a few years ago about Facebook's internal implementation of std::string. They do the same short string optimization, but Facebook manages to outperform this implementation by storing 24 bytes (vs 23) in "short string mode".
IIRC, Facebook achieves this by using the last byte as the flag byte. To signify short string mode, this flag is set 0. This allows it to also serve as the null terminator. Tricky!
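A minimal sketch of that trick (hypothetical, not Facebook's actual code): keep a 24-byte buffer whose last byte stores the *remaining* in-situ capacity rather than the size, so a full 23-character string leaves a 0 there, which is exactly the NUL terminator:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sketch of the trick described above (not Facebook's actual
// code): the LAST byte stores the remaining in-situ capacity. When the
// string is exactly 23 characters long, the remaining capacity is 0, so
// that byte doubles as the NUL terminator.
struct ShortString24 {
    char buf[24];  // up to 23 characters; last byte = spare capacity

    void set(const char* s) {
        std::size_t n = std::strlen(s);
        assert(n <= 23);                      // sketch handles only short mode
        std::memcpy(buf, s, n);
        buf[n] = '\0';
        buf[23] = static_cast<char>(23 - n);  // spare capacity; 0 when full
    }
    std::size_t size() const {
        return 23 - static_cast<unsigned char>(buf[23]);
    }
    const char* c_str() const { return buf; }
};
```

With a 23-character string the spare-capacity byte is 0, which is exactly the terminator `c_str()` needs, so no separate housekeeping byte is wasted.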
One important update: the primary rationale for us, at the time, to change std::string's implementation to `fbstring` was to get SSO (small string optimization, for those unfamiliar with the initialism). We got huge perf wins from having SSO on std::string.
This was, at the time, a violation of the standard: by its wording, SSO was not standard-compliant.
Once the standard was changed (as of C++11) so that std::strings could have SSO, we removed our patches on the standard library and now use "vanilla" std::string (from libstdc++). We still have `folly::fbstring` for people who want that implementation (which is smaller, and has more in-situ capacity, because of various cleverness detailed in that talk), but it's no longer our `std::string`.
I think it's to avoid a branch in `size()`. If you use this trick, you have to check the flag to see whether it's a small in-situ string or a "regular" heap-allocated string, because they calculate size differently. In the standard library strings, the size is always stored at the same position in memory, so `size()` has essentially no overhead when inlined; it's just a memory load.
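To illustrate the difference, here are two hypothetical layouts (sketches only; the member names and the `kHeapFlag` sentinel are assumptions, not any real library's internals). In the first, the size lives at a fixed offset and `size()` is one load; in the second, `size()` must first test the flag byte:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Fixed-position layout (libstdc++/libc++ style): size() is a single load.
struct FixedSizeLayout {
    char*       data_;
    std::size_t size_;      // always valid, short or long mode
    std::size_t capacity_;
    std::size_t size() const { return size_; }
};

// Flag-in-the-last-byte layout: size() must branch on the mode first.
struct FlagByteLayout {
    char bytes[24];  // last byte: spare capacity in short mode,
                     // or (in this sketch) a sentinel marking heap mode
    static constexpr unsigned char kHeapFlag = 0xFF;  // assumed sentinel

    bool is_short() const {
        return static_cast<unsigned char>(bytes[23]) != kHeapFlag;
    }
    std::size_t size() const {
        if (is_short())  // the branch in question
            return 23 - static_cast<unsigned char>(bytes[23]);
        std::size_t n;
        std::memcpy(&n, bytes + 8, sizeof n);  // heap mode: stored size word
        return n;
    }
};
```

Spare-capacity values can only be 0 through 23 in short mode, so any out-of-range sentinel in that byte can unambiguously mark heap mode.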
Branches on their own aren't particularly expensive (especially not an easily predictable one like this), but they do screw with several compiler optimizations (auto-vectorization, for one). Given that, it seems like a reasonable decision, and the only way you'd know which one is faster is just testing it.
I'd personally think it would be interesting to test artificially increasing the sizeof() of strings from 24 bytes (the minimum needed to store a pointer, size and capacity) to something like 64 bytes, just to get longer strings using the SSO. The trade-off in the stack-space of the string seems like it would totally be worth it, and with 56 character long "small strings", a huge number of strings could fit in there (virtually all names, for instance). You probably couldn't do it for std::string (it would wreak havoc with ABIs), but my hunch is that performance and memory usage might both benefit from the reduced memory allocations.
Yes there is some concern that it would affect code size and/or make some optimizations impossible, but on the latter I have yet to see hard data that that is the case. We're talking about adding about 1-2 cycles of latency.
Note that the branch can be implemented in a way that compilers will lower into a CMOV, making the code size issue almost moot. I implemented that in fbstring back in the day:
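A sketch of the idea (illustrative names, not fbstring's actual internals): compute both candidate sizes unconditionally, then select between two already-computed values with a ternary, which is the pattern compilers typically lower to a CMOV rather than a jump:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical string header; the field names are assumptions for this sketch.
struct StringHeader {
    std::size_t heap_size;   // valid only in heap mode
    unsigned char spare;     // in-situ spare capacity, valid in short mode
    bool is_short;
};

std::size_t size_branchless(const StringHeader& s) {
    // Both values are computed up front; the select between them has no
    // side effects, so the compiler is free to emit a conditional move.
    std::size_t short_size = 23 - static_cast<std::size_t>(s.spare);
    std::size_t long_size  = s.heap_size;
    return s.is_short ? short_size : long_size;
}
```

The win is not that the test disappears, but that a CMOV avoids a branch-predictor entry and keeps the code straight-line, which helps code size and some downstream optimizations.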
> I'd personally think it would be interesting to test artificially increasing the sizeof() of strings from 24 bytes (the minimum needed to store a pointer, size and capacity) to something like 64 bytes, just to get longer strings using the SSO
And libstdc++ is actually 32 bytes :( It's easy to template on the inline capacity (you can look at llvm::SmallVector or folly::small_vector), but vocabulary types should have the smallest possible footprint. The vast majority of instances are either empty or very small (think keys in a map).
> I'd personally think it would be interesting to test artificially increasing the sizeof() of strings from 24 bytes (the minimum needed to store a pointer, size and capacity) to something like 64 bytes
The "optimal" implementation would actually have different allocation sizes when used mostly for processing and when used mostly for storage: for processing one would prefer bigger defaults, for storage, one needs as little as possible initial overhead.
That could be achieved by surfacing that distinction and introducing (at least) two kinds of strings. And even more can be gained by recognizing that whenever strings are involved in more complex structures, additional restrictions across all the associated strings could also be exploited: allocating the strings separately is an overhead in itself.
If I remember correctly, Turbo Pascal strings had a default size of 256 bytes, and that's how much stack space each one took, and that was fast even decades ago. It was not efficient when you had a lot of them in structures and many of them could have been smaller, but that was something the programmer had to handle separately. Still, the typical processing of a limited number of strings was nicely covered by that default.
The 2^n-1 sizes are cache-line friendly and don't need explanation, I think. 27 was the maximum length of a volume name in MFS, the original Macintosh file system (which was a strange mix of backwards (no directories) and forwards (255-character file names) thinking).
Str32 was used in AppleTalk (and probably a design error or bug; Str31 is a more ‘natural’ type)
You shouldn't need a branch though. The size (stored as its delta from 23) would always be in the last byte of the buffer. It doubles as the null terminator if the size is exactly 23. The only additional operation is subtracting the stored value from the static offset to get the actual size.
It is probably just a small win and not everybody wanted to pay for the extra complexity (SSO is already fairly complex).
Re your suggestion on large SSO: years ago, when doing document clustering, using large fixed-size strings (64 chars if I remember correctly; longer strings were just truncated) was a huge win for me (even copying them around wasn't an issue). Not sure if it is appropriate for a general purpose string though.
edit: never mind, I see what you mean about having to check the size.
Yeah, exactly :) when in SSO mode, it stores the size as a delta occupying a single byte, but when in heap-allocated mode, it stores the size as an 8-byte size_t, and you need a branch to check which case you're in. You can see the branch here in the source [0], and godbolt confirms that for std::string it's just a memory load [1]. A shame Folly isn't available on Compiler Explorer, would be interesting to compare.
Just a guess: the null byte is necessarily at the end of the 24 bytes rather than at the beginning. It might be in a different (typically 64-byte) cache line. Maybe one you don't need to read in from main RAM otherwise if you're handling a shorter string. That might make more of a performance difference than whether 23-byte strings (plus NUL byte) are short or long, depending on your application.
> Do you happen to know why the standard doesn’t use that trick?
Many of the early implementations of `std::string` were COW (copy-on-write), which is not entirely compatible with SSO strings. In practice, COW strings proved to be troublesome and slow, and the SSO strings won out. The standard was changed to effectively mandate SSO strings and forbid COW strings.
In the early days of std::string, multi-threading was much rarer, so reference counting was cheaper. As CPU counts grew, the relative cost of needing an atomic refcount on every string grew as well.
The COW strings were also a constant source of surprising behavior. As with any COW object, you have to make sure that the copy happens before any write. Simple enough, but the STL-style interface does not support this well. All you had to do was call operator[] or begin() on a non-const std::string and it would force a copy.
This meant that something that to the programmer clearly looked like a read-only operation like:
if (str[2] == 'x') { ...}
...could end up copying the string. Worse, it also meant that any previously taken references are now invalidated. This led to horrible bugs where you would hold a pointer at the previous copy of the string (which you no longer own a refcount on) and it would almost always work... but every so often another thread would come by and reuse the memory behind your back.
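A toy copy-on-write string (hypothetical, not any real implementation) makes the pitfall concrete: the non-const operator[] cannot tell a read from a write, so it must unshare, and any pointers previously taken into the buffer go stale:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy COW string illustrating the pitfall described above. Names and
// structure are assumptions for this sketch only.
class CowString {
    std::shared_ptr<std::string> rep_;
public:
    explicit CowString(const char* s)
        : rep_(std::make_shared<std::string>(s)) {}
    CowString(const CowString&) = default;  // "copy" just bumps a refcount

    // Non-const access: must assume a write is coming, so unshare first.
    // (A real COW string also has to make this check thread-safe, which is
    // where the atomic-refcount cost comes in.)
    char& operator[](std::size_t i) {
        if (rep_.use_count() > 1)  // shared: make a private deep copy
            rep_ = std::make_shared<std::string>(*rep_);
        return (*rep_)[i];
    }
    const char* data() const { return rep_->data(); }
};
```

Even a "read" through the non-const operator[] detaches the string, so a pointer saved from `data()` beforehand now points into a buffer the string no longer owns a reference to.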
So COW std::string was a long festering mess in C++. Their removal in C++11 was completely warranted.
It’s funny that every company gets to the point where they have to have their own std::string. I wonder if Facebook also traces their need back to the time when GNU std::string was copy-on-write and therefore completely useless. Google, I think, can trace their std::string to those dark days. Luckily c++11 effectively outlawed COW string implementations, so nobody need be subject to that horror show again.
The language is fundamentally too weak to express the kind of reference and borrowing relationships demanded by CoW.
Suppose I have a string and I take a reference to a character:
std::string a = f(); // shared
char& c = a.front();
Should the string be copied or not? One might argue that no, taking a reference is different from "writing", so no copying; but one might also argue that the reference can be used to change the string later, and the string doesn't know when this would happen, so it has to pessimistically make a copy.
You get bugs if you go with the former, and bad performance if you go with the latter.
Considering jq's libjq's jv[1] C API is CoW and written in C, I'm a bit disinclined to believe that C++ is too weak to express a reasonable CoW API for strings. Instead, I suspect that std::string was too weak a design for that.
The language is too weak to express it safely. CoW on mutable strings in C and C++ necessarily implies safety tradeoffs (or drastically performance-degrading ones). By contrast, `jv` types are immutable, so they can presumably implement CoW without compromising memory safety or correctness.
But `jv` is written in C. It requires that the programmer adhere to a convention, and that is probably what you mean by "too weak" -- and I would agree with that if that is what you mean.
All APIs have conventions though, and jv's is quite simple. It's very difficult to get it wrong and not crash immediately, though it's easy enough to get it wrong and leak. Having (unfortunately) had to use C a lot, I must say that the jv API is among the nicest I've had the pleasure of using. Sure, it's not Rust, but neither is C++, and yet that's the topic of this thread. But I'd use a jv-like API in C++, and I think that would be peachy.
Unfortunately the STL has an overload for the non-const version, which causes this problem. It's hard to imagine the cases where you'd need a mutable reference to a single character. A char fits into the same register the reference itself would occupy, so even as an immutable reference it doesn't make much sense.
That's not an answer. The STL has to have a non-const version that returns some kind of reference. Because C++ strings are mutable and even something like
s[3] = 'h';
involves forming a mutable reference and then assigning its pointee.
The way I think about it, to make CoW strings really usable in C++, you basically have to make strings immutable, like many newer languages like Python. That ship has sailed by the time standardization for C++ happened.
Atomic operations are needed on multi-core, and anything that leaks a non-const reference to raw memory (like casual read-only use of operator[], .begin(), or .end()) forces a copy on shared objects and prevents the object from being shared in the future.
In addition to what other people have said, the move semantics introduced in C++11 made a lot of the benefit of CoW moot: the point of CoW is to have a cheap copy operation, but most of those cheap copy operations can be replaced by even cheaper moves, often automatically by the compiler.
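A small illustration of that point (names are mine, not from the thread): where COW made copies cheap by sharing, a C++11 move makes the same hand-off cheap by transferring ownership, with no refcount at all:

```cpp
#include <string>
#include <utility>
#include <vector>

// Where COW would have shared the buffer on copy, a move simply transfers
// it: no characters are copied and no reference count is touched.
std::vector<std::string> make_lines() {
    std::vector<std::string> lines;
    std::string buf(1000, 'x');       // long enough to be heap-allocated
    lines.push_back(std::move(buf));  // transfers the buffer, no deep copy
    return lines;                     // the return moves (or elides) again
}
```

The caller ends up owning the heap buffer that `buf` allocated, and `buf` is left in a valid but unspecified state; none of the 1000 characters were copied along the way.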
A copy-on-write string is pretty similar to a shared_ptr to char. Like a shared_ptr, a shared string has to check whether it is the last referent when it goes out of scope. This introduces a point of contention on multi-core systems. A string like the one described in this article might not have to do anything in its destructor, whereas a shared string will need to atomically decrement a reference count. That’s a huge difference in cost.
The benefit of a shared string is you might save some space. That's fine, but I think most programmers today will choose an explicitly shared type when they need it, and a fast string for most cases.
From what I've heard - COW means each string data may be pointed to by multiple objects, potentially being used by different threads. So every string operation requires thread synchronization, which can be quite expensive.
It's a tradeoff. In libc++'s version you still have string size stored in the top 7 bits so you just need a bitshift to get size. It sounds like fb's implementation would require looping until null terminator to get the size.
If you use the LSB of the last byte for a flag, you're using the 56th bit of one of the {data,size,cap} words. I'd use the MSB in this case, since you can shift that off easier.
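To make the two options concrete (a sketch of the bit layout being discussed, not any particular library's exact encoding): with the flag in the LSB you shift the size down, with the flag in the MSB you mask it off:

```cpp
// Pack a 7-bit small-string size into one byte alongside an "is long" flag.
// Two possible encodings, per the discussion above.

// Flag in the LSB: the size occupies the upper 7 bits, so shift it off.
unsigned size_from_lsb_flag(unsigned char b) {
    return b >> 1;
}

// Flag in the MSB: mask it away to recover the size.
unsigned size_from_msb_flag(unsigned char b) {
    return b & 0x7F;
}
```

Either way it is one cheap ALU operation; the difference the comment above points at is which bit of the adjacent size/capacity word you sacrifice, and how naturally the remaining bits line up.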
To save those who, like me, were going to comment 'that would violate the standard because std::string::size is required to be O(1) complexity' a Google - the standard recommends but doesn't require that.
Facebook's implementation would still be constant time for short strings, because there's a constant which bounds the runtime.
Though I hear that the definition of big-O notation has shifted a bit in Silicon Valley these days so maybe that answer would get me in trouble in an interview.
It's true that big-O notation only concerns behavior with large N, but it's a bit disingenuous to say that the loop executes a constant number of times -- by that argument, you could say that if you implemented size() by strlen() it's O(1) because the string must be less than 2^64 bytes long on a 64-bit machine.
So I can see why someone would claim that implementing size() via strlen() "only" for small strings shouldn't be considered O(1), because strlen() is O(n) and within that class of strings the runtime is increasing as the length increases.
That is clever. Using the last byte as the flag byte also lets one cast the std::string memory address to a null-terminated C string without intermediate buffers (not that one should).
Umm.. in any implementation you can already "cast" a std::string to a null-terminated string without a buffer (cf std::string::data() and std::string::c_str()) -- C++11 essentially mandates that. c_str must return "a pointer p such that p + i == &operator[](i) for each i in [0,size()]."
The history of those demonstrates some of the bat-shit insanity of the evolution of C++. Originally data() didn't have to be null-terminated, but then they finally realized that making c_str() actually work without this invariant was, in general, nearly impossible. So since C++11, data() and c_str() are equivalent.
Edit: by impossible I really mean something usable with sane complexity requirements -- that wouldn't make people just immediately discard std::string for something better. This is the same C++ that brought you auto_ptr, and other garbage. C++11 corrected many things, this included.
It was possible with copy on write strings (e.g. libstdc++'s strings until C++11), but C++11 added complexity requirements that outlawed COW strings, because they're only really useful in very specific circumstances, at the cost of making the common case slow.
Imo this is probably in part because you can mix the two more easily? I imagine you could link against C++03 code compiled with a compiler understanding the new ABI?
There was no need to plug in the NUL until somebody called c_str(). That possibility was lost in C++11. Now, if somebody plugs something else there (which is done, UB notwithstanding), c_str() produces the wrong answer.
NUL-terminated strings were one of the dumber things C++ inherited from C, and code that depends on NUL termination is not a thing to be proud of.
How do you "plug in" the NUL? If there's no space for it, then every time you want a C string, you need to do an allocation. And why would you put something in its place? As you said, if you modify the c_str that was always UB, so you pretty much lost all guarantees at that point. That was just dumb.
Basically the NUL-terminator amounts to a 1-byte of wasted space -- which is almost always completely in the noise -- if you're using std::string for tiny strings you're paying at least 24 bytes anyway. There are just very few cases where that one extra byte matters.
I think the overwhelming opinion is that this 1-byte trade-off is better than the overhead of allocation and copying when you want to pass that string around. NUL-terminated strings aren't important in C++ because of C (well, not directly), but because they are essentially the ABI of countless existing libraries and the major operating systems.
> code that depends on NUL termination is not a thing to be proud of.
Whatever. Code that depends on NUL termination is ubiquitous.
>How do you "plug in" the NUL? If there's no space for it, then every time you want a C string, you need to do an allocation.
You have not thought it through. There was always space for a NUL, but nobody needs it to have a NUL in it until somebody calls c_str(). Anyway, that was true until somebody jiggered the Standard, just before 1998, to require a NUL visible to s[s.size()].
>C Code that depends on NUL termination is ubiquitous.
> Anyway, that was true until somebody jiggered the Standard, just before 1998, to require a NUL visible to s[s.size()].
This is not technically true. You could plug in your hated NUL byte in the implementation of operator[].
> C Code that depends on NUL termination is ubiquitous.
No, that's not what I said.
It really doesn't matter what language underlying the API of all the code that expects NUL terminated strings is written in (a lot of it is C++ of course too). Windows, MacOS, and all POSIX-ish systems have a large API that consumes NUL terminated strings. NUL terminated strings are ubiquitous in computing at this point. Sure blame C from 40 years ago and burn a dmr effigy, I don't care, but that battle was long lost.
NUL terminated strings may be terrible, but the C++ accommodation for them is not; it's a well-thought-out trade-off. My understanding is that neither Go nor Rust makes this trade-off. Go's FFI and syscall overhead has always been something of a performance disaster anyway. C++ has always had a greater demand to "play nice" with legacy code and systems than either of those.
The overhead of just having the NUL-byte is almost always a non-factor. If it really is, then use a byte vector.
Buggering op[] would be the slowest possible way to get the NUL there. Some of us would have liked for string operations not to be as absolutely slow as can be achieved. Evidently that is a minority preference.
If you think anybody is complaining about the extra space for the NUL, what you are barking up is not even a tree.
Are you saying that there should not be an extra 0 byte at the end until the string fills up and c_str() is called? Inconsistent allocations seem like a recipe for difficult to debug performance problems.