Note that this implementation has undefined behavior according to the standard, because it reads union members that haven't been written to. The code that tests whether the string is long or short accesses members of the union unconditionally.
This is fine because std::string is provided by the standard library and the standard library is allowed to do stuff normal libraries are not allowed to do. This technique does work in practice, but it is technically undefined behavior.
I'm not very familiar with the C++ standard, but in the C standard a load from a different union member than the last store isn't undefined behavior; it's unspecified behavior. That's a big difference. So long as the stored bytes don't form a trap representation when reinterpreted as the new type (and most architectures don't have such representations anymore), you can rely on the behavior. You'll just need to refer to the compiler documentation to understand the values reliably produced.
Moreover, there are additional guarantees when reading through char types, as well as unsigned types, so depending on the precise code things may very well be totally well defined.
C doesn't disallow type punning. The gotchas mostly have to do with visibility to the compiler; without that visibility, the compiler may reorder loads and stores. The safest way to do type punning is through union members, as the standard makes additional guarantees that restrict the types of optimizations a compiler can make.
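Since the thread is about C++, where the union route is the contested one, here is a minimal sketch of two routes that are well defined in C++ as well: reading an object's bytes through unsigned char, and copying the bytes with std::memcpy. The helper names (byte_at, byte_sum, float_bits) are made up for illustration, and the sketch assumes float is a 32-bit IEEE-754 type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: read byte i of an object's representation.
// Access through unsigned char is permitted for any object type.
template <typename T>
unsigned char byte_at(const T& obj, std::size_t i) {
    assert(i < sizeof(T));
    return reinterpret_cast<const unsigned char*>(&obj)[i];
}

// Sum of all bytes of the object; the result does not depend on byte order.
template <typename T>
unsigned byte_sum(const T& obj) {
    unsigned sum = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) sum += byte_at(obj, i);
    return sum;
}

// Hypothetical helper: view a float's bits as an integer by copying the bytes.
inline std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    static_assert(std::numeric_limits<float>::is_iec559,
                  "sketch assumes IEEE-754 floats");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Under those assumptions, float_bits(1.0f) yields 0x3F800000, and byte_sum of the 32-bit value 0x01020304 is 10 on any byte order.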
>> Note that this implementation has undefined behavior according to the standard because union members are accessed that haven't been written to.
Except when the union member you are reading is a 'common initial sequence' [1] of the union member that was written, to be precise ;-). But that's not the case here.
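For anyone who hasn't met that rule, a minimal sketch of a layout where it would apply. LongRep, ShortRep, Rep and the helper functions are invented for this example and are not libc++'s representation; the point is only that a discriminator placed first in both structs, with the same type, may be read through either union member.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical representation, invented for this sketch.
struct LongRep {
    unsigned char is_long;  // discriminator, first member in both structs
    std::size_t size;
    const char* data;       // points at storage owned elsewhere
};

struct ShortRep {
    unsigned char is_long;  // same type, same position: common initial sequence
    unsigned char size;
    char data[14];          // inline buffer
};

static_assert(std::is_standard_layout<LongRep>::value,
              "the rule only covers standard-layout structs");
static_assert(std::is_standard_layout<ShortRep>::value,
              "the rule only covers standard-layout structs");

union Rep {
    LongRep l;
    ShortRep s;
    explicit Rep(const LongRep& x) : l(x) {}
    explicit Rep(const ShortRep& x) : s(x) {}
};

// Reads s.is_long even when l is the active member. This is permitted
// because is_long belongs to the common initial sequence of both structs.
inline bool rep_is_long(const Rep& r) { return r.s.is_long != 0; }

inline Rep make_short(const char* text) {
    const std::size_t len = std::strlen(text);
    assert(len <= sizeof(ShortRep::data));  // must fit the inline buffer
    ShortRep s{};
    s.size = static_cast<unsigned char>(len);
    std::memcpy(s.data, text, len);
    return Rep(s);
}

inline Rep make_long(const char* text) {
    return Rep(LongRep{1, std::strlen(text), text});
}
```

The cost of this layout is a whole byte spent on the tag in both representations, which is presumably why a space-conscious string would not want it.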
Undefined behavior in C++ and C means that it's up to the implementor to decide what happens in a situation. Usually it ends up being whatever is most convenient or most performant on the target architecture.
One consequence is that portable code can't depend on any particular behavior because it can be different between implementations. Every implementation will do something, but each one may do something completely different.
Another consequence is that "undefined behavior" isn't undefined in the context of a specific implementation because you can look at what it does and see how it's defined in that implementation. In this case, libc++ is essentially part of the implementation, so it's fair game for it to depend on implementation details of clang.
> Undefined behavior in C++ and C means that it's up to the implementor to decide what happens in a situation
> Every implementation will do something, but each one may do something completely different.
If I'm reading this correctly, you're saying that we can depend on each compiler providing a consistent way of handling each kind of undefined behaviour.
That's not correct. That describes implementation-defined behaviour, which is different. [0]
Compilers do not have to decide what behaviour should result from a particular kind of undefined behaviour, and then commit to ensuring that behaviour occurs consistently. That's the point of undefined behaviour: the compiler is permitted to assume the absence of undefined behaviour, and to optimise accordingly.
If you have undefined behaviour in your C++ code, you are not guaranteed to see consistent program behaviour. Your program is ill-formed. All bets are off, throughout the entire lifetime of your program. [1] (In C++, undefined behaviour can 'travel back in time', meaning that if your program invokes undefined behaviour, the behaviour across the entire lifetime of your program is made undefined.)
A compiler may choose to commit to a certain behaviour for a certain type of undefined behaviour (such as guaranteeing wrap-around behaviour for signed overflow), but it is not required to.
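To make the signed-overflow case concrete, a minimal sketch; the function names are made up for illustration. An optimiser is allowed to fold the first function to "return true", because the only input for which that would be wrong is one where the program has already overflowed. GCC and Clang document a -fwrapv option that commits them to wrap-around instead, but that is an opt-in guarantee from those compilers, not something the language requires.

```cpp
#include <cassert>
#include <climits>

// Undefined when x == INT_MAX. Because the compiler may assume that never
// happens, it is free to compile this as if it were `return true;`.
inline bool next_is_larger(int x) { return x + 1 > x; }

// Well defined for every int: no arithmetic that can overflow.
inline bool next_is_larger_checked(int x) { return x != INT_MAX; }

// Increment with the precondition spelled out instead of assumed.
inline int increment(int x) {
    assert(x != INT_MAX);
    return x + 1;
}

// Unsigned arithmetic wraps modulo 2^N by definition, so this is always fine.
inline unsigned wrapping_increment(unsigned x) { return x + 1u; }
```

Calling next_is_larger(INT_MAX) is the case where no particular result can be relied upon; the checked and unsigned versions have a single defined answer on every implementation.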
A compiler is required to define a consistent value for sizeof(int), because that's implementation-defined. [0]
> Another consequence is that "undefined behavior" isn't undefined in the context of a specific implementation because you can look at what it does and see how it's defined in that implementation.
This isn't right.
Unless the compiler's documentation tells you that you can rely upon its handling of the relevant undefined behaviour, the compiler is not required to provide consistent behaviour for any particular kind of undefined behaviour.
> In this case, libc++ is essentially part of the implementation
It's about maintenance. Things which are "undefined behavior" can actually have a defined behavior if the compiler provides it. By doing something undefined in the STL, the compiler authors are saying that the compiler will support it, and that if the compiler changes, the authors will also change the library. If you rely on this behavior yourself, there's no guarantee that the compiler won't change it in a future update, thereby breaking your code.
It will only ever be compiled with clang, so as long as clang doesn't implement this behavior in a way that will cause it to be incorrect, that's fine.
If it were to be special-cased, it would probably require an attribute or a pragma or something. While it's not unheard of for compilers to automatically detect they are compiling the standard library, it's fairly rare.
libc++ is compiled by compilers other than clang. It's a completely invalid assumption that it will only ever be compiled by clang, because it has never been compiled only by clang.
I say this as someone with a full-time paid job supporting libc++ compiled by another compiler for a commercial organization in a safety context.