One of the things that I don't like about TCP is that it really goes out of its way to emulate a stream of data over a wire. This then leads to things like Nagle's algorithm that deliberately slow it down.
This is problematic when doing things like HTTP, or any of the infinite RPC schemes that come into popularity every few years (REST, Thrift, Protocol Buffers, etc.). As programmers, we're trying to send a request of known length and get a response whose length the server knows. Because TCP makes this look like a wire stream, the protocol implementer needs to disable Nagle and implement logic that figures out where requests begin and end, even though those boundaries are most likely the packet boundaries.
Think of it this way: an HTTP GET request may fit into one or two IP packets, but the nature of TCP streams means that the server's HTTP parser needs to find the \n\n to know where the request headers end. Instead, the stream API itself should just be able to say, "here's a packet of 300 bytes!" Furthermore, if the client doesn't disable Nagle, its TCP stack may deliberately delay sending the header for 200 ms.
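To make that concrete, here's a minimal sketch (Python, with illustrative names) of the boundary-hunting a server has to do because TCP only hands it bytes:

    import socket

    def read_headers(conn: socket.socket) -> bytes:
        # TCP won't say "here is one request"; the server must scan the
        # byte stream for the blank line (\r\n\r\n) that ends the headers.
        buf = b""
        while b"\r\n\r\n" not in buf:
            chunk = conn.recv(4096)
            if not chunk:
                raise ConnectionError("peer closed before headers ended")
            buf += chunk
        return buf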
The reality of the TCP API is that it's great for long streams of data. It just isn't optimized for the other half of its use cases.
It's a common misconception that Nagle's algorithm causes the delay problem. The real problem is TCP delayed ACKs: the stall happens when the two interact, with the sender holding a small segment while it waits for an ACK that the receiver is deliberately delaying. Nagle's can improve network efficiency (and latency, as a result) on slow links. On fast links it shouldn't have much impact at all.
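For reference, these are the two knobs involved, sketched in Python. TCP_QUICKACK is Linux-specific and the kernel can silently re-enable delayed ACKs, so it usually has to be reasserted after each read; treat this as an illustration, not a one-shot fix:

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable Nagle: small segments go out immediately instead of
    # waiting for previously sent data to be ACKed.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Linux-only: ask for immediate ACKs. This flag is not sticky, so
    # real code sets it again after every recv().
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)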
The rest of what you said doesn't make much sense to me. Searching for \n\n is a property of HTTP; it has nothing to do with TCP. TCP is meant as a generic connection-oriented protocol that runs over packet switching, and it's largely optimized for that.
The main benefit of QUIC, it seems, is that it hides itself inside UDP so middleboxes don't tamper with it. It isn't really a fault of TCP itself that middleboxes mess it up so much.
The Nagle toggle and relegating record boundaries to the application level sound like quite small potatoes (and have significant upsides too). Those probably wouldn't be the main things to optimize if protocol designers could start from scratch with perfect hindsight.
That doesn't make much sense. HTTP being that way was a deliberate decision by its creators. They could just as well have designed it so that the request header is prefixed by its length as a 4-byte value, for example. You definitely still want the stream representation for that, rather than juggling packet fragmentation yourself. You can simply disable Nagle for your own protocol if that speeds things up. The only thing different in your suggestion is that adding and handling the length prefix would be done by the network API instead of your app, which I don't think is that much of a hassle anyway.
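A minimal sketch of that kind of length-prefixed framing on top of a plain TCP socket (the 4-byte big-endian prefix and the helper names are just illustrative assumptions):

    import socket
    import struct

    def send_msg(sock: socket.socket, payload: bytes) -> None:
        # Prefix each message with its length as a 4-byte big-endian int.
        sock.sendall(struct.pack(">I", len(payload)) + payload)

    def recv_exact(sock: socket.socket, n: int) -> bytes:
        # TCP is a byte stream: recv() may return fewer bytes than asked,
        # so loop until exactly n bytes have arrived.
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed mid-message")
            buf += chunk
        return buf

    def recv_msg(sock: socket.socket) -> bytes:
        (length,) = struct.unpack(">I", recv_exact(sock, 4))
        return recv_exact(sock, length)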
In RHEL 7, using TCP_NODELAY and TCP_QUICKACK still results in IP-level conflation, even when you turn off hardware coalescing. The only way I've been able to get one push = one TCP packet is via Solarflare, and even then you have to disable an SF-specific batching setting that still kicks in when Nagle is disabled.