One that we encountered in several services were gRPC ruby clients that semi-regularly blocked on responses for an indeterminate amount of time. We added lots of tracing data on the client and server to instrument where “slowness” was occurring. We would see every trace span look just like you would hope until the message went into the runtime and failed to get passed up to the caller for some random long period of time. Debugging what was happening between the network response (fast) and the actual parsed response being handed to the caller (slow) was quite frustrating, as it requires trying to dig into C bindings/runtime from the ruby client.
It was a couple of years ago, but the Go gRPC library had pretty broken flow control. gRPC depends upon both ends having an accurate picture of in-flight data volumes, both per-stream and per-transport (muxed connection). It's a rather complex protocol, and isn't rigorously specified for error cases. The main problem we encountered was that errors, especially timed-out transactions, would cause the gRPC library to lose track of buffer ownership (in the sense of host-to-host), and result in a permanent decrease of a transport's available in-flight capacity. Eventually it would hit zero and the two hosts would stop talking. Our solution was to patch-out the flow control (we already had app-level mechanisms).
[edit: The flow control is actually done at the HTTP/2 level. However, the Go gRPC library has its own implementation of HTTP/2.]
This isn’t to say it happened on every response..it was a relatively small fraction. But, it was enough to tell _something_ was going on. Who knows, it could be something quirky on the network and not even a gRPC issue. But, because the runtime is so opaque, it made debugging quite difficult.