Should also note that the "arrays are values" thing doesn't actually matter in practice. I've written a wide variety of Go code over the last four-plus years, and I can only think of two places where I've used an array.
"<-c", regardless of what's on the left side of it will block in every case until there's a messages available in the channel (note that close is a message).
I just notice this section has been removed. I'll be less angry now. :)
It's one at a time. But since the server only performs a one-shot task, with no long-lived "hang on and wait 30 seconds" style connections, and the default socket backlog of TcpServer is > 100, every client gets served within a delay of (0~99) * 6ms.
In short, every round is fully served, and the concurrency level is 100.
It doesn't return the exact same result, but since you're not verifying the results, it is effectively the same (4 bytes in, 4 bytes back out). I did slightly better with a hand-crafted one.
I made a small change to the way the client works, buffering reads and writes independently (visible in the linked version), and I get similar numbers (my local runs dropped from ~12 to .038). This is that version: http://play.golang.org/p/8fR6-y6EBy
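Not the linked playground code, but my guess at the shape of that change, with the address and request count assumed: writes get queued up through one buffer and the replies are drained separately afterwards.

    package main

    import (
        "bufio"
        "io"
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201") // address assumed from the benchmark
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // Reads and writes buffered independently.
        r := bufio.NewReader(conn)
        w := bufio.NewWriter(conn)

        const n = 200 // request count assumed

        // Queue up every ping first...
        for i := 0; i < n; i++ {
            if _, err := w.Write([]byte("Ping")); err != nil {
                log.Fatal(err)
            }
        }
        if err := w.Flush(); err != nil {
            log.Fatal(err)
        }

        // ...then drain all of the 4-byte replies.
        buf := make([]byte, 4)
        for i := 0; i < n; i++ {
            if _, err := io.ReadFull(r, buf); err != nil {
                log.Fatal(err)
            }
        }
    }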
Now, I don't know scala, but based on the constraints of the program, these actually all do the same thing. They time how long it takes to write 4 bytes * N and read 4 bytes * N. (my version adds error checking). The go version is reporting a bit more latency going in and out of the stack for individual syscalls.
I suspect the scala version isn't even making those, as it likely doesn't need to observe the answers.
You just get more options in a lower level language.
I think you're on the right track in supposing that there can't be a huge performance difference in such a simple task, given that both languages are compiled and reasonably low-level. The most plausible explanation would amount essentially to a misconfigured library, not a fundamental advantage due to say, advanced JVM JIT. Your suggestion to try server-{a,b} x client-{a,b} is also a good one.
Your modified Go server doesn't return "Pong" for "Ping". It returns "Ping". And the "small change" version is nonsense. It's fundamentally different: you're firing off all your requests before waiting for any replies, and so hiding the latency of the more common RPC-style request-response chain, which is a real problem.
You speculate a lot ("hiding some magic" "likely doesn't need to observe the answers") when you haven't offered any insight.
EDIT: Nagle doesn't matter here - it doesn't delay any writes once you're reading (waiting for the server's response). It only affects 2+ consecutive small writes (here I'm trusting http://en.wikipedia.org/wiki/Nagle's_algorithm - my own recollection was fuzzy). If Go sleeps client threads between the ping and the read-response call then I suppose it would matter (but only a little? and other comments say that Go defaults to no Nagle algorithm anyway).
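For what it's worth, if you wanted to rule Nagle in or out explicitly rather than trust the defaults, Go exposes it on the TCP connection; a sketch (address assumed):

    package main

    import (
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201") // address assumed
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        if tc, ok := conn.(*net.TCPConn); ok {
            // Go's default is SetNoDelay(true), i.e. Nagle already off.
            // Passing false turns Nagle back on, if you wanted to measure its effect.
            if err := tc.SetNoDelay(false); err != nil {
                log.Fatal(err)
            }
        }
    }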
> The most plausible explanation would amount essentially to a misconfigured library, not a fundamental advantage due to say, advanced JVM JIT.
Really, the most plausible explanation? I'd say the most plausible explanation is that M:N scheduling has always been bad at latency and fair scheduling. That's why everybody else abandoned it when that matters. It's basically only good for when fair and efficient scheduling doesn't matter, like maths for instance, which is why it's still used in Haskell and Rust. I wouldn't be surprised to see Rust at least abandon M:N soon though once they start really optimizing performance.
Interestingly, both the go client and the scala client perform at the same speed when talking to the scala server (~3.3s total), but the scala client performs much faster when talking to the go server (~1.9s total), whereas the go client performs much worse (~23s total, ~15s with GC disabled).
I thought the difference might partly be in socket buffering on the client, so I printed the size of the send and receive buffers on the socket in the scala client, and set them the same on the socket in the go client. This didn't actually bring the time down. Huh.
My next thought was that scala is somehow being more parallel when it evaluates the futures in Await.result. Running `tcpdump -i lo tcp port 1201` seems to confirm this. The scala client has a lot more parallelism (judging by packet sequence ids). Is that really because go's internal scheduling of goroutines is causing lock contention or lots of context switching?
> Current goroutine scheduler limits scalability of concurrent programs written in Go, in particular, high-throughput servers and parallel computational programs. Vtocc server maxes out at 70% CPU on 8-core box, while profile shows 14% is spent in runtime.futex(). In general, the scheduler may inhibit users from using idiomatic fine-grained concurrency where performance is critical.
Bear in mind that was written before Go 1.1. Additionally, Dmitry has taken steps to address CPU underutilization and has been working with the rest of the Go team on preemption. I think these improvements will make it into Go 1.2, fingers crossed.
Best response here. I spent weeks trying to get a go OpenFlow controller on par with Floodlight (java). I finally gave up on tcp performance and moved on when I realized scheduling was the problem.
Interesting, but now I'm even more confused. How can we possibly explain that a (go client -> go server) (which are in separate go processes) performs far worse than (go -> scala server), given that the go server seems to be better when using the scala client?
The comments on the article page have a different report which doesn't suffer from this implausibility:
> Interesting, but now I'm even more confused. How can we possibly explain that a (go client -> go server) (which are in separate go processes) performs far worse than (go -> scala server), given that the go server seems to be better when using the scala client?
I've been curious about that as well. The major slowdown seems to be related to a specific combination of go server and client. I don't have a good explanation. I'd love to hear from someone familiar with go internals.
> go server + go client 22.02125152
> ...
> scala server + go client 4.766823392
I'm curious: are you saying Go is M:N and JVM is not? I had to look up M:N - http://en.wikipedia.org/wiki/Thread_(computing)#M:N_.28Hybri... - but ultimately I don't know anything about JVM or Go threading, and your comment didn't go enough into detail for me to follow your reasoning.
Yes, I forget the audience. Go uses M:N scheduling, meaning it multiplexes M of its own threads on top of N OS threads. The JVM uses 1:1, like basically every other program, where the kernel does all the scheduling.
The basic problem with M:N scheduling is that the OS and program work against each other because they have imperfect information, causing inefficiencies.
Yes, but can Go actually use anything else? Fine-grained concurrency after the CSP fashion, after all, is the whole driving force behind it, and it's in the language spec.
Are hybrid approaches worth it (exposing some details so that Go network server can get the right service from the OS)? I'm not sure how much language complexity Go-nuts will take, so they'll probably look for clever heuristic tweaks instead.
You can turn off M:N on a per-thread (really per-thread-group) basis in Rust and we've been doing that for a while in parts of Servo. For example, the script/layout threads really want to be a separate thread from GL compositing.
Userland scheduling is still nice for optimizing synchronous RPC-style message sends so that they can switch directly to the target task without a trip through the scheduler. It's also nice when you want to implement work stealing.
Can you just have 1 thread per running task and give the thread back to a pool when the task waits for messages? Then for synchronous RPC you can swap the server task onto the current thread without OS scheduling and swap it back when it's done. You just need a combined 'send response and get next message' operation so the server and client can be swapped back again. This seems way easier and more robust, and you don't need work stealing since each running task has its own thread... what am I missing?
It doesn't work if you want to optimistically switch to the receiving task, but keep the sending task around with some work that it might like to do if other CPUs become idle. (For example, we've thought about scheduling JS GCs this way while JS is blocked on layout.)
Is the OS not scheduling M runnable threads on N cores? Blocking/non-blocking is just an API distinction, and languages implement one in terms of the other.
They are threads. Technically they are "green threads". The runtime does not map them one-to-one onto OS threads, although technically it could if it chose to, because goroutines are abstract things and the mapping to real threads is a platform decision.
Buffering impacts performance when it transforms many small writes into one big write (same for reads). In this case, since you are waiting for the answer at every iteration, I'm not sure I see how it could have an impact.
> Your modified Go server doesn't return "Pong" for "Ping".
The program doesn't read the result, so it doesn't matter. Returning Pong isn't harder, but why write all that code if it's going to be ignored anyway?
> It's fundamentally different: you're firing off all your requests before waiting for any replies, and so hiding the latency of the more common RPC-style request-response chain, which is a real problem.
As I said, the program isn't correlating the responses with the requests in the first place -- or even validating it got one. I don't know scala, but I've done enough benchmarking to watch even less sophisticated compilers do weird things with ignored values.
I made a small change that produced semantically the same program (same validation, etc...). It had similar performance to the scala one. If you don't think that's helpful, then add further constraints.
Compilers do not restructure a causal chain of events between a client and server in a different process. It's very easy to understand this when you realize that send -> wait for response and read it will result in certain system calls, no matter the language.
[Send 4 bytes * 200, then (round trip latency later) receive 4 bytes * 200] is fundamentally different than [(send 4 bytes, then (round trip latency later) receive 4 bytes) * 200]. Whether the message content is "ignored" is irrelevant.
Or, put another way, it's ridiculous for you to modify the Go program in that way (which will very likely send and receive only a single TCP segment over the localhost "network") and report the faster time as if it means anything. If you modify both programs in that way, fine. But it's something completely different.
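To make the contrast concrete, the request-response shape under discussion is just this (a sketch; address, count, and names are assumed):

    package main

    import (
        "io"
        "log"
        "net"
    )

    // pingPong is the strict request-response shape: each iteration pays a
    // full round trip before the next write goes out.
    func pingPong(conn net.Conn, n int) error {
        buf := make([]byte, 4)
        for i := 0; i < n; i++ {
            if _, err := conn.Write([]byte("Ping")); err != nil {
                return err
            }
            if _, err := io.ReadFull(conn, buf); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201") // address assumed
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        if err := pingPong(conn, 200); err != nil {
            log.Fatal(err)
        }
    }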
Closing the channel is fine, I suppose, although the supervising goroutine now looks a bit odd:
select {
case _, _ = <-doneChannel:
    // Other goroutine is now done
}
It's so implicit that you pretty much have to add a comment to the effect of "this will trigger when the channel is closed", whereas the "case <- doneChannel" is so obvious it doesn't need explaining.
Also, I rather prefer the supervising goroutine to "own" the channel, so it should be the one to close it.
> you can't just have a defer block without a function invocation
Yeah, I was not thinking Go there for a moment. Should have been "defer func() { doneChannel <- true }()".
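Putting the two preferences together, a rough sketch (all names assumed) of the explicit-send version, with the supervising side owning the channel and the worker signaling through the corrected defer:

    package main

    import "fmt"

    func main() {
        // The supervising goroutine owns the channel: it creates it (and would
        // be the one to close it, if close were the signal).
        doneChannel := make(chan bool)

        go func() {
            // Note the trailing () -- the function literal has to be invoked.
            defer func() { doneChannel <- true }()
            // ... actual work ...
        }()

        // With only one case this could be a bare <-doneChannel; select earns
        // its keep once there are other cases (timeouts, other workers, etc.).
        select {
        case <-doneChannel:
            fmt.Println("other goroutine is now done")
        }
    }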
> 6. Didn't try to convince anyone runtime.GOMAXPROCS(runtime.NumCPU()) was a good idea (it's not)
So what is an appropriate GOMAXPROCS? As someone who has only dabbled in a few Go tutorials, I would imagine that you would want GOMAXPROCS to be NumCPU() (or even greater) so the goroutine thread pool could "fire on all pistons". Why does Go's scheduler default to GOMAXPROCS=1 instead of NumCPU()?
Or turning that around, do you think it requires all 8 of my cores to copy data over the network? Do you think the two lines of code plus the justification text provide enough value for this application to be worth distracting from its point, just to show why someone should override the default behavior?
Do you believe users shouldn't have any control over the number of cores any particular application consumes?
Have you measured the CPU contention of the application and determined that using more cores is worth the increased overhead of multi-thread exclusion (vs. simpler things happening directly in the scheduler)?
Overall, it has nothing to do with this article and now even more people are going to copy it in more unnecessary places as a cargo-cult "turbo button" for their programs.
If you are going to use an idiom like that, the least you could do is check for the GOMAXPROCS environment variable and only do this as a default when the user hasn't specified otherwise.
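Something like this, i.e. only apply the NumCPU default when the user hasn't already asked for something (a sketch; the runtime honors the GOMAXPROCS environment variable on its own at startup, so this just avoids clobbering an explicit choice):

    package main

    import (
        "os"
        "runtime"
    )

    func init() {
        // Respect an explicit GOMAXPROCS from the environment; only fall back
        // to NumCPU when the user hasn't specified anything.
        if os.Getenv("GOMAXPROCS") == "" {
            runtime.GOMAXPROCS(runtime.NumCPU())
        }
    }

    func main() {
        // ... rest of the program ...
    }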
In my experience the majority of well-written concurrent programs become I/O-bound on a single processor. It's pointless to add more processors to that, and can only slow you down, and Go programs are far better behaved in the non-parallel case.
At other times you should think about the number of processors you want to occupy. If the objective is to behave like an appliance, then 1:1 schedulers:cpus is not a bad ballpark.
The default is 1 isn't it? As I pointed out, this will serve the majority of concurrent code.
The best number of processes to use is equal to the parallelism of the solution. Even with highly concurrent problems, this is still most often 1. If you get it wrong performance will suffer. But in practical terms we have more to worry about, and if you're talking to the disk and the network more than you're computing, parallelism will only increase the contention on those resources. The extra processes will consume more CPU without doing any more useful work.
So the default is pretty good.
By the way 1:1 isn't the limit either. Sometimes you will want more. If the problem truly is parallel enough to exceed your CPUs, you may want additional processes anyway. This will keep things up to speed thanks to the host's scheduler which is typically preemptive, unlike Go's. This sometimes works much better if you can pick and choose which routines run on which schedulers, and I'm not sure if Go exposes that.
I have no idea what my response was about. As I said, I was pretty sick that day. Sorry you had to type all this stuff to explain to me that I'm a moron. :)
This is great feedback. You seem to get what I'm going for.
Thoughts on your specific items:
1. I could probably prioritize the query/doc processing and get most of this out of the way, or something like what I've been thinking about for #4.
2. I've thought about this one for sure. It's actually possible to do externally already, just not very magically. I'll learn more when I get more internal people pushing it.
3. I've been tempted to add replication -- not because I need it, but because it's just really easy. master-slave is completely trivial. master-master isn't hard, but requires tracking a tiny bit of state that I don't have an easy way to handle yet. It'd be worth it just for fun.
4. I have a lot of infrastructure for this. To be efficient, I need something like _all_docs that doesn't include the values and/or something like get that evaluates jsonpointer. Then you could pretty well round-robin your writes and have a front-end that does this last-step work. Harvest the range from all nodes concurrently while collating the keys. Once you find a boundary from every node, you have a fully defined chunk and can start doing reductions on it. A slightly harder, but more efficient integration is to have two-phase reduction and let the leaves do a bunch of the work while the central thing just does collation. You wouldn't be able to stream results in that scenario, though.
5. Is this as simple as disabling DELETE and PUT (where a document doesn't exist)?