Should also note that the "arrays are values" thing doesn't actually matter in practice. I've written a wide variety of Go code over the last four-plus years, and I can only think of two places where I've used an array.
"<-c", regardless of what's on the left side of it will block in every case until there's a messages available in the channel (note that close is a message).
I just notice this section has been removed. I'll be less angry now. :)
It's one at a time. But since the server only performs a one-shot task, with no long-lived "hang on and wait 30 seconds" style connections, and the default socket backlog of TcpServer is > 100, every client gets served within a delay of (0~99) * 6ms.
In short, every round is fully served, and the concurrency level is 100.
It doesn't return the exact same result, but since you're not verifying the results, it is effectively the same (4 bytes in, 4 bytes back out). I did slightly better with a hand-crafted one.
I made a small change to the way the client works, buffering reads and writes independently (visible in the linked version), and I get similar numbers (my local runs dropped from ~12 to .038). This is that version: http://play.golang.org/p/8fR6-y6EBy
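Not the linked playground code, but my guess at the shape of that change, with the address and request count assumed: writes get queued up through one buffer and the replies are drained separately afterwards.

    package main

    import (
        "bufio"
        "io"
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201") // address assumed from the benchmark
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // Reads and writes buffered independently.
        r := bufio.NewReader(conn)
        w := bufio.NewWriter(conn)

        const n = 200 // request count assumed

        // Queue up every ping first...
        for i := 0; i < n; i++ {
            if _, err := w.Write([]byte("Ping")); err != nil {
                log.Fatal(err)
            }
        }
        if err := w.Flush(); err != nil {
            log.Fatal(err)
        }

        // ...then drain all of the 4-byte replies.
        buf := make([]byte, 4)
        for i := 0; i < n; i++ {
            if _, err := io.ReadFull(r, buf); err != nil {
                log.Fatal(err)
            }
        }
    }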
Now, I don't know scala, but based on the constraints of the program, these actually all do the same thing. They time how long it takes to write 4 bytes * N and read 4 bytes * N. (my version adds error checking). The go version is reporting a bit more latency going in and out of the stack for individual syscalls.
I suspect the scala version isn't even making those, as it likely doesn't need to observe the answers.
You just get more options in a lower level language.
I think you're on the right track in supposing that there can't be a huge performance difference in such a simple task, given that both languages are compiled and reasonably low-level. The most plausible explanation would amount essentially to a misconfigured library, not a fundamental advantage due to say, advanced JVM JIT. Your suggestion to try server-{a,b} x client-{a,b} is also a good one.
Your modified Go server doesn't return "Pong" for "Ping". It returns "Ping". And the "small change" version is nonsense. It's fundamentally different: you're firing off all your requests before waiting for any replies, and so hiding the latency of the more common RPC-style request-response chain, which is a real problem.
You speculate a lot ("hiding some magic" "likely doesn't need to observe the answers") when you haven't offered any insight.
EDIT: Nagle doesn't matter here - it doesn't delay any writes once you're reading (waiting for the server's response). It only affects 2+ consecutive small writes (here I'm trusting http://en.wikipedia.org/wiki/Nagle's_algorithm - my own recollection was fuzzy). If Go sleeps client threads between the ping and the read-response call then I suppose it would matter (but only a little? and other comments say that Go defaults to no Nagle algorithm anyway).
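For what it's worth, if you wanted to rule Nagle in or out explicitly rather than trust the defaults, Go exposes it on the TCP connection; a sketch (address assumed):

    package main

    import (
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201") // address assumed
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        if tc, ok := conn.(*net.TCPConn); ok {
            // Go's default is SetNoDelay(true), i.e. Nagle already off.
            // Passing false turns Nagle back on, if you wanted to measure its effect.
            if err := tc.SetNoDelay(false); err != nil {
                log.Fatal(err)
            }
        }
    }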
> The most plausible explanation would amount essentially to a misconfigured library, not a fundamental advantage due to say, advanced JVM JIT.
Really, the most plausible explanation? I'd say the most plausible explanation is that M:N scheduling has always been bad at latency and fair scheduling. That's why everybody else abandoned it when that matters. It's basically only good for when fair and efficient scheduling doesn't matter, like maths for instance, which is why it's still used in Haskell and Rust. I wouldn't be surprised to see Rust at least abandon M:N soon though once they start really optimizing performance.
Interestingly, both the go client and the scala client perform at the same speed when talking to the scala server (~3.3s total), but the scala client performs much faster when talking to the go server (~1.9s total), whereas the go client performs much worse (~23s total, ~15s with GC disabled).
I thought the difference might partly be in socket buffering on the client, so I printed the size of the send and receive buffers on the socket in the scala client, and set them the same on the socket in the go client. This didn't actually bring the time down. Huh.
My next thought was that scala is somehow being more parallel when it evaluates the futures in Await.result. Running `tcpdump -i lo tcp port 1201` seems to confirm this. The scala client has a lot more parallelism (judging by packet sequence ids). Is that really because go's internal scheduling of goroutines is causing lock contention or lots of context switching?
> Current goroutine scheduler limits scalability of concurrent programs written in Go, in particular, high-throughput servers and parallel computational programs. Vtocc server maxes out at 70% CPU on 8-core box, while profile shows 14% is spent in runtime.futex(). In general, the scheduler may inhibit users from using idiomatic fine-grained concurrency where performance is critical.
Bear in mind that was written before Go 1.1. Additionally, Dmitry has taken steps to address CPU underutilization and has been working with the rest of the Go team on preemption. I think these improvements will make it into Go 1.2, fingers crossed.
Best response here. I spent weeks trying to get a go OpenFlow controller on par with Floodlight (java). I finally gave up on tcp performance and moved on when I realized scheduling was the problem.
Interesting, but now I'm even more confused. How can we possibly explain that a (go client -> go server) (which are in separate go processes) performs far worse than (go -> scala server), given that the go server seems to be better when using the scala client?
The comments on the article page have a different report which doesn't suffer from this implausibility:
> Interesting, but now I'm even more confused. How can we possibly explain that a (go client -> go server) (which are in separate go processes) performs far worse than (go -> scala server), given that the go server seems to be better when using the scala client?
I've been curious about that as well. The major slowdown seems to be related to a specific combination of go server and client. I don't have a good explanation. I'd love to hear from someone familiar with go internals.
> go server + go client 22.02125152
> ...
> scala server + go client 4.766823392
I'm curious: are you saying Go is M:N and JVM is not? I had to look up M:N - http://en.wikipedia.org/wiki/Thread_(computing)#M:N_.28Hybri... - but ultimately I don't know anything about JVM or Go threading, and your comment didn't go enough into detail for me to follow your reasoning.
Yes, I forget the audience. Go uses M:N scheduling, meaning it multiplexes M of its own threads on top of N OS threads. The JVM uses 1:1, like basically every other program, where the kernel does all the scheduling.
The basic problem with M:N scheduling is that the OS and program work against each other because they have imperfect information, causing inefficiencies.
Yes, but can Go actually use anything else? Fine-grained concurrency after the CSP fashion, after all, is the whole driving force behind it, and it's in the language spec.
Are hybrid approaches worth it (exposing some details so that Go network server can get the right service from the OS)? I'm not sure how much language complexity Go-nuts will take, so they'll probably look for clever heuristic tweaks instead.
You can turn off M:N on a per-thread (really per-thread-group) basis in Rust and we've been doing that for a while in parts of Servo. For example, the script/layout threads really want to be a separate thread from GL compositing.
Userland scheduling is still nice for optimizing synchronous RPC-style message sends so that they can switch directly to the target task without a trip through the scheduler. It's also nice when you want to implement work stealing.
Can you just have 1 thread per running task and give the thread back to a pool when the task waits for messages? Then for synchronous RPC you can swap the server task onto the current thread without OS scheduling and swap it back when it's done. You just need a combined 'send response and get next message' operation so the server and client can be swapped back again. This seems way easier and more robust, and you don't need work stealing since each running task has its own thread... what am I missing?
It doesn't work if you want to optimistically switch to the receiving task, but keep the sending task around with some work that it might like to do if other CPUs become idle. (For example, we've thought about scheduling JS GCs this way while JS is blocked on layout.)
Is the OS not scheduling M runnable threads on N cores? Blocking/non-blocking is just an API distinction, and languages implement one in terms of the other.
They are threads. Technically they are "green threads". The runtime does not map them one-to-one onto OS threads, although technically it could if it chose to, because goroutines are abstract things and the mapping to real threads is a platform decision.
Buffering impacts performance when it transforms many small writes into one big write (same for reads). In this case, since you are waiting for the answer at every iteration, I'm not sure I see how it could have an impact.
> Your modified Go server doesn't return "Pong" for "Ping".
The program doesn't read the result, so it doesn't matter. Returning Pong isn't harder, but why write all that code if it's going to be ignored anyway?
> It's fundamentally different: you're firing off all your requests before waiting for any replies, and so hiding the latency of the more common RPC-style request-response chain, which is a real problem.
As I said, the program isn't correlating the responses with the requests in the first place -- or even validating it got one. I don't know scala, but I've done enough benchmarking to watch even less sophisticated compilers do weird things with ignored values.
I made a small change that produced semantically the same program (same validation, etc...). It had similar performance to the scala one. If you don't think that's helpful, then add further constraints.
Compilers do not restructure a causal chain of events between a client and server in a different process. It's very easy to understand this when you realize that send -> wait for response and read it will result in certain system calls, no matter the language.
[Send 4 bytes * 200, then (round trip latency later) receive 4 bytes * 200] is fundamentally different than [(send 4 bytes, then (round trip latency later) receive 4 bytes) * 200]. Whether the message content is "ignored" is irrelevant.
Or, put another way, it's ridiculous for you to modify the Go program in that way (which will very likely send and receive only a single TCP segment over the localhost "network") and report the faster time as if it means anything. If you modify both programs in that way, fine. But it's something completely different.
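To make the contrast concrete, the request-response shape under discussion is just this (a sketch; address, count, and names are assumed):

    package main

    import (
        "io"
        "log"
        "net"
    )

    // pingPong is the strict request-response shape: each iteration pays a
    // full round trip before the next write goes out.
    func pingPong(conn net.Conn, n int) error {
        buf := make([]byte, 4)
        for i := 0; i < n; i++ {
            if _, err := conn.Write([]byte("Ping")); err != nil {
                return err
            }
            if _, err := io.ReadFull(conn, buf); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        conn, err := net.Dial("tcp", "localhost:1201") // address assumed
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        if err := pingPong(conn, 200); err != nil {
            log.Fatal(err)
        }
    }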
Closing the channel is fine, I suppose, although the supervising goroutine now looks a bit odd:
select {
case _, _ = <-doneChannel:
    // Other goroutine is now done
}
It's so implicit that you pretty much have to add a comment to the effect of "this will trigger when the channel is closed", whereas the "case <- doneChannel" is so obvious it doesn't need explaining.
Also, I rather prefer the supervising goroutine to "own" the channel, so it should be the one to close it.
> you can't just have a defer block without a function invocation
Yeah, I was not thinking Go there for a moment. Should have been "defer func() { doneChannel <- true }()".
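Putting the two preferences together, a rough sketch (all names assumed) of the explicit-send version, with the supervising side owning the channel and the worker signaling through the corrected defer:

    package main

    import "fmt"

    func main() {
        // The supervising goroutine owns the channel: it creates it (and would
        // be the one to close it, if close were the signal).
        doneChannel := make(chan bool)

        go func() {
            // Note the trailing () -- the function literal has to be invoked.
            defer func() { doneChannel <- true }()
            // ... actual work ...
        }()

        // With only one case this could be a bare <-doneChannel; select earns
        // its keep once there are other cases (timeouts, other workers, etc.).
        select {
        case <-doneChannel:
            fmt.Println("other goroutine is now done")
        }
    }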
> 6. Didn't try to convince anyone runtime.GOMAXPROCS(runtime.NumCPU()) was a good idea (it's not)
So what is an appropriate GOMAXPROCS? As someone who has only dabbled in a few Go tutorials, I would imagine that you would want GOMAXPROCS to be NumCPU() (or even greater) so the goroutine thread pool could "fire on all pistons". Why does Go's scheduler default to GOMAXPROCS=1 instead of NumCPU()?
Or turning that around, do you think it requires all 8 of my cores to copy data over the network? Do you think the two lines of code plus the justification text provide enough value for this application to be worth distracting from its point, just to show why someone should override the default behavior?
Do you believe users shouldn't have any control over the number of cores any particular application consumes?
Have you measured the CPU contention of the application and determined that using more cores is worth the increased overhead of multi-thread exclusion (vs. simpler things happening directly in the scheduler)?
Overall, it has nothing to do with this article and now even more people are going to copy it in more unnecessary places as a cargo-cult "turbo button" for their programs.
If you are going to use an idiom like that, the least you could do is check for the GOMAXPROCS environment variable and only do this as a default when the user hasn't specified otherwise.
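Something like this, i.e. only apply the NumCPU default when the user hasn't already asked for something (a sketch; the runtime honors the GOMAXPROCS environment variable on its own at startup, so this just avoids clobbering an explicit choice):

    package main

    import (
        "os"
        "runtime"
    )

    func init() {
        // Respect an explicit GOMAXPROCS from the environment; only fall back
        // to NumCPU when the user hasn't specified anything.
        if os.Getenv("GOMAXPROCS") == "" {
            runtime.GOMAXPROCS(runtime.NumCPU())
        }
    }

    func main() {
        // ... rest of the program ...
    }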
In my experience the majority of well-written concurrent programs become I/O-bound on a single processor. It's pointless to add more processors to that, and can only slow you down, and Go programs are far better behaved in the non-parallel case.
At other times you should think about the number of processors you want to occupy. If the objective is to behave like an appliance, then 1:1 schedulers:cpus is not a bad ballpark.
The default is 1 isn't it? As I pointed out, this will serve the majority of concurrent code.
The best number of processes to use is equal to the parallelism of the solution. Even with highly concurrent problems, this is still most often 1. If you get it wrong performance will suffer. But in practical terms we have more to worry about, and if you're talking to the disk and the network more than you're computing, parallelism will only increase the contention on those resources. The extra processes will consume more CPU without doing any more useful work.
So the default is pretty good.
By the way 1:1 isn't the limit either. Sometimes you will want more. If the problem truly is parallel enough to exceed your CPUs, you may want additional processes anyway. This will keep things up to speed thanks to the host's scheduler which is typically preemptive, unlike Go's. This sometimes works much better if you can pick and choose which routines run on which schedulers, and I'm not sure if Go exposes that.
I have no idea what my response was about. As I said, I was pretty sick that day. Sorry you had to type all this stuff to explain to me that I'm a moron. :)
This is great feedback. You seem to get what I'm going for.
Thoughts on your specific items:
1. I could probably prioritize the query/doc processing and get most of this out of the way, or something like what I've been thinking about for #4.
2. I've thought about this one for sure. It's actually possible to do externally already, just not very magically. I'll learn more when I get more internal people pushing it.
3. I've been tempted to add replication -- not because I need it, but because it's just really easy. master-slave is completely trivial. master-master isn't hard, but requires tracking a tiny bit of state that I don't have an easy way to handle yet. It'd be worth it just for fun.
4. I have a lot of infrastructure for this. To be efficient, I need something like _all_docs that doesn't include the values and/or something like get that evaluates jsonpointer. Then you could pretty well round-robin your writes and have a front-end that does this last-step work. Harvest the range from all nodes concurrently while collating the keys. Once you find a boundary from every node, you have a fully defined chunk and can start doing reductions on it. A slightly harder, but more efficient integration is to have two-phase reduction and let the leaves do a bunch of the work while the central thing just does collation. You wouldn't be able to stream results in that scenario, though.
5. Is this as simple as disabling DELETE and PUT (where a document doesn't exist)?