That would certainly be better experimental design, since you would be controlling for other factors. On the other hand, precisely measuring the improvement in conversion isn't particularly important in this case: it's already clear that faster is better, so the measurement wouldn't give you much actionable information, while you'd be giving half of your users a worse experience. In a situation where you were genuinely uncertain about which of two approaches was better, it would definitely make sense to run them in parallel, as you suggest, so that you had a fair comparison.
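(For concreteness, "run them in parallel" just means a standard split test: bucket each visitor deterministically into one variant and compare conversion between the buckets. A minimal sketch, with a made-up `assign_variant` helper -- nothing here reflects 37signals' actual setup:)

```python
import hashlib

def assign_variant(visitor_id: str, variants=("fast", "slow")) -> str:
    """Deterministically bucket a visitor so they always see the same variant."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Each visitor id maps to one stable bucket, so you can compare conversion
# rates between the "fast" and "slow" groups over the same time period.
print(assign_variant("visitor-1234"))
```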
You're right: 37signals don't have to do this test properly. Their prerogative. However, until they do, the 5% figure and implied causation are meaningless.
We don't know whether their normal variation in conversion rate is even in the same ballpark as that 5%. Thousands of possible confounds. Plus, there's no solid a priori reason why shaving off latency should improve their conversion rates drastically -- Basecamp doesn't rest on a large number of small, potentially impulsive transactions the way Amazon does. Without more data (or at least an explanation), this doesn't tell us anything.
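(To make the "same ballpark" point concrete, here's a rough sketch of the two-proportion z-test you'd want before reading anything into a 5% relative lift. The numbers are entirely made up for illustration -- a 2% baseline conversion rate and 20,000 visitors per arm -- not anything 37signals has published:)

```python
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Z statistic for the difference between two observed conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical figures: 2.0% baseline conversion vs. a 5% relative lift (2.1%),
# with 20,000 visitors in each arm. The result is roughly 0.7, well under 1.96,
# i.e. a lift of that size would be indistinguishable from noise in this setup.
print(two_proportion_z(conv_a=400, n_a=20_000, conv_b=420, n_b=20_000))
```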
I agree, and I meant to emphasize more clearly that I don't think the 5% figure is meaningful (and that it shouldn't have been stated without the appropriate caveats in the article).
My main point was simply that I don't think it's prudent to create a worse experience for half of your users when you're so unlikely to gain any actionable information from it. It would be quite extraordinary if they found that the speed increase caused either a decline in conversion or a huge improvement, so it's safe to say the measurement wouldn't make them reverse the change or allocate significantly more resources toward faster load times.
But, to reiterate, I agree that they should not have made the claim about 5% conversion when it wasn't properly supported.
Yeah, serving up two versions simultaneously and split testing them would be more scientific, but I appreciate the number anyway. It was an after-the-fact observation rather than an original goal, so I wouldn't expect him to go back and deploy the slower version just to test the number.
The hole here is whether they unknowingly got a new influx of traffic that was 5% more likely to convert, skewing his final observation; I'd say that's unlikely. Your point is good in general, however.