In case anyone finds this interesting (as it is so glaringly apropos to this "Buy Now" blog post): I ran an A/B test recently of "Buy Now" vs "Purchase" (which is a drastically different and more subtle question than comparing whether people want to install a demo vs. make a purchase), and found almost no difference.
Unfortunately, due to the usage of a multi-armed bandit algorithm (which attempts to "exploit" already learned knowledge to not lose sales during the test), my data is somewhat "skewed" (in that I have an order of magnitude more tests for one of the hypotheses), but here are the raw results:
"Purchase": 35,715 sales from 3,255,882 impressions (1.097%)
"Buy Now": 4,042 sales from 376,227 impressions (1.074%)
If you compare the confidence intervals using a Beta distribution, it is difficult to feel comfortable claiming "Purchase" is a winner, but that small edge is why the algorithm kept preferring it over the other variant. Put differently: despite the large sample, I believe that tiny difference is not statistically significant.
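A minimal sketch of that Beta-interval comparison, using the raw counts quoted above and assuming SciPy and a uniform Beta(1, 1) prior (the prior choice is an assumption, not something stated in the comment); drawing from these same posteriors is also roughly how a Thompson-sampling style bandit would decide which variant to keep showing:

```python
# Sketch of the Beta-posterior comparison described above.
# Counts are from the comment; the Beta(1, 1) prior is an assumption.
import numpy as np
from scipy.stats import beta

purchase = beta(1 + 35_715, 1 + 3_255_882 - 35_715)   # "Purchase" posterior
buy_now  = beta(1 + 4_042,  1 + 376_227 - 4_042)      # "Buy Now" posterior

# 95% credible intervals for each conversion rate
print("Purchase:", purchase.ppf([0.025, 0.975]))
print("Buy Now: ", buy_now.ppf([0.025, 0.975]))

# Monte Carlo estimate of P(Purchase converts better than Buy Now)
rng = np.random.default_rng(0)
n = 200_000
p_better = np.mean(purchase.rvs(n, random_state=rng) >
                   buy_now.rvs(n, random_state=rng))
print("P(Purchase > Buy Now):", p_better)
```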
(Additionally, for completeness, and as this is important for anyone who might care about this experiment: my app had previously said "Purchase", so there are likely guides online that tell users how to buy things, and returning users may have remembered the old label, which may have made "Buy Now" ever so slightly more confusing.)
Did you control for the fact that the 'Try demo' button is red and the 'Buy now' button is blue? I'm sure I've read about A/B test studies (probably on the VWO blog itself) showing that red buttons had higher clickthrough rates than other colours, which could be a big confounding factor.
I just went to the website that did the test, and found that they currently have two buttons: "Try Demo for Free" and "Plans and Pricing", not just the "Try Demo for Free" button as in the winning test.
This makes sense to me. As a matter of fact, while reading the post I was thinking that before trying a demo I usually check the price I would have to pay in the end. But I wonder why this variation wasn't discussed in the article. "Plans and Pricing" isn't a call to action, but having that second button is for sure different than having just one, and knowing what clickthrough rate it gets would be very relevant.
"selling" them on the demo version is not the same as selling the product out (which brings in direct cash in the bank and is the effective "end game").
You need to multiply the demo download conversion rate by the fraction who go ahead and make a purchase after trying said demo.
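To make that concrete, here is a tiny illustration with purely hypothetical numbers (none of these rates come from the case study); the point is that the end-to-end rate is what counts:

```python
# Hypothetical illustration; none of these numbers are from the case study.
visitors = 10_000

demo_rate_a, demo_to_sale_a = 0.060, 0.10   # variant A: more demos, weaker buyers
demo_rate_b, demo_to_sale_b = 0.045, 0.15   # variant B: fewer demos, stronger buyers

sales_a = visitors * demo_rate_a * demo_to_sale_a   # 600 demos -> 60 sales
sales_b = visitors * demo_rate_b * demo_to_sale_b   # 450 demos -> 67.5 sales

# Variant A "wins" on demo downloads but loses on actual sales,
# which is why measuring only demo conversions can mislead.
print(sales_a, sales_b)
```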
The fact that this was not done calls the quality of Visual Website Optimizer's methodology into question, especially when they put out blog posts pimping the results. Up to this point I've been relatively happy with their work and contributions to the A/B testing field, which makes this dodgy conclusion a bit of a shame really.
I wrote the case study. The results we publish actually depend on what our customers care about in those particular tests. We don't actively recommend which tests to run (unless they ask), so if they don't measure impact on sales (which we would have recommended had they asked), we can't ask them for the data.
Most customers would know if their sales were negatively impacted during a test, so if you happen to increase demos while not hurting sales, there's nothing wrong with that.
Yes, the customer hinted that their sales did increase and they were very happy with the results, but they don't fully reveal this data (as it can be sensitive).