I think writing Macbeth isn't a good indicator of anything except being Shakespeare. Asking it a few questions about general concepts like logic, math, and patterns (relationships in Macbeth, maybe?) would probably yield results that are much more likely to pass the test.
I disagree on all counts. The first program generated by such a procedure would very likely be the constant program that happens to always respond with the correct answers to your test. It depends on how you randomly generate code, of course, but I doubt the complexity of a "general AI program" would ever be smaller than the complexity of the constant program which by luck returns the correct answers to your test.
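To make that concrete, here is a toy sketch in Python. Everything in it is my own assumption for illustration: the "test" is a made-up four-question even/odd quiz, and the "generated programs" are just every possible fixed answer sheet, tried in order. It is not meant to model any real code generator, only to show how cheaply a constant program passes a fixed test.

```python
import itertools

# Hypothetical fixed test: "is n even?" (questions and answers are made up).
TEST = [(2, True), (7, False), (10, True), (3, False)]
# Hypothetical fresh questions the search never sees.
HELD_OUT = [(4, True), (9, False), (5, False), (8, True)]


def make_constant_program(answers):
    """Build a program that ignores the question and replays fixed answers."""
    remaining = iter(answers)
    return lambda _question: next(remaining)


def passes(answers, quiz):
    """Run a fresh constant program built from `answers` against `quiz`."""
    program = make_constant_program(answers)
    return all(program(question) == expected for question, expected in quiz)


def first_passing_constant(quiz):
    """Try every possible answer sheet in order; return the first that passes."""
    sheets = itertools.product([False, True], repeat=len(quiz))
    for tries, answers in enumerate(sheets, start=1):
        if passes(answers, quiz):
            return answers, tries
    raise AssertionError("unreachable: some answer sheet always matches")


answers, tries = first_passing_constant(TEST)
print(f"constant program found after {tries} of at most {2 ** len(TEST)} tries")
print("passes the fixed test:", passes(answers, TEST))
print("passes fresh questions:", passes(answers, HELD_OUT))
```

With k yes/no questions, the search needs at most 2^k tries to hit an answer sheet that passes, and that sheet knows nothing about evenness, so it fails as soon as the questions change.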