When you're actually part of conversations about purchasing a cluster (which, I gather from the way you talk, you haven't been), one costing $400k ~ $1M for a smallish/mid-sized system, arguments like "I don't remember the numbers", "they're listed somewhere", or "there's some other random DFT code" aren't effective. The hard fact, which I hate, is that you get results faster with MKL on Intel than with any of the alternatives. That's even more true of the proprietary packages that are the gold standards.
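For concreteness, the kind of single-node comparison these claims rest on is roughly this: time the same DGEMM source linked once against MKL and once against OpenBLAS or BLIS, and compare GFLOP/s. A minimal sketch follows; the matrix size, repeat count, and link flags are placeholder assumptions, not anyone's actual benchmark settings, and the CBLAS header name differs between libraries.

    /* Rough single-node DGEMM timing sketch.  Link the same source against
     * MKL (e.g. -lmkl_rt) or OpenBLAS (-lopenblas) and compare GFLOP/s.
     * N and the repeat count are arbitrary choices for illustration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>   /* or the CBLAS header shipped by your BLAS */

    int main(void) {
        const int n = 4096, reps = 5;
        double *a = malloc(sizeof(double) * n * n);
        double *b = malloc(sizeof(double) * n * n);
        double *c = malloc(sizeof(double) * n * n);
        for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 0.5; c[i] = 0.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0, a, n, b, n, 0.0, c, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double flops = 2.0 * n * n * n * reps;   /* ~2*N^3 per DGEMM */
        printf("DGEMM %dx%d: %.2f GFLOP/s\n", n, n, flops / secs / 1e9);
        free(a); free(b); free(c);
        return 0;
    }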
> I have compared other DFT code on 64-core Bulldozer with 12-core Sandybridge nodes
And what's that comparison supposed to tell us, aside from the obvious fact that MPI introduces latency? That's just about the number of cores, not the performance of each core. You need to compare a 64-core AMD node against a 64-core Intel node.
I don't remember the numbers, because the measurements for the £1M purchase were made maybe five years ago, but they taught a useful lesson. I didn't see figures in what I was responding to. If I'd had more influence on the purchase, as opposed to observing the process, we wouldn't have ended up with a pure Sandybridge system, which was a mistake. Anyhow, my all-free-software build of cp2k was faster on it than an all-Intel build on slightly faster CPUs in an otherwise equivalent cluster. I measured and paid attention to the MPI, which benefited everything using alltoallv. The large core-count AMD boxes were simply a better bet for the range of work on a university HPC system. It's not as if most codes peaked in arithmetic intensity and serial performance was a serious problem, even if MKL had been significantly better than the free libraries, which it wasn't.
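The alltoallv measurement is the sort of thing you can reproduce with a few lines of MPI. Here's a rough sketch of such a microbenchmark; the per-rank message size and iteration count are arbitrary assumptions for illustration, not the figures from that procurement.

    /* Minimal MPI_Alltoallv timing sketch.  Build with mpicc and run across
     * the nodes under test; reports the slowest rank's average time per call. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int count = 32768;          /* doubles sent to each rank (arbitrary) */
        const int iters = 100;
        double *sendbuf = malloc(sizeof(double) * count * size);
        double *recvbuf = malloc(sizeof(double) * count * size);
        int *counts = malloc(sizeof(int) * size);
        int *displs = malloc(sizeof(int) * size);
        for (int i = 0; i < size; i++) { counts[i] = count; displs[i] = i * count; }
        for (long i = 0; i < (long)count * size; i++) sendbuf[i] = (double)rank;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int it = 0; it < iters; it++)
            MPI_Alltoallv(sendbuf, counts, displs, MPI_DOUBLE,
                          recvbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        double local = (t1 - t0) / iters, worst;
        MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("alltoallv, %d ranks, %d doubles/rank: %.3f ms/iter (max)\n",
                   size, count, worst * 1e3);

        free(sendbuf); free(recvbuf); free(counts); free(displs);
        MPI_Finalize();
        return 0;
    }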
For a recent exercise in spending rather more money on AMD CPUs for the UK Tier 1 system, look at the Archer2 reference material and the benchmarking for it. It's expected to run large amounts of VASP-like code; www.archer.ac.uk publishes usage of the current system. Circumstances differ, and I'm just pointing out contrary experience, having understood the measurements and what determined them.