I've been reading through this bit by bit over the last few days, mostly to get a handle on how to implement MCMC for Bayesian posteriors, and I have to say it's fantastically written. I wouldn't call it comprehensive or unbiased, but it sets up the web of connections between noisy channels, information theory, statistics, and machine learning pretty much as effortlessly as possible.
Note: I'm buying it entirely because it has wide margins. Many of the calculations he outlines deserve to be worked out in full. Wide margins are absolutely the most important publishing concern for a math/science/engineering-based text.
If you want to learn how to implement MCMC I recommend:
Bayesian Logical Data Analysis for the Physical Sciences by Gregory
Gregory's book explains a lot more of the engineering (autocorrelations, step-size tuning, etc.). Even better, it discusses how to perform model selection using a clever annealing technique. Though model selection may not be of interest to you.
ps - MacKay's book is my nightly reading, so I'm not dissing MacKay :)
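To make the step-size point concrete, here's a minimal random-walk Metropolis sketch. Everything specific in it is my own illustrative assumption, not something from Gregory's or MacKay's texts: the standard-normal target stands in for a real posterior, and the step width and iteration count are arbitrary.

```python
import math
import random

# Hypothetical target: a standard-normal log-posterior stands in for
# whatever Bayesian posterior you actually care about.
def log_post(x):
    return -0.5 * x * x

def metropolis(n_steps, step_size, seed=0):
    """Random-walk Metropolis with a uniform proposal of width step_size."""
    rng = random.Random(seed)
    x = 0.0
    samples, accepted = [], 0
    for _ in range(n_steps):
        prop = x + rng.uniform(-step_size, step_size)
        # Accept with probability min(1, p(prop) / p(x)), done in log space.
        if math.log(rng.random()) < log_post(prop) - log_post(x):
            x = prop
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps

samples, acc_rate = metropolis(20_000, step_size=2.5)
mean = sum(samples) / len(samples)
```

The step size is exactly the knob Gregory spends time on: too small and the chain is highly autocorrelated, too large and almost nothing is accepted; watching `acc_rate` and the autocorrelation of `samples` is how you tune it.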
Fantastic book. The problems are interesting, and nicely bring out the connections between topics that would, on the surface, seem disparate.
Cover and Thomas is more textbookish and, in some ways, more detailed. Personally, I'd read this first, and then take on the interesting topics in Cover and Thomas.
I read a lot of math books, and I'd put this right on top along with Needham's 'Visual Complex Analysis'.
One of my favorite machine learning textbooks, if you can call it that. It's a little oddball though. For any topic, it's extremely interesting and insightful, though usually not comprehensive enough to rely on it as a reference.
I've said it before and I'll say it again: this book is great. It provides a really strong foundational understanding of ML, not the toolbox approach of other textbooks. I think the exposition of coding theory is especially nice compared to, say, Cover and Thomas.
I think to say the book has a heavy focus on frequentist statistics is a little misleading. Jaynes discusses a lot of frequentist methods but the emphasis is on doing so from a very Bayesian point of view.
He uses Bayesian methods in some cases, and non-Bayesian methods in others. For instance, on page 1412 he describes a problem for which Bayesian methods are "not appropriate."
http://omega.albany.edu:8008/ETJ-PS/cc14g.ps
This is anathema to purist Bayesians like Radford Neal, who say that Jaynes's Maximum Entropy method is not consistent with Bayesian methods and that it "doesn't make any sense."
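For anyone who hasn't seen the method being argued over, a sketch of Jaynes's classic dice version of Maximum Entropy: given only a constrained mean, the maximum-entropy distribution over the faces is exponential in the face value. The target mean of 4.5 and the bisection solver here are my illustrative choices, not anything from Jaynes's text.

```python
import math

def maxent_die(target_mean, lo=-5.0, hi=5.0, tol=1e-10):
    """Maximum-entropy distribution over die faces 1..6 with a fixed mean.

    The solution has the form p_i proportional to exp(lam * i); we find
    the Lagrange multiplier lam by bisection on the implied mean, which
    is monotonically increasing in lam.
    """
    faces = range(1, 7)

    def mean_for(lam):
        w = [math.exp(lam * i) for i in faces]
        return sum(i * wi for i, wi in zip(faces, w)) / sum(w)

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid

    lam = (lo + hi) / 2
    w = [math.exp(lam * i) for i in faces]
    z = sum(w)
    return [wi / z for wi in w]

probs = maxent_die(4.5)  # skewed toward high faces, since 4.5 > 3.5
```

The Bayesian objection is not to this calculation but to treating its output as a posterior: it is a distribution chosen to satisfy a constraint, not one updated from a prior by conditioning on data.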
At Cambridge, you find high-level theoretical machine learning research in the Engineering, Computer Science, and Physics faculties. (My spies inside DAMTP don't point to much going on there, but who knows.) None of the above ever seem to talk to each other, sadly.