I just ran some tests on CNL, and indeed the behavior is very different from earlier chips. I am seeing ~15 cycle divs with no pipelining (i.e., the latency and inverse throughput are both 15), versus 36+ cycles latency and 25+ cycles inverse throughput on Skylake.
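To make the "no pipelining" point concrete, here is a rough back-of-the-envelope model, not a benchmark. It assumes a single divider and ignores every other source of overhead; the function names are made up for illustration, and the Skylake figures are just the lower bounds quoted above (36 and 25).

```python
def chain_cycles(n, latency):
    """Cycles for n divisions where each one needs the previous result."""
    return n * latency


def independent_cycles(n, latency, inv_throughput):
    """Cycles for n independent divisions on one divider.

    A new division can start every inv_throughput cycles, and the last
    one finishes latency cycles after it starts.
    """
    return (n - 1) * inv_throughput + latency


N = 1000
# (latency, inverse throughput) taken from the numbers in the post above;
# the Skylake values are lower bounds ("36+" and "25+").
for name, lat, tput in [("CNL", 15, 15), ("Skylake", 36, 25)]:
    dep = chain_cycles(N, lat)
    ind = independent_cycles(N, lat, tput)
    print(f"{name}: dependent {dep}, independent {ind}, ratio {dep / ind:.2f}")
```

Under this model, when latency equals inverse throughput the dependent and independent totals are identical, so independent divisions gain nothing from overlap. With the Skylake-style numbers the independent case comes out noticeably cheaper than the dependent chain.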
Interesting. I found only a few other changes beyond that, so far.