Most modern CPUs have no independent scalar floating-point unit. When scalar floating-point arithmetic is needed, it is sent to the SIMD unit. This pretty much means that scalar and vectorised floating-point operations usually have the same latency. If you do any scalar floating-point operation, the CPU is just doing a vectorised operation with only one useful value.
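To make that concrete, here is a minimal sketch in C using SSE intrinsics, assuming an x86 or x86-64 target with SSE available. The helper names `scalar_add` and `packed_add` are made up for illustration. Both versions work on the same 128-bit XMM registers; the scalar one simply computes a result in lane 0 only.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86/x86-64 target */

/* Hypothetical helper: scalar add done in an XMM register.
   _mm_add_ss (typically the ADDSS instruction) adds lane 0 only;
   lanes 1..3 are just carried over from the first operand. */
static float scalar_add(float a, float b) {
    __m128 va = _mm_set_ss(a);  /* lane 0 = a, lanes 1..3 = 0 */
    __m128 vb = _mm_set_ss(b);  /* lane 0 = b, lanes 1..3 = 0 */
    return _mm_cvtss_f32(_mm_add_ss(va, vb));  /* extract lane 0 */
}

/* Hypothetical helper: packed add on the same register type.
   _mm_add_ps (typically ADDPS) adds all four lanes at once. */
static void packed_add(const float a[4], const float b[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);  /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

Calling `scalar_add(1.5f, 2.25f)` and reading element 0 of a `packed_add` whose first inputs are `1.5f` and `2.25f` should give the same value. This only shows that both forms use the same register file; it says nothing by itself about latency on any particular CPU.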
Is it really true that there's no scalar FPU at all? What about x87?
The instructions are still there even in 64-bit long mode, they use their own registers, and there are enough idiosyncrasies (80-bit extended-double precision, stack-based operations, etc.) that I would expect it to be easier to just include a dedicated scalar x87 FPU than to try to shoehorn x87 compatibility into the SIMD units.