Benchmarking OS primitives (bitsnbites.eu)
72 points by mortenlarsen on March 21, 2018 | hide | past | favorite | 52 comments


I can't help but start a rant about Windows upon reading this.

One of my major complaints with Windows is that things just 'feel slow'. I have to wait very often. Opening an FTP location? Wait for 5 seconds (and it also opens in a new window, leaving the old window open in an unusable state - very confusing). Starting a GUI? Wait for 5 seconds.

My laptop and Raspberry Pi at home both work a lot smoother than the high-end (it's a brand new Dell XPS machine with 8 GB RAM - which I consider high-end) laptop I have at work.

I still find it hard to comprehend that people are buying ridiculously overpowered Windows computers for tasks like browsing and document editing. Developers are at fault too - if it runs smoothly on your $1000+ machine with 32GB RAM, that does not mean that the average user will be able to even use it. Everyone and their mother is jumping at the sustainability hype, but at the same time developers assume that everyone buys a new computer and phone every other year, for the same tasks we've been doing for decades. Once you realize this it's hard to use a Windows system and not cringe at the mess of laggy/unresponsive GUI's.


On the same computer (FX8370E, 16GiB DDR3, no SSD, RX580 gpu), Kubuntu 17.10 runs far faster than Windows 10.

For example, consider the time from boot until you can do something like browse a web page. I didn't do a precise measurement, but on Windows 10 I have to wait like 10 fucking minutes before I can do anything! And the hard disk makes a lot of horrible noises, so Windows must be doing something. On Linux it takes a minute or two, with no noticeable hard disk activity.

I don't know what is messed up with Windows (probably I messed something up myself), but I really hate this.


> Starting a GUI? Wait for 5 seconds.

You're holding it wrong.

> My laptop and Raspberry Pi at home both work a lot smoother than the high-end (it's a brand new Dell XPS machine with 8 GB RAM - which I consider high-end) laptop I have at work.

If your RPi is faster at comparable tasks than that Windows PC, your Windows PC has some extremely serious setup problems.

(if your employer uses anything like the commercial security software mine does, that's one potential problem)


I'd argue that it's more of a culture problem. I regularly have multiple instances of Visual Studio and Visual Studio Code open (together with many Windows Explorer, Notepad, Notepad++ and other windows). Visual Studio in itself is just slow, and often crashes, even if I have just one instance open.

On Linux, I have several terminals open, which is just so much more lightweight.

Of course, from a performance point-of-view, these are not comparable, but that is exactly my point. On windows, everything has a GUI, and everything seems to assume a much more hi-end machine.

P.S. I admit that it was a bit misleading to post this rant under this article, since the article measures raw OS performance, while my point is that lower performance is usually sufficient too, if you use tools that have a single purpose instead of trying to be an OS in itself.


Some of it's culture and some is certainly bloat. VS has slowed down a lot in the last eight years, and VS Code was born slow.

But the performance between the RPi and the Windows machine shouldn't be comparable at all, and if the RPi is coming out ahead for anything more trivial than opening a terminal window, I'd look at the setup of the Windows machine. There is plenty that can get effed up there.


> You're holding it wrong.

OK, but .. how exactly? It's going to be Windows Defender, isn't it?


I have to admit that I immediately thought of the Symantec enterprise software that serves roughly the same purpose, but much more intrusively. That's the sort of thing that seems to kill performance on something with a lot of file opens and closes, like VS, in addition to slowing down app startup. It's popular with businesses.

The oddball feature that causes Windows disk I/O to basically lock up on occasion, and I'm amazed that they haven't turned it off by default by now, is volume shadow copy and the automatic creation of restore points. It's bad on an SSD because it's very slow and uses a lot of space, and on a hard disk it's so slow that it ought to be criminal.


Ever notice how much slower building software using configure is on macOS than Linux? The results here point out why: fork + exec is ~10x slower on macOS.

However, this isn't exactly new information. The general slowness of the OS X kernel has been known for years, via other benchmarks like lmbench. It's one of the reasons they were the first to implement a vDSO-like interface for things like gettimeofday().


Not very objective, since the operating systems were in an unknown state, e.g. there was a third-party antivirus installed on one Windows machine. In such conditions this benchmark doesn't provide any meaningful information.


I find the article very informative from the point of view of developing software packages installable by the end user.

Benchmarking typical environments rather than artificially lean ones is much more helpful in practice.


I didn't think that these benchmarks would get this much attention. I personally made them because I could not believe that identical Windows and Linux machines were performing so differently in typical software development tasks (Git, CMake, GCC, file copying, ...). I threw a few more machines (Raspberry Pi, Mac, ...) into the mix to get some perspective.

Most machines were stock configured (Ubuntu ext4 install for Linux, Windows 10 with Windows Defender, stock macOS install, etc.), so I believe they should be representative of average users.

The benchmark suite is open source and easy to run on your own hardware if you'd like to get more accurate/representative figures for a particular setup.


Depends on what the measurement is for. If you want to help developers pick a faster development setup, you want something realistic.


From experience, git is slow on Windows when dealing with tens of thousands of files, even with an SSD. This is due to the filesystem (NTFS) being rather slow, especially for various stat operations. (If you watch in the Windows Task Manager, this shows up in the "other I/O" column, not reads or writes.)


I am not sure if it is correct to attribute it to NTFS, rather than the Windows VFS layer. IIRC, NTFS is a reasonably sane filesystem.


I don't know why, but NTFS metadata performance just was never very good. I always assumed that was due to the more complex/capable data model compared to what Linux usually uses, but I'm not so sure about that any more given the complexity and yet still better performance of ZFS (btrfs remains too much of a mixed bag to mention here).

However, poorly written applications also play a role here. For some reason explorer.exe requires 1-2 orders of magnitude more I/O time than a plain dir command, and somehow the local search is slower than a manual binary search. How they managed to do that bad a job remains a mystery.


Yeah, poor search performance is definitely an application problem here. Lately I've taken to using Cygwin find instead and it runs just as fast as you'd expect, e.g.

find /cygdrive/c/ -type f -iname '*somefile.tar'


The memory allocation test seems a bit out of place, considering that the allocator is provided by libc and not the OS. Testing something like mmap/VirtualAlloc might have made more sense.


You are not wrong, but a) at least on Unix the libc is certainly considered part of the OS, and b) malloc has to get the memory from the OS eventually, via sbrk or mmap.


I'm not sure I agree with (a), any app that is focused on perf is likely going to use jemalloc or any of the other high performance allocators (tcmalloc), and completely bypass the libc implementation...


> a) at least on unix the libc is certainly considered part of the OS

Maybe on proper UNIX (and BSDs) where you have the whole base system as more or less one package, but on Linux certainly not. Honestly the whole concept of "OS" does not make that much sense on traditional Linux distros. You could consider a whole distro an OS, but then the phrase is so all-encompassing that it becomes meaningless. On the other hand calling just the kernel "OS" is not quite right either.

> b) malloc has to get the memory from the OS eventually, via sbrk or mmap.

Yes, eventually and occasionally. But there is a fairly big disconnect between malloc and the OS. You got me curious, so I ran the code from the article with a few different NUM_ALLOCS values to see how it behaves. These are the results from glibc malloc on my system:

    Benchmark: Allocate/free 10000000 memory chunks (4-128 bytes)...
    70.310378 ns / alloc
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.023622           4      5996           brk
    
    Benchmark: Allocate/free 1000000 memory chunks (4-128 bytes)...
    171.745062 ns / alloc
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.250319          17     15004           brk
    
    Benchmark: Allocate/free 100000 memory chunks (4-128 bytes)...
    33.700466 ns / alloc
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.000203           3        63           brk
    
    Benchmark: Allocate/free 10000 memory chunks (4-128 bytes)...
    30.589104 ns / alloc
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
      0.00    0.000000           0         9           brk
    
    Benchmark: Allocate/free 1000 memory chunks (4-128 bytes)...
    26.941299 ns / alloc
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.000008           2         4           brk
Note how the syscall count blows up at NUM_ALLOCS=1000000, which just happens to be the original value from the article. Yes, I checked and glibc did not fall back to mmap in any of these cases. You can already start seeing why this might not be the best of benchmarks.

Then just for fun, I tried using jemalloc, which is a drop-in replacement for standard malloc. These are the results:

    Benchmark: Allocate/free 10000000 memory chunks (4-128 bytes)...
    68.486404 ns / alloc
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.000186           4        47           mmap
      0.00    0.000000           0         2           brk
    
    Benchmark: Allocate/free 1000000 memory chunks (4-128 bytes)...
    59.097052 ns / alloc
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     96.45    0.000544          16        33           mmap
      3.55    0.000020          10         2           brk
    
    Benchmark: Allocate/free 100000 memory chunks (4-128 bytes)...
    54.659843 ns / alloc
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
      0.00    0.000000           0        24           mmap
      0.00    0.000000           0         2           brk
Well, well, well. It certainly paints a very different picture.


Lots of confounds due to non-uniform hardware, etc, but more importantly these are very artificial micro-benchmarks; systems are (ideally) tuned for performance on the sorts of loads that they will actually be running under, not artificial tests like "create 65k files of 32B each".

In artificial tests like these, you frequently get the best performance by flushing data out as fast as possible, while in most "real-world" scenarios you have some temporal locality that makes keeping data around a win. Optimizing for these sorts of benchmarks can actually harm performance.

Still, fun.


"Launching Programs" should use posix_spawn at least on macOS, it's a distinct syscall there and faster than fork + exec.


Microsoft really needs to fix Windows Defender and search indexing. I have myself measured horrible slowdowns similar to this, 7-10x, doing things like copying many files around. It can make the Linux Subsystem almost unusable.


Benchmarking things like file creation without noting the filesystem or how it was mounted is... interesting.


Yeah also SELinux / AppArmor state would be interesting to know as well.


Anyone else see the create file test results and think of a node_modules joke?


Ah yes, Windows Defender. I always forget about it until it makes some trivial operation take 5x too long. Make sure to add your compilers and build tools to the exclusions list.


That's the easy case. In a typical enterprise environment, a software developer may be faced with several round trips to the centralized/outsourced IT support to get their build folders whitelisted in the company-approved (and forcibly installed) AV software, just to make a simple CMake run take less than 10 minutes (something that takes about 5 seconds on a stock Linux machine).


If you haven't already, I highly recommend reading "I Contribute to the Windows Kernel. We Are Slower Than Other Operating Systems. Here Is Why."

http://blog.zorinaq.com/i-contribute-to-the-windows-kernel-w...


It's awful that something as misleading as that anonymous, superficial rant is held up as important. The only lesson there is pretty meta: you actually can issue a retraction for a rant, but nobody will care.


Not surprised. Note that the Linux results, while looking good compared to what's even worse, are still terrible.

Missing are the results for the BSDs. I'm particularly interested in DragonFly BSD. Maybe I'll try them myself when 5.2 is out, which will be soon.


Awesome work.

That said, the filesystem bench is pointless without:

- SSD type: TLC? SLC? Are they the same or different?

- Linux filesystem type and fstab flags.


SSD types: for the machines that are identical, they are the same

Linux filesystem: stock ext4


Would the Windows-equivalent of fork() be CreateProcess() ?


No, because the semantics aren't the same.

CreateProcess() is like posix_spawn(), or if you prefer fork()/exec().

Windows is a thread-based OS, not process-based, hence the focus on thread performance rather than process creation.


> Windows is a thread-based OS, not process-based, hence the focus on thread performance rather than process creation.

Which, somewhat ironically, leads NT to have worse numbers in the create-thread test than Linux in the create-process one (25.6us vs 18us).

The redeeming factor of NT is their async IO model which afaik is the best among mainstream OS.


IOCP is very complicated to code against, though. kqueue can do nearly all the same things and is both much cleaner and more portable.


It's a very different paradigm to wrap your head around, but once you grok the NT kernel's approach to I/O (packet based IRPs, inherently asynchronous, thread-agnostic), and thread scheduling, I/O completion ports are very powerful constructs.

The key difference is that I/O completion ports can be used to achieve asynchronous I/O on any underlying object, e.g. files and sockets, and they have this nifty built-in concept of concurrency, such that the kernel can ensure there is always one running thread per CPU core (which is optimal from a scheduling perspective).

You can't get asynchronous I/O on regular file descriptors with epoll/kqueue, and you certainly can't say "ensure every core only has one active thread running".

"The key to understanding what makes asynchronous I/O in Windows special is...": https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...

"Thread-agnostic I/O with IOCP": https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...


There is no such thing as a "thread based OS". The statement simply doesn't make sense.

The concepts of processes and threads work just the same in Linux and Windows (and internally just map to the execution unit of the scheduler, together with resource mappings and privileges), and user-space expectations are similar for the two. The main difference is that fork() is not available on Windows, but fork() is a terrible idea anyway.

Fast spawn of processes isn't used for performance critical things on either OS, as process spawning is considered slow on Linux and entirely useless on Windows. Fast spawn of threads is also generally avoided, as even that is usually considered too slow.

Windows is slow at creating processes (and most other things involving the kernel) not because of differences in OS use-case, but simply due to performance apparently not being a priority for Microsoft.


Then you should spend some time educating yourself about such OSes, like Windows.

A thread based OS is an OS where threads are the core unit of execution, and processes are just a kind of execution capsule with one thread executing by default.

The kernel scheduler only understands threads.

This by opposition to process based OSes like UNIX, where there is a clear distinction between a process and thread execution.

The kernel scheduler handles processes and threads separately.

In many UNIX platforms, a process that doesn't perform any thread-related API calls won't have any thread running in its context.

This was quite clear back when UNIX systems were still researching how to adapt threads into the process execution model.

And in many cases the impedance mismatch is still visible in modern UNIX systems, like for example what happens to any given thread when a signal is triggered, or to the whole process when a thread decides to fork.

You can start by getting yourself a copy of "Windows Internals" book.

Here is an old version of "Processes, Threads, and Jobs in the Windows" chapter in the 5th edition.

https://www.microsoftpressstore.com/articles/printerfriendly...


At least on Linux, processes and threads are not that different. You create both with the clone(2) call, which accepts a large number of flags (see https://linux.die.net/man/2/clone ). Depending on the flags you pass, you can get a new thread, or a new process, or something in between. For example, how about a "thread" which has its own memory space (~CLONE_VM)?


I know, but that is very Linux specific and not something that can be generalized in a portable way across UNIX implementations.


You cannot generalize scheduling primitives across UNIX implementations.

To quote the FreeBSD manual: "Traditional UNIX® does not define any API nor implementation for threading, while POSIX® defines its threading API but the implementation is undefined."

macOS was at least temporarily considered a true UNIX, and the scheduling primitive there is a Mach task. I frankly don't remember much about FreeBSD anymore, but I would assume that the unit of scheduling there is somewhat identical to that of Linux... Just implemented nicer.


That is the whole point: Linux != UNIX, yet every time someone discusses some UNIX feature, there comes a Linux example, as if Linux were representative of how UNIX works.


That's not what happened here. In this case, someone discussed Windows, Linux and macOS features and implementation details, and someone (you) came around and started talking about UNIX. ;)


You seem to be distracted by terms rather than their meaning.

The primary difference between Windows and Linux (which is the topic at hand, not other Unixes or esoteric OS's) is in terminology. A Windows "process" is not the same as a Linux "process". A Windows "thread" is not the same as a Linux "thread".

However, a Windows "execution resource" ("thread") is quite identical to a Linux "execution resource" ("process"), or the macOS "execution resource" ("Mach task", not "process" or "thread").

Windows and Linux also have "execution resource groups", in the form of "process" and "thread group"/"parent process", respectively. They are implemented slightly differently (dedicated device vs. "master" execution resource), but the end-result is similar.

These constructs implement identical functionality for all intents and purposes (the differences are just in some minor limitations and API choices). The scheduler only operates on the execution resource, but might look at the execution resource group when making scheduling decisions. This is shared between all the OS's, and is a minor implementation detail that can change between releases.

The distinction between "thread-based" and "process-based" does not exist. Linux is a heck of a lot faster to create "resource groups" than Windows is, but that is due to better code, not fundamental design limitations.

(Of course, an esoteric OS might implement something entirely different from the concepts of "processes" and "threads", but that's a fun discussion for another day—the important thing is the contemporary OS's are all the same.)


Note that in linux, threads are the unit of scheduling, not processes.


Conceptually yes, but note that in Linux there is no such thing as a thread. Only processes. Some of those processes happen to share an address space and belong to a thread-group (TGID), but they are all processes.

This only really matters when you're trying to understand how PID, PPID, and TGID fit together and why there is no TID, though.


True, but Linux and UNIX are not the same thing, even though many keep mixing it up.


Actually, Linux is very explicit about not being a UNIX.

However, the topic would appear to be Windows, Linux and potentially also macOS. That's what the benchmarks are about. No one mentioned other OS's.


Which is why no one should use it as an example how UNIX works.


No one is! You're the one who brought up UNIX. The rest of us are commenting on a benchmark of specifically Windows, Linux and macOS. Only one of those was temporarily considered UNIX, and most disagreed with the label.



