I suspect those numbers have nothing to do with the cost of the system creating a thread or a process, but instead are artifacts of how Python handles threads.
When Python forks a process using the multiprocessing module, that process can execute concurrently with the parent process. On a multicore machine, the two can run simultaneously.
When Python spawns a thread, the new thread and the thread that created it cannot execute Python code simultaneously. Each needs to grab the Global Interpreter Lock (GIL) before it can run. Whoever has it can execute. Whoever does not must wait.
So, I suspect that what we are seeing is that even though the new processes/threads have very little work, the processes can exit faster because they don't have to wait for the parent process to give up the GIL. This is a misguided experiment.
Yes, I suspect that you're right; I also thought it might be due to the GIL.
I reran this test with Python 2.7 and it no longer appears to be true:
Spawning 100 children with Thread took 0.03s
Spawning 100 children with Process took 0.28s
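The original test script isn't shown in this thread, so the following is only a guess at its shape, written for Python 3 rather than 2.7. The names `noop` and `time_spawn` are made up for illustration; each child does nothing, so the timing mostly reflects start and join overhead.

```python
import multiprocessing
import threading
import time


def noop():
    """Child body: does nothing, so we mostly measure start/join overhead."""
    pass


def time_spawn(child_cls, count=100):
    """Start `count` children of the given class, wait for all, return seconds."""
    start = time.perf_counter()
    children = [child_cls(target=noop) for _ in range(count)]
    for child in children:
        child.start()
    for child in children:
        child.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # The guard matters for multiprocessing on platforms that spawn
    # rather than fork, since children re-import this module.
    for cls in (threading.Thread, multiprocessing.Process):
        elapsed = time_spawn(cls)
        print("Spawning 100 children with %s took %.2fs" % (cls.__name__, elapsed))
```

Absolute numbers will vary with the OS, the Python version, and the process start method, so treat any single run as a rough data point.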
I'm not sure to what extent the GIL was improved in 2.7, but it's possible that it was never the cause to begin with.
Regardless, I don't think it's a misguided experiment – it was an objective observation. It shows that things aren't so black and white depending on your toolchain.
I think the experiment is misguided for several reasons. One, process/thread creation time is negligible. The general approach is to create worker threads/processes that live for the lifetime of the program. Then you farm work out to them as needed. This separates the concept of "doing work" from the threads or processes that actually execute it.
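A minimal sketch of that long-lived-worker pattern, assuming Python 3's `concurrent.futures`. `handle` and `main` are hypothetical stand-ins for real work and a real program; the point is that the pool is created once and reused for every batch.

```python
from concurrent.futures import ThreadPoolExecutor


def handle(job):
    """Hypothetical unit of work; a real program would do something useful."""
    return job * job


def main():
    # Workers are created once here and reused for every batch below,
    # so creation cost is paid a single time.
    with ThreadPoolExecutor(max_workers=4) as pool:
        first = list(pool.map(handle, range(5)))
        second = list(pool.map(handle, range(5, 10)))
    return first, second


if __name__ == "__main__":
    print(main())
```

Swapping in `ProcessPoolExecutor` gives the same interface backed by processes, which is probably the better fit for CPU-bound pure-Python work.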
Two, threads don't buy you parallelism in Python, unless the majority of the work is being done in C modules.
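One rough way to see this for yourself, assuming a standard CPython build with the GIL: time a pure-Python loop run twice in a row, then run the same two loops in two threads. `count_down`, `time_serial`, and `time_threaded` are illustrative names, not part of the original test.

```python
import threading
import time


def count_down(n):
    """CPU-bound, pure-Python loop; it holds the GIL while it runs."""
    while n > 0:
        n -= 1


def time_serial(n):
    """Run the loop twice, one after the other, and return elapsed seconds."""
    start = time.perf_counter()
    count_down(n)
    count_down(n)
    return time.perf_counter() - start


def time_threaded(n):
    """Run the loop in two threads at once and return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 5000000
    print("serial:   %.2fs" % time_serial(n))
    print("threaded: %.2fs" % time_threaded(n))
```

If the GIL is the limiting factor, the threaded run should not come out meaningfully faster than the serial one, though exact timings depend on the machine and interpreter version.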
Finally, this test is really just testing the multiprocessing and threading packages provided by Python. I say this is misguided because, from the way the author talks about it, I don't think he understands the difference between those abstractions and OS threads and processes. (Which, of course, are abstractions as well.) I suspect the Python overhead will be more than the difference in cost between creating OS-level threads and processes.