If your monolithic service OOMs, hits a long GC pause that makes dependent requests time out, locks a shared file descriptor, or any of a dozen other things, then the service as a whole can fault or stall even though other threads/tasks are still executing. Whereas classes of errors like OOMs stop taking the whole system down once the work is split across multiple processes.
A monolith can also scale vertically, with mechanisms to redeploy on fatal errors. If everything starts failing you have a problem, but you get the same problem with a microservice that sits in the critical path.
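To make the "redeploy on fatal errors" part concrete, a rough sketch of the idea (the binary name is made up; in practice this is what systemd, a process manager, or your orchestrator would be doing for you):

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	for {
		// Hypothetical monolith binary; restart it whenever it exits, so a
		// fatal error becomes a brief outage rather than a permanent one.
		cmd := exec.Command("./store-monolith")
		cmd.Stdout = log.Writer()
		cmd.Stderr = log.Writer()

		err := cmd.Run()
		log.Printf("monolith exited (err=%v), restarting in 1s", err)
		time.Sleep(time.Second) // crude backoff to avoid a tight crash loop
	}
}
```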
Networks can have unexpected delays, routing errors and other glitches. At least with a monolith you can usually get a stack trace for debugging. I have seen startups with very limited tracing and logging once they moved to microservices.
When a small startup has to manage "scalable" K8s infrastructure in the cloud, distributed tracing and monitoring are rarely prioritized; you are a team of 5 developers trying to find product-market fit.
I am not against microservices (I work with them daily), but you just trade one type of stability problem for another.
Right, I'm not advocating for one over the other; I was just explaining the issues microservices solve. Now, instead of the OOM killer taking your service down, you have a flaky NIC on another microservice's box and you need to figure out how to degrade gracefully.
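To make "degrade gracefully" concrete, a minimal sketch (the endpoint, payload type and timeout are all made up): bound the call to the recommendations service with a short timeout and fall back to showing nothing, rather than failing the whole page when that box or its network is flaky.

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Recommendation is a made-up payload type for illustration.
type Recommendation struct {
	ProductID string `json:"product_id"`
}

// fetchRecommendations asks the (hypothetical) recommendations service and
// degrades to "no recommendations" on any error or timeout.
func fetchRecommendations(ctx context.Context, userID string) []Recommendation {
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://recommendations.internal/v1/users/"+userID, nil)
	if err != nil {
		return nil
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Timeout, connection reset, DNS failure, flaky NIC on the other box...
		log.Printf("recommendations unavailable, degrading: %v", err)
		return nil
	}
	defer resp.Body.Close()

	var recs []Recommendation
	if err := json.NewDecoder(resp.Body).Decode(&recs); err != nil {
		return nil
	}
	return recs
}

func main() {
	recs := fetchRecommendations(context.Background(), "user-123")
	log.Printf("rendering page with %d recommendations", len(recs))
}
```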
I love working with microservices at the scale of $WORK, but we're Big Tech. I can't imagine why a 5-person startup would want k8s and microservices. You don't need that scale until you have more than 2 teams, and by that point you're at the very least 15 engineers, plus usually sales and marketing staff, to make that investment worth it.
I don't think it was well expressed, but to reuse my last example: the OOM killer ending the recommendations process mid-request is less of a big deal if the main store server can keep running and serving traffic.
If the recommendations team writes code that causes the OOM killer to end their process, making them run it on separate infrastructure insulates the "main store" team from the bugs they write.
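A single-machine analogue of that isolation, as a rough sketch (the worker binary name and the limit are placeholders; prlimit here is the util-linux tool, so this is Linux-only): run the recommendations code as its own process with a hard address-space cap, so a runaway allocation kills that process alone while the store server keeps serving.

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Cap the worker at roughly 512 MiB of address space. If its allocations
	// blow past the limit, it dies alone; the store server in this parent
	// process keeps serving.
	cmd := exec.Command("prlimit", "--as=536870912", "--", "./recommendations-worker")
	cmd.Stdout = log.Writer()
	cmd.Stderr = log.Writer()

	if err := cmd.Run(); err != nil {
		log.Printf("recommendations worker died: %v; store keeps running", err)
	}
}
```

Separate boxes or pods with their own memory limits are the same idea, just with the blast radius drawn at the infrastructure level instead of the process level.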
It was about the OOM killer, as the sibling comment says, yeah. I'm surprised you're so incredulous; the OOM killer and GC stalls are things I've run up against frequently in my career. I'm sorry my comment didn't live up to your expectations; it was hastily typed on mobile.
His point was that the comment was unclear if you'd also read it hastily :-)
I imagine his logic was something like: "How can OOMs happen less often if you run more processes (possibly on the same machine)?", while what your comment actually meant was: "if a specific service is hit by an OOM, with microservices only that microservice goes down, since it's probably running on its own hardware".