The fact they can't capitalize on the current trainwreck of GitHub speaks volumes. If they had the right product, people would be throwing money at them.
Most companies are signing up to the idea that GitHub will fix their issues, rather than going through the operational pain of migration. Everyone I know jokes about GH downtime, but has had zero internal talks about migration. Obviously a small data point, but GitLab going this route shows not a lot of people are switching.
I've never actually seen that status page before, and I'm not clear what it's measuring. My company pays for Enterprise Cloud, and we see all the same downtime as what gets posted to https://www.githubstatus.com/
No, Enterprise Cloud is just the same GitHub.com with the same shitty reliability.
However, the newer Enterprise Cloud with Residency (aka on Azure) is a separate partition and has a different reliability domain (still subject to Azure's bad reliability, so not an entirely compelling offer). This is what you linked.
GitLab used to be about as reliable as GitHub (ignoring the security oopses they used to have).
They simply don't have (or didn't have) the skills to scale. They were talking about using Ceph to run things (which gives you an idea of how green their infra team was).
Are you implying they should create more in-house solutions, or that specifically Ceph is not a good solution and there is some other 3rd party solution that could be used instead?
It's slow, large, excessively complex and not that resilient to failure.
You want a bunch of NFS machines backed by ZFS on NVMe, with a central jumping-off point that allows sharding (this is critical so that one or more NFS servers can fuck up without killing access to everything else).
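To make the "central jumping-off point" concrete, here's a minimal sketch, assuming it is just an explicit lookup table from repo to NFS server. The `ShardRouter` class, the server names and the mount layout are all made up for illustration, not anything a particular host actually runs.

```python
from dataclasses import dataclass, field


@dataclass
class ShardRouter:
    """Hypothetical central lookup: repo name -> NFS server holding it."""
    placement: dict = field(default_factory=dict)
    down: set = field(default_factory=set)

    def assign(self, repo: str, server: str) -> None:
        # Placement is explicit: the operator picks the shard, not the filesystem.
        self.placement[repo] = server

    def path_for(self, repo: str) -> str:
        server = self.placement[repo]
        if server in self.down:
            raise ConnectionError(f"{server} is down, {repo} is unavailable")
        # Assumed mount layout, purely illustrative.
        return f"/mnt/{server}/repos/{repo}.git"

    def blast_radius(self, server: str) -> list:
        # Only repos placed on the failed server are affected.
        return sorted(r for r, s in self.placement.items() if s == server)


router = ShardRouter()
router.assign("kernel", "nfs01")
router.assign("website", "nfs02")
router.assign("docs", "nfs02")
router.down.add("nfs01")  # simulate one NFS box falling over

print(router.path_for("website"))    # still reachable, lives on nfs02
print(router.blast_radius("nfs01"))  # only the repos that lived on nfs01
```

The point of the sketch is only that a dead server takes out its own repos and nothing else.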
For parallel read/write access across many thousands of large-ish files (i.e. multiples of the minimum chunk size), I'm sure Ceph does grand.
But for metadata-heavy operations, i.e. git, it's not the FS I would choose. Like Lustre, it can be fast if your workload aligns with its tradeoffs, but high metadata loads are not CephFS's strong point (nor that of many other distributed filesystems).
It's a pattern that works well in VFX. It has the advantage over something like Isilon in that hotspots are isolated to individual servers, not spread across the namespace. So if one of your git stores is being hammered, you can migrate hot/cold repos to other servers fairly simply. Also, if one of the servers dies, it has a limited blast radius.
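A rough sketch of what "migrate hot repos to other servers" could look like as a greedy step. This assumes you already track per-repo load somewhere. The function name, the inputs and the numbers are hypothetical, and the actual data copy (rsync, zfs send/receive, whatever you use) is left out entirely.

```python
def rebalance_step(servers, placement, load):
    """Move the hottest repo off the busiest server onto the quietest one.

    servers:   list of server names
    placement: dict repo -> server (mutated in place on a move)
    load:      dict repo -> ops/sec (hypothetical metric)
    Returns (repo, src, dst), or None if a move would not help.
    """
    totals = {s: 0 for s in servers}
    for repo, server in placement.items():
        totals[server] += load.get(repo, 0)

    src = max(totals, key=totals.get)
    dst = min(totals, key=totals.get)
    if src == dst:
        return None

    candidates = [r for r, s in placement.items() if s == src]
    hottest = max(candidates, key=lambda r: load.get(r, 0))

    # Only move if the destination ends up less loaded than the source was,
    # otherwise we are just relocating the hotspot.
    if totals[dst] + load.get(hottest, 0) >= totals[src]:
        return None

    placement[hottest] = dst
    return hottest, src, dst


servers = ["nfs01", "nfs02", "nfs03"]
placement = {"monorepo": "nfs01", "api": "nfs01", "docs": "nfs02", "website": "nfs03"}
load = {"monorepo": 900, "api": 300, "docs": 50, "website": 20}  # made-up numbers

move = rebalance_step(servers, placement, load)
print(move)
```

Because placement is a table you own, the decision about what moves where stays with you rather than with the filesystem.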
The problem with things like CephFS (and Lustre and, to a lesser extent, GPFS, although it's not entirely comparable) is that the metadata store is your weak link. Ceph scales great if you have loads of large files that you're reading/writing in parallel (i.e. pulling thousands of PAR files or images, videos or binaries); it scales almost linearly with the number of object stores. It also works well when you're writing to the middle of a file (far fucking better than S3-like systems).
Git is a monster metadata eater. Everything git-wise is a metadata lookup. That means that when you are running thousands of concurrent git ops on a distributed filesystem, your object throughput will fall through the floor. So you could have 100 OSDs, all on a 100 gig network with massive NVMe stripes, but your global throughput will be shite because your MDS is the limiting factor. You can add more metadata servers, but then Ceph is choosing how to shard, not you.
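Back-of-envelope version of that argument, as a sketch. The 100 nodes and 100 Gbit links come from the comment above; the MDS ops/sec figure and the bytes moved per metadata op are numbers I made up purely to show the shape of the calculation, not measurements of any real cluster.

```python
def throughput_ceiling(n_nodes, link_gbit, mds_ops_per_sec, bytes_per_md_op):
    """Return (effective, network_bound, metadata_bound) in bytes/sec."""
    # What the object store nodes could push if the network were the only limit.
    network_bound = n_nodes * link_gbit * 1e9 / 8
    # What you actually get if every chunk of data needs a metadata op first.
    metadata_bound = mds_ops_per_sec * bytes_per_md_op
    return min(network_bound, metadata_bound), network_bound, metadata_bound


effective, network, metadata = throughput_ceiling(
    n_nodes=100,             # from the comment above
    link_gbit=100,           # from the comment above
    mds_ops_per_sec=50_000,  # ASSUMPTION: illustrative only
    bytes_per_md_op=16_384,  # ASSUMPTION: small git objects, illustrative only
)

print(f"network bound:  {network / 1e9:.1f} GB/s")
print(f"metadata bound: {metadata / 1e9:.2f} GB/s")
print(f"effective:      {effective / 1e9:.2f} GB/s")
```

With small files the metadata term is the minimum by a wide margin, which is the whole complaint: adding object stores raises the first number and leaves the second untouched.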
Either way, delete a large git repo and all your metadata operations start crawling.
This means that you need to think about optimisations like keeping git inside a tar or some other container that is pulled out, loaded into RAM, operated on, and written back as a binary blob. The result is that your thousands of metadata ops are reduced to two or three, and you're back to being network bound.
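A minimal sketch of the tar trick, assuming the repo is stored on the shared filesystem as one tarball and all the small-file churn happens in local scratch space. `with_unpacked_repo` and `fake_git_op` are hypothetical names; the fake op is a stand-in for real git commands, and locking and concurrent writers are ignored.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def with_unpacked_repo(blob: bytes, operate, scratch_dir=None) -> bytes:
    """Unpack a repo tarball into local scratch, run operate(path), repack.

    scratch_dir could point at a RAM-backed mount (e.g. a tmpfs) if you have one;
    that is an assumption about your machine, not something this code sets up.
    """
    with tempfile.TemporaryDirectory(dir=scratch_dir) as tmp:
        root = Path(tmp)
        if blob:  # an empty blob means "new repo"
            with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
                tar.extractall(root)
        operate(root)  # thousands of small-file ops, all local
        out = io.BytesIO()
        with tarfile.open(fileobj=out, mode="w") as tar:
            tar.add(root, arcname=".")
        return out.getvalue()  # one blob to write back to shared storage


def fake_git_op(repo: Path) -> None:
    # Stand-in for real git work: creates lots of tiny files.
    objects = repo / "objects"
    objects.mkdir(exist_ok=True)
    for i in range(1000):
        (objects / f"{i:04d}").write_text("x")


blob = with_unpacked_repo(b"", fake_git_op)
blob = with_unpacked_repo(
    blob, lambda repo: (repo / "HEAD").write_text("ref: refs/heads/main\n")
)

with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
    names = tar.getnames()
print(sum(1 for n in names if "objects/" in n), "objects in a single blob")
```

From the shared filesystem's point of view each round trip is one read and one write of a single file, regardless of how many files git touched in between.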
I'm not sure there's a lot to capitalize on, considering the state of hosting OSS development. But this really is a case study on watching your biggest competitor face plant into a wall, and responding by breaking into a head first sprint.