you've never fully implemented raft or paxos. it is well known that such distributed consensus libraries can be extremely easy to get started on (a couple of days of coding), but it can take ages to polish one to the point of being production ready. a few things to ask yourself -
1. do you support membership change, checkpoint/snapshot streaming, automated consistency checks, leadership transfer, witnesses, etc?
2. what design decisions did you make to tolerate large latency, e.g. when running your raft/paxos across geographically distributed data centres?
3. have you tried to get your "implementation" to pass any 3rd party test suite? what is that test suite? how complete is it? maybe start with the etcd raft test suite?
4. ever tried to test your "implementation" with Jepsen or at least some monkey tests? can your "implementation" survive a very high churn rate?
5. how about hardware errors? will that lead to data corruption? show me your tests please.
once you feel comfortable with the answers above, think about another quick question - how can you support tens of thousands of raft/paxos groups on a 3-node setup? what changes are required to make it efficient? is extra testing required? when scaling further to many nodes, how do you manage the distribution of that large number of groups?
your "class project" is nothing more than a slightly more advanced "hello world" program - it proves one thing and one thing only - you are capable of reading a simplified paper written in English.
now compare it to a coding interview again, is it still "an order of magnitude less difficult"?
I wrote a raft implementation that is used by quite a few well known systems (etcd, cockroachdb, docker swarmkit, tikv and quite a few closed source systems). It probably meets the "production ready" standard :P.
With that experience, I can say that a real world raft implementation is not easy and is VERY time consuming.
However, preparing for coding interviews is definitely an order of magnitude more difficult for me. It requires me to waste all my time on something meaningless, and makes me feel sick.
it took etcd about 2 years to become mature on its _single_ group raft implementation. the abstraction in raft.go is pretty good, the test suite is the best I can find, message passing and tick handling are handled correctly, but to be honest, everything else (entry management, transport, snapshot streaming, request handling etc) needs to be rewritten or added to scale it further to support multiple raft groups. that is actually what cockroachdb did.
etcd raft is great, I just want to share my understanding of how time consuming it is to write a production ready raft library.
Okay, fair, we didn't do multiple Raft groups or membership changes. If I can find the code, I'll dust it off and give those test suites a try. The project's test suite had a 0MQ broker that could be programmed to selectively drop and delay messages while operating the KV store implemented by the student code. This is pretty similar to Jepsen, though Jepsen might have scenarios that the instructors didn't consider.
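For readers who haven't seen that kind of harness, here is a rough sketch of the idea, not the course's actual broker: an in-memory network that sits between nodes and can be told to drop, delay, or partition traffic. The class `FlakyNetwork` and its parameters are made up for illustration, and it uses a seeded RNG instead of 0MQ so that a failing run can be replayed.

```python
import heapq
import random
from dataclasses import dataclass, field


@dataclass(order=True)
class _Pending:
    deliver_at: int                      # logical tick at which to deliver
    seq: int                             # tie-breaker that keeps send order
    src: str = field(compare=False)
    dst: str = field(compare=False)
    payload: object = field(compare=False)


class FlakyNetwork:
    """Hypothetical test broker that drops, delays, and partitions messages."""

    def __init__(self, seed=0, drop_rate=0.0, max_delay=0):
        self.rng = random.Random(seed)   # seeded so failures are reproducible
        self.drop_rate = drop_rate
        self.max_delay = max_delay
        self.now = 0
        self.seq = 0
        self.queue = []
        self.blocked = set()             # directed (src, dst) pairs

    def partition(self, a, b):
        """Block traffic between a and b in both directions."""
        self.blocked.add((a, b))
        self.blocked.add((b, a))

    def heal(self):
        """Remove all partitions."""
        self.blocked.clear()

    def send(self, src, dst, payload):
        """Queue a message; return False if it was dropped at send time."""
        if (src, dst) in self.blocked:
            return False
        if self.rng.random() < self.drop_rate:
            return False
        delay = self.rng.randint(0, self.max_delay)
        self.seq += 1
        heapq.heappush(
            self.queue,
            _Pending(self.now + delay, self.seq, src, dst, payload),
        )
        return True

    def tick(self):
        """Advance logical time by one and return the messages now due."""
        self.now += 1
        delivered = []
        while self.queue and self.queue[0].deliver_at <= self.now:
            m = heapq.heappop(self.queue)
            delivered.append((m.src, m.dst, m.payload))
        return delivered
```

A test would route every RPC of the code under test through `send`, call `tick` in a loop, and check after each scenario that the KV store's history is still consistent. Note that this sketch only applies partitions at send time, so messages already in flight are still delivered.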
I didn't claim it was easy to implement tens of thousands of Raft groups among the same machines (why do you need that?) or to be resilient to Byzantine failures (if that's what you mean by hardware errors).
I don't think we are fundamentally disagreeing: you argue it's difficult to implement various extensions to Raft, we seem to agree it's not that hard to implement the core of Raft.