It's not just a matter of being encountered first, but of being encountered more often ("encountered first" is a special case of "more often"). But you really need to delve into the proof to figure out why that's enough for optimality. At the very least, _I personally_ don't know of a simple, intuitive way to explain it.
It is important to remember what each compression algorithm is optimal for.
LZ77 and LZ78 are optimal for observable Markov sources -- that is, sources in which, given the x most recent consecutive outputs (where x is the Markov order), the distribution of the next symbol is fixed. While this is rarely exactly the case, it is reasonable to assume that e.g. English text is a 10th-order Markov model (with respect to characters) or a 3rd-order one (with respect to words).

The source you described is NOT Markov overall, although it might be Markov in the tail (past the random blob). Asymptotically, of course, it is Markov :)
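To make "observable Markov source" concrete, here is a toy order-1 sketch in Python (the two-symbol alphabet and the transition probabilities are made up purely for illustration): given only the single most recent output, the distribution of the next symbol is fixed.

```python
import random

# Hypothetical order-1 transition table: the next-symbol distribution
# depends only on the one most recent symbol (the Markov order is 1).
TRANSITIONS = {
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.6, "b": 0.4},
}

def sample(n, state="a", seed=0):
    """Emit n symbols from the toy Markov source, starting in `state`."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        symbols, weights = zip(*TRANSITIONS[state].items())
        state = rng.choices(symbols, weights=weights)[0]
        out.append(state)
    return "".join(out)
```

An order-x source is the same idea with the table keyed on the last x symbols instead of one.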
Hmm... I thought that in LZW, once a sequence has been assigned a code, it can never be assigned a shorter one; codes are assigned in order, and they only get longer over time (until a reset is triggered). So the frequency of a particular sequence cannot influence the length of the code assigned to it. I guess I'll read up on it a bit more before commenting further. I do see that it could turn out to be optimal in the asymptotic case.
You are, of course, correct. However, consider how LZW compresses at all: you cannot reassign the codes for the letters A-Z, yet you still save space by assigning a single (longer) code to a whole sequence, because that one code is still shorter than the individual codes needed to represent the sequence. In the same way, even though a code can never be reassigned, a longer sequence that includes the shorter one will get a new code of its own; it represents more letters and, overall, is more likely to occur than the shorter sequence followed by something else.

Assuming no reset, the shorter codes fall out of use over time, because the longer ones eventually cover everything they represent, extended by all possibilities (with the more probable extensions getting shorter codes by virtue of appearing more often / earlier).
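A minimal LZW encoder sketch makes the dictionary-growth behavior visible (byte-oriented, no reset, codes emitted as plain integers rather than packed bits, so this is illustrative, not a full implementation):

```python
def lzw_encode(data: bytes) -> list[int]:
    # Dictionary starts with all 256 single bytes; new entries are
    # appended in order and never reassigned, so a sequence's code is
    # fixed once granted, but longer sequences keep getting new codes.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    out = []
    seq = b""
    for byte in data:
        candidate = seq + bytes([byte])
        if candidate in table:
            seq = candidate                # keep extending the current match
        else:
            out.append(table[seq])         # emit code for the longest known match
            table[candidate] = next_code   # the longer sequence gets the next code
            next_code += 1
            seq = bytes([byte])
    if seq:
        out.append(table[seq])
    return out
```

Running it on repetitive input like `b"abababab"` yields `[97, 98, 256, 258, 98]`: after the first few symbols, the output uses the newly minted codes for `b"ab"` (256) and `b"aba"` (258) instead of the single-byte codes, which is exactly the sense in which the shorter codes fall out of use.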