Language detection in the Bash command line using gzip (2011)

anitil · on Nov 11, 2020

What an interesting thing!

This relies on taking corpora of a fixed length, one per language, appending the unknown string to each and compressing the lot. The corpus that compresses best with the string appended is likely to be the one in the same language.

I have an intuitive understanding of _why_ this works (words and phrases in the new string are likely to appear in the corpus of the same language, so they compress more easily), but I feel like I'm glimpsing something just out of reach.
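
For what it's worth, here is a rough Python sketch of the idea as I understand it. It uses zlib (the same DEFLATE algorithm gzip is built on) instead of the gzip command line, and it compares how many extra bytes the string costs on top of each corpus, which should amount to the same thing as comparing totals when the corpora are the same length. The tiny corpora and the names `compressed_size`, `extra_bytes` and `detect_language` are made up for illustration; real corpora would need to be far larger.

```python
import zlib

# Toy corpora, invented for this sketch. Real ones should be much larger
# and roughly the same length for each language.
CORPORA = {
    "english": (
        "the quick brown fox jumps over the lazy dog and the cat "
        "sat on the mat while the rain fell on the old town "
    ),
    "german": (
        "der schnelle braune fuchs springt in den wald und die katze "
        "liegt auf der matte neben dem faulen hund "
    ),
}


def compressed_size(text: str) -> int:
    """Number of bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus: str, sample: str) -> int:
    """Extra compressed bytes needed once the sample is appended to the corpus."""
    return compressed_size(corpus + sample) - compressed_size(corpus)


def detect_language(sample: str, corpora: dict[str, str]) -> str:
    """Pick the language whose corpus absorbs the sample most cheaply."""
    return min(corpora, key=lambda lang: extra_bytes(corpora[lang], sample))


if __name__ == "__main__":
    for sample in ("the lazy dog sat on the mat", "die katze liegt auf der matte"):
        print(sample, "->", detect_language(sample, CORPORA))
```

The compressor can encode anything it has already seen as a short back-reference, so a string that shares words and phrases with the corpus adds only a few bytes, while an unrelated string has to be spelled out mostly letter by letter.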

Unfortunately I can't find the original lectures, but I'll keep looking.

optimalsolver · on Nov 11, 2020

Of possible interest:

http://mattmahoney.net/dc/dce.html#Section_14

anitil · on Nov 13, 2020

Ooh, thank you!