Aren't all frontier models already able to use all these languages? Support for specific languages doesn't need to be built in, LLMs support all languages because they are trained on multilingual data.
I keep hearing that LLMs are trained on "Internet crap" but is it true? For instance, we know from the Anthropic copyright case that they scanned millions of books to build a training set. They certainly use Internet content for training, but I'm sure it's curated to a large degree. They don't just scrape random pages and feed them into the LLM.
> I keep hearing that LLMs are trained on "Internet crap" but is it true?
Karpathy repeated this in a recent interview [0]: if you looked at random samples from the pretraining set, you'd mostly see a lot of garbage text, and it's very surprising that it works at all.
The labs have focused a lot more on finetuning (posttraining) and RL lately, and from my understanding that's where all the desirable properties of an LLM are trained into it. Pretraining just teaches the LLM the semantic relations it needs as the foundation for finetuning to work.
Pretraining teaches LLMs everything. SFT and RL are about putting that "everything" into useful configurations and gluing it together so that it works better.
It is true. Datasets are somewhat cleaned, but only somewhat. When you have terabytes worth of text, there's only so much cleaning you can do economically.
We are talking about government-curated data here; the bias should be obvious. Popular LLMs still have huge bias problems, but it would be way worse with only government-curated data.
No, that's not how training works. It's not just about having an example in a given language, but also how many examples there are and the ratio of examples compared to other languages. English hugely eclipses every other language in most US models, and that's why performance in other languages is subpar compared to performance in English.
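As a toy illustration (the token counts below are invented, not real figures from any actual training set), uniform sampling over a corpus means each language's share of what the model sees is just its share of the corpus:

```python
# Hypothetical token counts per language in a web-scale corpus
# (numbers made up purely for illustration).
tokens = {
    "en": 5_000_000_000_000,
    "es": 250_000_000_000,
    "de": 200_000_000_000,
    "cs": 20_000_000_000,
}

total = sum(tokens.values())
for lang, count in tokens.items():
    # With uniform sampling, this share is also roughly the fraction of
    # training examples the model ever sees in that language.
    print(f"{lang}: {count / total:.1%} of the corpus")
```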
My first impulse is to say that some languages have a better SNR on the internet (less autogenerated garbage or SEO content relative to useful information).
I have never noticed any major difference in performance of ChatGPT between English and Spanish. The truth is that as long as the amount of training data of a given language is above some threshold, knowledge transfers between languages.
The issue starts when an LLM transfers knowledge between languages even though that knowledge is not correct in that language. I have seen this with ChatGPT answers regarding laws, for example, where it refers to US laws when asked in German, which are obviously not relevant.
> The issue starts when an LLM transfers knowledge between languages even though that knowledge is not correct in that language. I have seen this with ChatGPT answers regarding laws, for example, where it refers to US laws when asked in German, which are obviously not relevant.
There is no necessary correlation between language and the correct set of laws to reference. The language of the question (or the answer, if for some reason they are not the same) is an issue orthogonal to the intended scope. There is no reason US laws couldn't be relevant to a question asked in German (and, conversely, no reason US laws couldn't be wrong for a question asked in English, even if it was specifically and distinguishably US English).
When you ask an LLM (in German) without further clarifying your location I expect it to refer to German (or Austrian/Swiss) laws.
For most questions it does this pretty well (e.g. asking for the legal drinking age). However, once the answer becomes more complex it starts to hallucinate very quickly. The fact that some of the hallucinations are just translated US laws makes me think that knowledge transfer between languages is probably not helping in instances like this.
Ratio/quantity is important, but quality is even more so.
In recent LLMs, filtered internet text is at the low end of the quality spectrum. The higher end is curated scientific papers, synthetic and rephrased text, RLHF conversations, reasoning CoTs, etc. English/Chinese/Python/JavaScript dominate here.
The issue is that when there's a difference in training data quality between languages, LLMs likely associate that difference with the languages if not explicitly compensated for.
IMO it would be far more impactful to generate and publish high-quality data in minority languages for the current model trainers than to train new models that are simply enriched with a higher percentage of low-quality internet scrapings in those languages.
Training is a very different thing. Can't speak for European languages, but LLMs are often much worse in Japanese because tokenisation uses Unicode and a single Japanese character often has to be represented by more than one token.
I think you meant to say that tokenization is usually done with UTF-8 and a single Japanese character generally takes 3 or more code units (i.e. bytes). Unicode itself is not the culprit (in fact, even with UTF-16 tokenization, most Japanese characters would fit in a single code unit, and the ones that won't are exceedingly rare).
I have to admit I have not encountered significant mistokenization issues in Japanese, but I'm not using LLMs in Japanese on a daily basis. I'm somewhat doubtful this can be a major issue, since frontier LLMs are absolutely in love with emoji, and an emoji requires at least 4 UTF-8 bytes, while most Japanese characters are happy with just 3 bytes.
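For what it's worth, the byte counts are easy to check in plain Python (this only shows UTF-8 encoding, not what any particular tokenizer actually does with those bytes):

```python
# How many UTF-8 bytes each character needs.
for ch in ["a", "é", "日", "本", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"{ch}: {len(encoded)} bytes -> {encoded}")

# a: 1 bytes -> b'a'
# é: 2 bytes -> b'\xc3\xa9'
# 日: 3 bytes -> b'\xe6\x97\xa5'
# 本: 3 bytes -> b'\xe6\x9c\xac'
# 😀: 4 bytes -> b'\xf0\x9f\x98\x80'
```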
As far as I'm aware, LLM capability degrades once you move out of English, and many nation states are either building, or considering the option of building, their own LLMs.
Not natively, they all sound translated in languages other than English. I occasionally come across French people complaining about LLMs' use of non-idiomatic French, but it's probably not a French problem at all, considering that this effort includes so many Indo-European languages.
I can at least also confirm this for German. Here is one example that is quite annoying:
ChatGPT, for example, tends to start emails with "ich hoffe, es geht dir gut!", which means "I hope you are well!". In English (especially American) corporate emails this is a really common way to start an email. In German it is not, as "how are you" isn't a common phrase here.
The term "support" is vague. Can you do basic interaction in most other languages? Sure. Is it anywhere close to the competence it has in English? No. Most models seem to just translate English responses at a beginner's simplistic, monotone level.
If it's publicly available data, books and research, I can assure you the big models have already all been trained on it.
European culture is already embedded in all the models, unless the people involved in this project have some hidden trove of private data that they're training on which diverges drastically from things Europeans have published publicly (I'm 99.9% positive they don't...especially given Europe's alarmist attitude around anything related to data).
I think people don't understand that a huge percentage of the employees at OpenAI, Anthropic, etc. are non-US born.
Meh, it depends a lot on the dataset, which is heavily skewed towards the main languages. For example, they almost always confuse Czech and Slovak and often swap one for the other in the middle of a chat.
You can also bias your sampling so that, when selecting new training instances, each language is chosen with equal probability. Generally the diversity of data is good, unless that data is "wrong", which, ironically, is probably most of the internet, but I digress.
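A minimal sketch of what that could look like (the language pools and document names here are entirely made up; real pipelines weight at the token level and are much more involved):

```python
import random

# Hypothetical per-language document pools, unbalanced the way a raw
# web crawl tends to be.
corpus = {
    "en": [f"en_doc_{i}" for i in range(10_000)],
    "de": [f"de_doc_{i}" for i in range(500)],
    "cs": [f"cs_doc_{i}" for i in range(50)],
    "sk": [f"sk_doc_{i}" for i in range(10)],
}

def sample_balanced(corpus, n):
    """Pick a language uniformly at random, then a document from that
    language, so small languages are not drowned out by English."""
    languages = list(corpus)
    return [random.choice(corpus[random.choice(languages)]) for _ in range(n)]

print(sample_balanced(corpus, 8))
```

(Equal selection of course means the small pools get repeated far more often, which is its own trade-off.)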
The difference is larger than, let's say, just a "dialect". They really are different languages, even though we generally understand each other quite well (younger generations less so). I've heard it's about as different as e.g. Danish and Swedish - not sure if that comparison is helpful.