Aren't all frontier models already able to use all these languages? Support for specific languages doesn't need to be built in, LLMs support all languages because they are trained on multilingual data.
I keep hearing that LLMs are trained on "Internet crap" but is it true? For instance, we know from the Anthropic copyright case that they scanned millions of books to build a training set. They certainly use Internet content for training, but I'm sure it's curated to a large degree. They don't just scrape random pages and feed them into the LLM.
> I keep hearing that LLMs are trained on "Internet crap" but is it true?
Karpathy repeated this in a recent interview [0]: if you looked at random samples from the pretraining set, you'd mostly see a lot of garbage text, and it's very surprising that it works at all.
The labs have focused a lot more on finetuning (posttraining) and RL lately, and from my understanding that's where all the desirable properties of an LLM are trained into it. Pretraining just teaches the LLM the semantic relations it needs as the foundation for finetuning to work.
Pretraining teaches LLMs everything. SFT and RL are about putting that "everything" into useful configurations and gluing it together so that it works better.
It is true. Datasets are somewhat cleaned, but only somewhat. When you have terabytes worth of text, there's only so much cleaning you can do economically.
We are talking about government-curated data here; the bias should be obvious. Popular LLMs still have huge bias problems, but it would be way worse with only government-curated data.
No, that's not how training works. It's not just about having an example in a given language, but also how many examples there are and the ratio of examples compared to other languages. English hugely eclipses every other language in most US models, and that's why performance in other languages is subpar compared to performance in English.
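As a toy illustration (the token counts below are invented, not real figures from any actual training set), uniform sampling over a corpus means each language's share of what the model sees is just its share of the corpus:

```python
# Hypothetical token counts per language in a web-scale corpus
# (numbers made up purely for illustration).
tokens = {
    "en": 5_000_000_000_000,
    "es": 250_000_000_000,
    "de": 200_000_000_000,
    "cs": 20_000_000_000,
}

total = sum(tokens.values())
for lang, count in tokens.items():
    # With uniform sampling, this share is also roughly the fraction of
    # training examples the model ever sees in that language.
    print(f"{lang}: {count / total:.1%} of the corpus")
```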
My first impulse is to say that some languages have a better SNR on the internet (less autogenerated garbage or SEO content relative to useful information).
I have never noticed any major difference in performance of ChatGPT between English and Spanish. The truth is that as long as the amount of training data of a given language is above some threshold, knowledge transfers between languages.
The issue starts when an LLM transfers knowledge between languages even though that knowledge is not correct in that language. I have seen this with ChatGPT answers regarding laws, for example, where it refers to US laws when asked in German, which are obviously not relevant.
> The issue starts when an LLM transfers knowledge between languages even though that knowledge is not correct in that language. I have seen this with ChatGPT answers regarding laws, for example, where it refers to US laws when asked in German, which are obviously not relevant.
There is no necessary correlation between language and the correct set of laws to reference. The language of the question (or the answer, if for some reason they are not the same) is an issue orthogonal to the intended scope. There is no reason US laws couldn't be relevant to a question asked in German (and, conversely, no reason US laws couldn't be wrong for a question asked in English, even if it was specifically and distinguishably US English).
When you ask an LLM (in German) without further clarifying your location I expect it to refer to German (or Austrian/Swiss) laws.
For most questions it does this pretty well (e.g. asking for the legal drinking age). However, once the answer becomes more complex it starts to hallucinate very quickly. The fact that some of the hallucinations are just translated US laws makes me think that knowledge transfer between languages is probably not helping in instances like this.
Ratio/quantity is important, but quality is even more so.
In recent LLMs, filtered internet text is at the low end of the quality spectrum. The higher end is curated scientific papers, synthetic and rephrased text, RLHF conversations, reasoning CoTs, etc. English/Chinese/Python/JavaScript dominate here.
The issue is that when there's a difference in training data quality between languages, LLMs likely associate that difference with the languages if not explicitly compensated for.
IMO it would be far more impactful to generate and publish high-quality data in minority languages for the current model trainers than to train new models that are simply enriched with a higher percentage of low-quality internet scrapings in those languages.
Training is a very different thing. Can't speak for European languages, but LLMs are often much worse in Japanese because tokenisation uses Unicode and a single Japanese character often has to be represented by more than one token.
I think you meant to say that tokenization is usually done with UTF-8 and a single Japanese character generally takes 3 or more code units (i.e. bytes). Unicode itself is not the culprit (in fact, even with UTF-16 tokenization, most Japanese characters would fit in a single code unit, and the ones that won't are exceedingly rare).
I have to admit I have not encountered significant mistokenization issues in Japanese, but I'm not using LLMs in Japanese on a daily basis. I'm somewhat doubtful this can be a major issue, since frontier LLMs are absolutely in love with emoji, and an emoji requires at least 4 UTF-8 bytes, while most Japanese characters are happy with just 3 bytes.
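For what it's worth, the byte counts are easy to check in plain Python (this only shows UTF-8 encoding, not what any particular tokenizer actually does with those bytes):

```python
# How many UTF-8 bytes each character needs.
for ch in ["a", "é", "日", "本", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"{ch}: {len(encoded)} bytes -> {encoded}")

# a: 1 bytes -> b'a'
# é: 2 bytes -> b'\xc3\xa9'
# 日: 3 bytes -> b'\xe6\x97\xa5'
# 本: 3 bytes -> b'\xe6\x9c\xac'
# 😀: 4 bytes -> b'\xf0\x9f\x98\x80'
```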
As far as I'm aware, LLM capability degrades once you move out of English, and many nation states are either building, or considering the option of building, their own LLMs.
Not natively, they all sound translated in languages other than English. I occasionally come across French people complaining about LLMs' use of non-idiomatic French, but it's probably not a French problem at all, considering that this effort includes so many Indo-European languages.
I can at least also confirm this for German. Here is one example that is quite annoying:
ChatGPT, for example, tends to start emails with "ich hoffe, es geht dir gut!", which means "I hope you are well!". In English (especially American) corporate emails this is a really common way to start an email. In German it is not, as "how are you" isn't a common phrase here.
The term "support" is vague. Can you do basic interaction in most other languages? Sure. Is it anywhere close to the competence it has in English? No. Most models seem to just translate English responses at a beginner's simplistic, monotone level.
If it's publicly available data, books and research, I can assure you the big models have already all been trained on it.
European culture is already embedded in all the models, unless the people involved in this project have some hidden trove of private data that they're training on which diverges drastically from things Europeans have published publicly (I'm 99.9% positive they don't...especially given Europe's alarmist attitude around anything related to data).
I think people don't understand that a huge percentage of the employees at OpenAI, Anthropic, etc. are non-US born.
Meh, it depends a lot on the dataset, which is heavily skewed towards the main languages. For example, they almost always confuse Czech and Slovak and often swap one for the other in the middle of a chat.
You can also bias your sampling so that, when selecting new training instances, each language is chosen with equal probability. Generally the diversity of data is good, unless that data is "wrong", which, ironically, is probably most of the internet, but I digress.
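A minimal sketch of what that could look like (the language pools and document names here are entirely made up; real pipelines weight at the token level and are much more involved):

```python
import random

# Hypothetical per-language document pools, unbalanced the way a raw
# web crawl tends to be.
corpus = {
    "en": [f"en_doc_{i}" for i in range(10_000)],
    "de": [f"de_doc_{i}" for i in range(500)],
    "cs": [f"cs_doc_{i}" for i in range(50)],
    "sk": [f"sk_doc_{i}" for i in range(10)],
}

def sample_balanced(corpus, n):
    """Pick a language uniformly at random, then a document from that
    language, so small languages are not drowned out by English."""
    languages = list(corpus)
    return [random.choice(corpus[random.choice(languages)]) for _ in range(n)]

print(sample_balanced(corpus, 8))
```

(Equal selection of course means the small pools get repeated far more often, which is its own trade-off.)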
The difference is larger than, let's say, just a "dialect". They really are different languages, even though we generally understand each other quite well (younger generations less so). I've heard it's about as different as e.g. Danish and Swedish - not sure if that comparison is helpful.