The thing I still don't understand is how DeepSeek built the base model so cheaply, and why their models seem to think they are GPT-4 when asked. This article says the base model comes from their previous paper, but that paper also doesn't make clear what they trained on. The earlier paper is mostly a description of the optimization techniques they applied. It does mention pretraining on 14.8T tokens over 2.7M H800 GPU hours to produce the base DeepSeek-V3, but what were those tokens? The paper describes the corpus only in vague terms.
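For scale, the compute bill implied by those numbers is simple arithmetic. The paper assumes (if I recall correctly) a $2/hour H800 rental rate; taking that rate as an assumption:

    # Back-of-the-envelope pretraining compute cost for base DeepSeek-V3.
    # The $2/GPU-hour H800 rental rate is an assumed figure, not measured.
    gpu_hours = 2.7e6        # H800 GPU hours cited for pretraining
    usd_per_gpu_hour = 2.0   # assumed rental rate
    print(f"~${gpu_hours * usd_per_gpu_hour / 1e6:.1f}M")  # ~$5.4M

That's roughly where the widely quoted ~$5.5M figure comes from, and as I understand it, it covers GPU rental for the final run only, not data, staff, or prior experiments.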
Various other models also think they're ChatGPT or were built by OpenAI, or at least those are the highest-probability tokens when the conversation is about an AI model or an AI company, simply because of their massive prevalence in the training data (i.e. the internet). It isn't the big reveal it's often made out to be.
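You can watch this effect directly by inspecting a model's next-token distribution. A minimal sketch with Hugging Face transformers, using GPT-2 purely as a stand-in (any open model works; one pretrained on post-ChatGPT web text would put far more mass on "OpenAI"):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("I am an AI assistant developed by", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]   # distribution over the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, 5)
    for p, i in zip(top.values, top.indices):
        print(f"{tok.decode(int(i))!r}: {p.item():.3f}")

Whatever names dominate the training text will dominate that list; no self-knowledge is involved.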
Add to that: training off of ChatGPT wouldn't reduce their training costs at all; it would actually increase them. Literally all of the same training difficulty, plus paying OpenAI for an enormous number of API calls. Not really seeing the win.
>The paper describes the corpus only in vague terms.
Anyone who runs a public website has logs absolutely filled by a seemingly endless stream of information aggregators. Just like everyone else, they scraped the entire internet, pulled in all of Wikipedia, etc. Probably lots of pirated books, movie transcripts, and so on.
The fact that training could be done more efficiently intuitively makes sense to everyone in the field; we just hadn't made that leap yet. A human isn't taught to recognize digits by drilling on 60,000 training digits, only to suddenly fail when a real-world digit is slightly rotated or morphed; we're now making the same kind of improvements to how models ingest content.
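To make the digit example concrete: the standard fix for that brittleness is to train on the variations you expect to meet in the wild. A minimal augmentation sketch with torchvision (the rotation and shift ranges are arbitrary choices for illustration, nothing DeepSeek-specific):

    import torchvision.transforms as T
    from torchvision.datasets import MNIST

    # Show each training digit slightly rotated and shifted, so the model
    # sees the kind of variation a human shrugs off automatically.
    train_tf = T.Compose([
        T.RandomRotation(degrees=15),
        T.RandomAffine(degrees=0, translate=(0.1, 0.1)),
        T.ToTensor(),
    ])
    train_set = MNIST(root="data", train=True, download=True, transform=train_tf)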
I imagine it's one of two things: either they used ChatGPT as an oracle to get training data, or it's the radiocarbon problem, where the internet now contains so much ChatGPT output that other models get confused.
Er, how would that reduce the cost? You still need to train the model, which is the expensive bit.
Also, the base model for V3 and the RL-only-tuned R1-Zero are available, and they behave like base models, which seems unlikely if OpenAI outputs had been their primary training data.
It's much more likely that they've consumed the background radiation of the web, where OpenAI contamination is dominant.
Hypothetical question: is the Chinese government capable of exploiting ChatGPT to get around the query limit? For example, by making queries through compromised devices, or even by snooping local traffic on devices? Let's face it, these models are closely aligned with China's national security interests, so it's not a far-fetched question to ask.
You can't distill from GPT-4 because OpenAI conceals the probabilities (and has for a couple of years now, since before GPT-4), presumably to prevent exactly that. You can fine-tune against its output, though. My guess is they used something like OpenOrca or another public dataset that includes GPT-4 output as part of their initial fine-tuning.
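In code terms the difference looks like this: proper distillation needs the teacher's full per-token distribution, which is exactly what the API withholds, while fine-tuning against output needs only the sampled text. A sketch with stand-in tensors:

    import torch
    import torch.nn.functional as F

    vocab = 32000
    student_logits = torch.randn(1, 10, vocab)   # stand-in student outputs

    # (a) Distillation: requires the teacher's probabilities at every
    # position, which the API does not expose.
    teacher_probs = torch.softmax(torch.randn(1, 10, vocab), dim=-1)
    distill_loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                            teacher_probs, reduction="batchmean")

    # (b) Fine-tuning on output: requires only the token ids of the text
    # the API actually returned.
    sampled_ids = torch.randint(0, vocab, (1, 10))
    sft_loss = F.cross_entropy(student_logits.view(-1, vocab),
                               sampled_ids.view(-1))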
How does such a distillation work in theory? They don't have the weights of OpenAI's models and can only call their APIs, right? So how can they actually build off of them?
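My best guess is that it's nothing deeper than collecting text and fine-tuning on it. Entirely schematic, every name below is a placeholder, and only the returned text is ever used, never OpenAI's weights:

    import json

    def collect_pairs(prompts, call_teacher_api):
        # call_teacher_api is a placeholder for whatever API client you
        # use; all you keep is the completion text it returns.
        return [{"prompt": p, "completion": call_teacher_api(p)} for p in prompts]

    # pairs = collect_pairs(my_prompts, my_api_client)
    # with open("sft_data.jsonl", "w") as f:
    #     for ex in pairs:
    #         f.write(json.dumps(ex) + "\n")
    # ...then fine-tune your own model on sft_data.jsonl with ordinary
    # cross-entropy, as in the loss sketch above.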
They fixed that. Now it replies: "Hi! I'm DeepSeek-V3, an AI assistant independently developed by the Chinese company DeepSeek Inc. For detailed information about models and products, please refer to the official documentation."