BLOOM isn’t as good as GPT-3 because it doesn’t use as much training data. LLM q...

CuriouslyC · on Jan 24, 2023

As the scale of input data goes up linearly, the scale of commonly observed input patterns goes up logarithmically. If we bumped the scale up an order of magnitude in terms of common input tokens, that still means we could annotate the important part of a 150TB text corpus for 125B worth of human annotation. Given that could break the budget of even large corporations, realistically we'd probably train a model to predict the scores of interest using a fraction of that much human annotation, which would be inferior but still a massive improvement. It is also likely that corporations would team up with indirect competitors to share the cost of annotation and gain an advantage against direct competitors.

rcme · on Jan 24, 2023

How do you figure? Let's say a commonly observed input pattern comprises 1% of training data. For a data set of size N, 0.01 * N examples will contain the pattern. If we increase the size to 2N, 0.01 * 2 * N examples will contain the pattern. Why is the growth logarithmic?

CuriouslyC · on Jan 24, 2023

As the data set size increases, the average frequency of the most frequent non-trivial patterns will go down, and the tail will get much larger. Thus if you had a 1% cutoff, the percentage of the data set hitting this cutoff goes down as the data set size goes up. Take a look at Pareto distributions with high alpha to understand the statistics of it.

Of course, this is only true if new data is distinct from old data. If you just copied your data set 10x and pretended it was a 10x larger data set, it would behave like you expect.

rcme · on Jan 24, 2023

Hmm, I’m still not convinced. Gathering training data can be thought of as sampling the underlying distribution of the data. In that sense, you’d expect the proportions of things to converge towards the underlying distribution as you gather more data.
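A minimal sketch of that sampling argument, assuming every example is drawn independently from one fixed distribution (the very assumption being disputed in this thread). The function name `observed_share`, the 1% rate, and the sample sizes are illustrative choices, not anything from a real data set.

```python
import random

def observed_share(n, true_share=0.01, seed=0):
    """Fraction of n independent draws that contain a pattern with probability true_share."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.random() < true_share)
    return hits / n

# Under i.i.d. sampling the observed share settles near the true 1% as n grows.
shares = {n: observed_share(n) for n in (1_000, 100_000, 1_000_000)}
for n, share in shares.items():
    print(f"n={n:>9,}  observed share={share:.4f}")
```

If the data instead arrives as a series of differently biased sets, this independence assumption no longer holds and the picture can change.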

CuriouslyC · on Jan 25, 2023

That would be true if we were sampling from the underlying distribution in an unbiased and balanced way from the beginning. Instead data is generated and incorporated one set at a time, and each set is biased. Jargon and terms vary, but the language plumbing is the same - new sets bolster common phrases/idioms and lengthen the tail with specific tokens.

Keep in mind though, language isn't a stationary process.

rcme · on Jan 25, 2023

Even if each dataset is biased, I’m still not sure how you derived logarithmic growth from the general notion of bias in data. For instance, assuming the data is biased, perhaps it is biased in the other direction and contains more common patterns compared to the underlying distribution.

CuriouslyC · on Jan 25, 2023

There is a lot to this subject; it might be easier if you took a look at https://martinapugliese.github.io/data/heaps-law-languages/.

Note that when plotting corpus size vs unique words, the log-log plot is expected to be linear.
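For anyone who wants to check this without a real corpus, here is a rough sketch. It assumes a synthetic token stream drawn from a Zipf-like distribution, which is only a stand-in for real text. The names `vocabulary_growth` and `loglog_slope`, the vocabulary size, and the exponent are all illustrative assumptions. It counts unique tokens as the stream grows and fits a straight line to log(unique tokens) against log(corpus size).

```python
import math
import random
from itertools import accumulate

def vocabulary_growth(num_tokens, vocab_size=50_000, exponent=1.1, seed=0):
    """Return (tokens_seen, unique_tokens_seen) pairs for a synthetic Zipf-like stream."""
    rng = random.Random(seed)
    # Weight of the token at a given rank is proportional to 1 / rank**exponent.
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    cum_weights = list(accumulate(weights))
    tokens = rng.choices(range(vocab_size), cum_weights=cum_weights, k=num_tokens)

    seen = set()
    checkpoints = []
    next_checkpoint = 1_000
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i == next_checkpoint:
            checkpoints.append((i, len(seen)))
            next_checkpoint *= 2  # record at 1k, 2k, 4k, ... tokens
    return checkpoints

def loglog_slope(points):
    """Least-squares slope of log(unique tokens) against log(tokens seen)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

points = vocabulary_growth(200_000)
slope = loglog_slope(points)
for n, unique in points:
    print(f"tokens={n:>7,}  unique={unique:>6,}")
print(f"fitted log-log slope: {slope:.2f}")
```

A fitted slope between 0 and 1 means unique tokens keep growing, but more slowly than the corpus itself, which is the sublinear growth that Heaps' law describes.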

rcme · on Jan 25, 2023

Ah, I see what you mean: the number of unique examples increases logarithmically with data size, which kind of makes sense. Language, in this case, follows a power law.

I think your argument is that this means smaller datasets are ok because they contain "most" of what the larger datasets contain. But I think this data-power-rule implies the opposite. ML models can often get to 80-90% accuracy on some task. Unfortunately, these models often aren't that useful because that missing 10% of accuracy matters a lot to users. So what this data-power-rule implies is that, in order to get the last 10% of gains, you need 10x the amount of data.

CuriouslyC · on Jan 25, 2023

Well, to get back to my original point, if we're trying to improve the quality and accuracy of model writing, and we want to do that by adding quality and accuracy scores to short token sequences, the power law distribution means we could get coverage on a significant portion of the data set by scoring just the most frequent sequences that aren't linguistic trivia. We could probably get to 50% average coverage fairly cheaply, and while diminishing returns would kick in and make getting to 80 or 90% much more expensive, at that point we could use a model to estimate the remainder, and have perfectly suitable quality/accuracy scores to condition the model on. The model would output those quality/accuracy scores for the generated token sequence as well, so portions of output that were low quality or of questionable accuracy could be flagged.
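To make the coverage argument concrete, here is a small sketch under a strong assumption: sequence frequencies follow an idealized 1/rank (Zipf-like) curve over 100,000 distinct sequences. The function name `items_needed_for_coverage` and the frequency list are hypothetical; real token-sequence counts would have to come from an actual corpus.

```python
from itertools import accumulate

def items_needed_for_coverage(frequencies, target):
    """Smallest number of top-ranked items whose counts reach `target` of the total."""
    ordered = sorted(frequencies, reverse=True)
    total = sum(ordered)
    for k, running in enumerate(accumulate(ordered), start=1):
        if running >= target * total:
            return k
    return len(ordered)

# Assumed, idealized frequencies: the item at rank r occurs in proportion to 1 / r.
freqs = [1.0 / rank for rank in range(1, 100_001)]

needed = {t: items_needed_for_coverage(freqs, t) for t in (0.5, 0.8, 0.9)}
for target, count in needed.items():
    print(f"{target:.0%} coverage needs the top {count:,} sequences")
```

Under this assumption, the number of sequences that must be scored rises steeply between 50% and 90% coverage, which matches both the "cheap to reach 50%" point and the diminishing-returns objection.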