Very cool open release. It's impressive that a 27B model can be as good as much bigger state-of-the-art models (according to their Chatbot Arena table, it's tied with O1-preview and above Sonnet 3.7).
But the example image shows that the model still makes dumb errors and has poor common sense, even though it read all the information correctly.
It seems to have been heavily benchmark-tuned for LMArena. In my own experiments, its factual knowledge was roughly in line with other comparably sized models (like Mistral Small 3), and it was worse than Mistral Small 3 and Phi-4 at STEM problems and logic. In practice it's much worse than Llama 3.3 70B or Mistral Large 2411 in both knowledge and intelligence, even though LMArena ranks it above them.
On every other benchmark, it's significantly behind the typical big models from a year ago (Claude 3.0, Gemini 1.5, GPT 4.0). I suspect Google does extensive LMArena-focused RLHF tuning on its models to juice their scores.