Very cool open release. It's impressive that a 27B model can be as good as much bigger state-of-the-art models (according to their Chatbot Arena table, it's tied with O1-preview and above Sonnet 3.7).
But the example image shows that the model still makes dumb errors and has poor common sense, even though it read all the information correctly.
It seems to have been heavily benchmark-tuned for LMArena. In my own experiments, its factual knowledge was roughly in line with other comparably sized models (like Mistral Small 3), and it was worse than Mistral Small 3 and Phi-4 at STEM problems and logic. In practice it's much worse than Llama 3.3 70B or Mistral Large 2411 in both knowledge and intelligence, even though LMArena ranks it above them.
On every other benchmark, it's significantly behind the typical big models from a year ago (Claude 3.0, Gemini 1.5, GPT 4.0). I suspect Google does extensive LMArena-focused RLHF tuning on its models to juice their scores.