Hacker News

> How many human beings do you personally know who were able to solve a dynamic programming problem at first sight without ever having seen anything but greedy algorithms?

Zero, which is why if a trained network could do it, that would be "impressive" to me, given my personal biases.

> If you could get a machine that takes in all of Github and can solve "any" DP problem you describe in natural language with a couple of examples, that is AI above and beyond what many humans can do, which is "awesome" no matter how you put it.

I agree with you that such a machine would be awesome, and AlphaCode is certainly a great step closer towards that ideal. However, I would like a number that measures the "awesomeness" of the machine (not an Elo rating, because that depends on a human reference), so I have a benchmark to refer to when the next improvement arrives.



I understand wanting to look at different metrics to gauge progress, but what is the issue with this?

> not elo rating because that depends on a human reference


The Turing Test (https://en.wikipedia.org/wiki/Turing_test) for artificial intelligence required the machine to convince a human questioner that it was a human. Since then, most AI benchmarks have relied on human performance as the reference point to showcase a method's prowess. I don't find this appealing because:

1) It's an imprecise target: believers can always hype and skeptics can always downplay improvements. Humans can do lots of different things somewhat well at the same time, so a machine beating human-level performance in one field (like identifying digits) says little about other fields (like identifying code vulnerabilities).

2) Elo ratings and similar metrics are measurements of skill, and can be brute-forced to some extent, equivalent to grinding up levels in a video game. Brute-forcing a solution is "bad", but how do we know a new method is "better/more elegant/more efficient"? For algorithms we have Big-O notation, so we know brute force < bubble sort < quick sort; perhaps there is an analogue for machine learning.
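To make the Big-O ordering concrete, here is a toy sketch (my own illustration, not from the comment) that counts comparisons for an O(n^2) bubble sort versus an O(n log n) merge sort on the same input:

```python
import random

def bubble_sort_comparisons(xs):
    """Bubble sort a copy of xs; return the number of comparisons made."""
    a = list(xs)
    count = 0
    for i in range(len(a)):
        for j in range(len(a) - 1 - i):
            count += 1
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return count

def merge_sort(xs):
    """Return (sorted list, number of comparisons made)."""
    if len(xs) <= 1:
        return list(xs), 0
    mid = len(xs) // 2
    left, c1 = merge_sort(xs[:mid])
    right, c2 = merge_sort(xs[mid:])
    merged, count = [], c1 + c2
    i = j = 0
    while i < len(left) and j < len(right):
        count += 1
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, count

random.seed(0)
data = [random.randint(0, 999) for _ in range(512)]
bc = bubble_sort_comparisons(data)        # exactly n*(n-1)/2 comparisons
sorted_data, mc = merge_sort(data)        # roughly n*log2(n) comparisons
print(bc, mc)
```

For n = 512, bubble sort makes over 130,000 comparisons while merge sort needs a few thousand: both produce identical output, so only a cost measure like this, not the result, distinguishes them.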

I would like performance comparisons that focus on quantities unique to machines. I don't evaluate a processor's addition with reference to human addition, so why not treat machine intelligence similarly?

There are many interesting quantities with which we can compare ML models. Energy usage is a popular metric, but we can also compare the structure of the network, the code used, the hardware, the amount of training data, the amount of training time, and the similarity between training and test data. I think a combination of these would be useful to look at every time a new model arrives.


Using my previous chess analogy, the world's smartest chess bot has played a million games to beat the average grandmaster, who has played fewer than 10,000 games in her lifetime. So while they both will have the same Elo rating, which is a measure of how good they are at the narrow domain of chess, there is clearly something superior about how the human grandmaster learns from just a few data points, i.e. strong generalization vs the AI's weak generalization. Hence the task-specific Elo rating does not give enough context to understand how well a model adapts to uncertainty. For instance, a Roomba would beat a human hands down if there were an Elo rating for vacuuming floors.
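This blind spot is visible in the standard Elo update rule itself (a minimal sketch of the usual formula, not anything from AlphaCode): the update depends only on game outcomes, so it never asks how many games, or how much training data, it took to reach a given rating.

```python
def expected_score(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a, r_b, score_a, k=32):
    """A's new rating after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - expected_score(r_a, r_b))

# A 1600-rated player who beats a 2000-rated player gains a large chunk
# of rating, but nothing in the formula distinguishes a human who has
# played 10,000 games from a bot that has played a million.
print(round(update_elo(1600, 2000, 1.0), 1))
```

Both players' ratings move by the same rule regardless of sample efficiency, which is exactly why a task-specific Elo says nothing about strong vs weak generalization.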



