Testing Maia's Puzzle Performance

My guess is that 1100s have a higher blunder rate per move than 1500s, but they don't necessarily all make the same blunders. When you play against a Maia trained on millions of games by 1100s, it's a bit as if, on each move, you were playing against a majority vote of all the 1100s. That vote smooths out the bad moves, since not all 1100s make the same mistakes, and you end up with a bot that's inevitably stronger than 1100.

Maybe it would be interesting to look not only at the best move from each Maia, but also at all the candidate moves, and compare them. An 1100-level blunder would be a move that some 1100 might play but that a 1500 wouldn't even consider.
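
Something like the following sketch could flag such moves. The `policy_for` helper and the thresholds are hypothetical, standing in for however the two models' move probabilities are actually queried:

```python
# Hypothetical sketch: flag moves that the 1100 model gives real weight to
# but the 1500 model barely considers. `policy_for` is an assumed helper;
# in practice the Maia networks would be queried through lc0 or similar.

def policy_for(model, fen: str) -> dict[str, float]:
    """Assumed interface: return {move: probability} for the given position."""
    raise NotImplementedError  # depends on how the Maia weights are loaded

def rating_specific_blunders(fen: str, maia_1100, maia_1500,
                             played_thresh: float = 0.05,
                             ignored_thresh: float = 0.01) -> list[str]:
    """Moves some 1100 might play but a 1500 wouldn't even consider."""
    p_low = policy_for(maia_1100, fen)
    p_high = policy_for(maia_1500, fen)
    return [move for move, p in p_low.items()
            if p >= played_thresh and p_high.get(move, 0.0) <= ignored_thresh]
```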
@LauOnChess said in #2:
> My guess is that 1100s have a higher blunder rate per move than 1500s, but they don't necessarily all make the same blunders. When you play against a Maia trained on millions of games by 1100s, it's a bit as if, on each move, you were playing against a majority vote of all the 1100s. That vote smooths out the bad moves, since not all 1100s make the same mistakes, and you end up with a bot that's inevitably stronger than 1100.

Right on the money

Trying to predict the most common move that a class of players would play in a position is quite different from playing like a player of that class.
@LauOnChess said in #2:
> My guess is that 1100s have a higher blunder rate per move than 1500s, but they don't necessarily all make the same blunders. When you play against a Maia trained on millions of games by 1100s, it's a bit as if, on each move, you were playing against a majority vote of all the 1100s. That vote smooths out the bad moves, since not all 1100s make the same mistakes, and you end up with a bot that's inevitably stronger than 1100.
>
> Maybe it would be interesting to look not only at the best move from each Maia, but also at all the candidate moves, and compare them. An 1100-level blunder would be a move that some 1100 might play but that a 1500 wouldn't even consider.

My hope was that the bad moves that are left after the majority vote would then be typical mistakes of players in that class. The smoothing out of the bad moves shouldn't be a problem in itself, especially since the models were trained on blitz games, so there will in any case be many bad moves that only happen because of the short time control.

What really surprised me was that the gap between the 1100 and 1900 models was very small. I didn't think this would happen, since the 1900 model would also filter out the individual bad moves of 1900 players.
You raise very good questions. I have always been concerned about approximating human play, for the learners among us, with a mere error model on top of some notion of best play that we are not even sure is best play for all positions. But even if it were best play, a concern remains.

More specifically, my concern is that the errors would be attributed to a single explanatory variable: the rating level.

That means using one macroscopic variable as the only visible information feeding the model of player behavior on the board, for all possible positions (which is what the engine is doing once it has been trained and turned into an executable bot).

Theories of chess learning, which are rarely revisited, seem to share that as a common-sense assumption as well.

There might be statistical trends in the mistakes, trends that coaches might even be able to identify. But is that enough to build a playing engine on? Does the engine have a notion of better play with an error model on top of it, one that behaves systematically like the average of all players at the same rating?
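
As a purely illustrative sketch (this is the separable hypothesis being questioned here, not a description of how Maia is actually built; Maia is reported to be a single network trained to predict the human move for a rating band), the "best play plus rating-conditioned error" idea could be written like this, with every function a hypothetical placeholder:

```python
# Illustrative only: the separable "best play plus rating-conditioned error"
# model described above. All functions are hypothetical placeholders.

def human_move_distribution(position, rating,
                            best_play_policy, error_policy, error_rate):
    """P(move | position, rating) under the separable hypothesis."""
    eps = error_rate(rating)                 # e.g. larger for lower ratings
    p_best = best_play_policy(position)      # dict: move -> probability
    p_err = error_policy(position, rating)   # dict: move -> probability
    moves = set(p_best) | set(p_err)
    return {m: (1 - eps) * p_best.get(m, 0.0) + eps * p_err.get(m, 0.0)
            for m in moves}
```

The question is whether the position-specific structure of the errors survives such a decomposition, or whether everything gets averaged into the rating variable.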

And that is why I find your going into puzzles a good idea. If only we could also analyze the thematic population dynamics beyond the database available to us (assuming it is frozen in its thematic spectrum; OK, that is already something).

I need to read the blog further.
@jk_182 said in #4:
> the models were trained on blitz games, so there will in any case be many bad moves that only happen because of the short time control.

Exactly: how attributable to the player's learned state or skill set is any given error in a position? Some errors might come from chance factors that have nothing to do with their skill set on average, yet they could still show up as a statistical trend, conditioned on the rating variable, that the error model captures with respect to some best play, independent of the position information.

That is where I am not sure what the error model is about: what does it actually capture? Do you have a better understanding? Or have I already misconceived what the Maias are implementing as a human player model?

I know they have improved the degrees of freedom of that model, but is it still separable in the way I am describing: a given best-play model not informed by the data, with only the error model being learned from it? And then, is it aware of differences in position information, i.e. that there might be sub-classes of specific positions mapping to error types, hidden within the same rating class?

I am not even sure my questions are well-formed, or expressed clearly enough to make sense.

So in the quote I just isolated, it seems you are already asking similar questions, and considering time pressure as another variable. What about the experience trajectory of each individual? If some board-information metric of each player's experience of the position world, across all their past games, were part of the data set, could there not be information there to start building a more humanly diverse model of errors, one that does not lump all human learning trajectories together?

Think of repertoires as sets or subsets of position features. Does every repertoire (or the data effectively analyzed for its position-feature specialization, versus all possible repertoire specializations pooled together) carry the same kind of position-world experience, and could such a position-feature specialization, a position "repertoire", have its own effect on the error model?

Also, why did the Maia team not use the A0 or LC0 learning algorithm itself (or did they?) and hybridize a model learned that way with the human population data, rather than what I had been assuming: starting from a model already trained to "perfection" and adding the Occam's-razor error model we are banging our heads about?
They could even have two error models, if they really insist on those goggles: an evaluation-head error and a policy-head error.
> The team behind the project has also written another paper where they made individualised engines in order to predict moves by specific players. They had very interesting results and I've written about them here.

Sorry, I had forgotten about the individualized models. Thanks for the reminder. I will read what you wrote about it first, then look at the paper again. It might address some of my concerns/questions/musings about sub-classes within the rating-band categories of the model construction. It was not on Lichess (my excuse...).

But I might need to become better informed from that modelling work (the paper and your account of it). On my to-do list, among the many loose ends forever waiting to be ironed out (well, I don't know that yet; I will only know at the end...).
@jk_182 said in #4:
> What really surprised me was that the gap between the 1100 and 1900 models was very small. I didn't think this would happen, since the 1900 model would also filter out the individual bad moves of 1900 players.

If the 1100 player plays 20% bad moves, 70% average moves and 10% good moves, and the 1900 player plays 10% bad moves, 70% average moves and 20% good moves, the 1900 player will play much better. But if you take the most common move in each position, it will usually be the average move, not the bad move or the good move. This is a simplified representation, but it's probably more or less what is happening.
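
A tiny simulation makes the effect concrete (the percentages come from the paragraph above; the moves and the number of distinct blunders are made up):

```python
import random
from collections import Counter

# Toy model: each simulated 1100 plays the "average" move 70% of the time,
# a "good" move 10% of the time, and one of several distinct bad moves the
# remaining 20% (the blunders are spread out, because not everyone makes
# the same mistake).
def modal_move(n_players: int = 10_000, n_bad_moves: int = 5) -> str:
    votes = Counter()
    for _ in range(n_players):
        r = random.random()
        if r < 0.70:
            votes["average"] += 1
        elif r < 0.80:
            votes["good"] += 1
        else:
            votes[f"bad_{random.randrange(n_bad_moves)}"] += 1
    return votes.most_common(1)[0][0]

# Across many positions the majority vote essentially always lands on the
# average move, so a bot playing the modal move never blunders even though
# every individual in the population blunders 20% of the time.
print(Counter(modal_move() for _ in range(200)))  # expect {'average': 200}
```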

The Maia models would probably offer a better representation if, instead of always playing the single most likely move predicted by the model, the move-selection code picked the move to play at random, weighted by the probabilities assigned to each of the top 5 predicted moves.
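
A sketch of what that could look like (the `top_moves` helper is assumed, standing in for however the model's policy output is exposed):

```python
import random

def top_moves(model, fen: str, k: int = 5) -> list[tuple[str, float]]:
    """Assumed helper: the model's k most likely moves with their weights."""
    raise NotImplementedError  # depends on the Maia/lc0 interface being used

def sample_move(model, fen: str, k: int = 5) -> str:
    """Pick among the top k moves at random, weighted by the model's own
    probabilities, instead of always playing the single most likely move."""
    moves, weights = zip(*top_moves(model, fen, k))
    return random.choices(moves, weights=weights, k=1)[0]
```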
Engines generally aren't that good at solving tactics/puzzles; even SF struggles with some puzzles that a human would solve. Some puzzles seem impossible to solve because they're simply too many moves deep and will not be solved by SF even when run on a supercomputer for several hours, but they might be solved by a decent human player who recognizes a repeating pattern or theme/motif.