365 points by lawrenceyan | 1 comment
bob1029 ◴[] No.41880706[source]
https://arxiv.org/abs/2402.04494

> Board states s are encoded as FEN strings which we convert to fixed-length strings of 77 characters where the ASCII-code of each character is one token. A FEN string is a description of all pieces on the board, whose turn it is, the castling availability for both players, a potential en passant target, a half-move clock and a full-move counter. We essentially take any variable-length field in the FEN string, and convert it into a fixed-length sub-string by padding with ‘.’ if needed. We never flip the board; the FEN string always starts at rank 1, even when it is the black’s turn. We store the actions in UCI notation (e.g., ‘e2e4’ for the well-known white opening move). To tokenize them we determine all possible legal actions across games, which is 1968, sort them alphanumerically (case-sensitive), and take the action’s index as the token, meaning actions are always described by a single token (all details in Section A.1).
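
Roughly, the scheme described above looks like this as a loose sketch based only on the quoted paragraph (the per-field padding and the exact enumeration of the 1968 legal moves are simplified here; this is not the paper's code):

    from itertools import product

    FILES = "abcdefgh"
    RANKS = "12345678"
    SQUARES = [f + r for f, r in product(FILES, RANKS)]

    # Crude superset of moves: every from/to square pair. The paper instead
    # enumerates the 1968 UCI moves that are legal in at least one position.
    ALL_UCI_MOVES = sorted(a + b for a in SQUARES for b in SQUARES if a != b)
    ACTION_TO_TOKEN = {m: i for i, m in enumerate(ALL_UCI_MOVES)}

    def encode_board(fen: str, length: int = 77) -> list[int]:
        # The paper pads each variable-length FEN field with '.' to a fixed
        # width; padding the whole string at the end is a simplification.
        padded = fen.ljust(length, ".")
        return [ord(c) for c in padded[:length]]

    def encode_action(uci_move: str) -> int:
        # One token per action: the move's index in the sorted move list.
        return ACTION_TO_TOKEN[uci_move]

    start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
    print(len(encode_board(start)), encode_action("e2e4"))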

I am starting to notice a pattern in these papers: writing hyper-specific tokenizers for the target problem.

How would this model perform if we made a small change to the rules of chess and continued using the same tokenizer? If we find we need to rewrite the tokenizer for every problem variant, then I argue this is just ordinary programming in a very expensive disguise.
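
To make that concrete: the action table is a closed lookup, so any variant move that was never enumerated simply has no token (toy illustration, names made up):

    # Hypothetical closed action table built for standard chess
    # (standing in for the paper's 1968-entry lookup).
    ACTION_TO_TOKEN = {"e2e4": 0, "e7e5": 1, "g1f3": 2}  # ...plus the rest

    # A variant move that was never enumerated, e.g. a crazyhouse drop,
    # has no token at all; the tokenizer must be rebuilt, not just the model retrained.
    print(ACTION_TO_TOKEN.get("P@e4", "out of vocabulary"))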

replies(3): >>41881003 #>>41881108 #>>41881117 #
1. mewpmewp2 ◴[] No.41881117[source]
I don't know much about this space, but it seems like this could be solved by reserving a good number of empty tokens that you only start assigning when new cases arise. Or keep tokens that can be combined to cover edge cases: if every character is a token, you can compose them into anything.
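
A rough sketch of the reserved-token idea, assuming a vocabulary that holds back a block of spare IDs and binds them lazily (my own naming and design, not anything from the paper):

    # Hypothetical vocabulary with reserved spare slots: known actions get
    # fixed IDs, unseen ones claim a spare ID the first time they appear.
    class ReservedVocab:
        def __init__(self, known_actions, num_reserved=256):
            self.token = {a: i for i, a in enumerate(sorted(known_actions))}
            self.next_free = len(self.token)
            self.limit = self.next_free + num_reserved

        def encode(self, action: str) -> int:
            if action not in self.token:
                if self.next_free >= self.limit:
                    raise ValueError("no reserved tokens left")
                self.token[action] = self.next_free  # bind a spare slot lazily
                self.next_free += 1
            return self.token[action]

    vocab = ReservedVocab({"e2e4", "e7e5", "g1f3"})
    print(vocab.encode("e2e4"), vocab.encode("P@e4"))  # second call claims a spare slot

The spare IDs keep the tokenizer stable, but the model has never seen them during training, so new moves would still carry no learned meaning until it is trained on them.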