A sequence of characters is grouped into a "token," and the set of all such possible sequences forms the vocabulary. For example: strawberry -> straw | ber | ry -> 3940, 3231, 1029 -> [vector for each token]. The raw input to the model is therefore not a sequence of characters but a sequence of token embeddings, each a learned vector for a specific chunk of characters. These embeddings contain no explicit information about the individual characters within the token. As a result, if the model needs to reason about characters, for example to count the letters in a word, it must have memorized the character composition of each token. Given that large models like GPT-4 use vocabularies of 100k–200k tokens, it's not surprising that the model hasn't memorized the full character breakdown of every one. I can't imagine that many "character-level" questions exist in the training data.
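You can see this pipeline directly with the tiktoken library (a minimal sketch, assuming tiktoken is installed; the split and IDs in my example above are illustrative, the real ones are whatever the encoding produces):

    # Sketch: word -> token IDs -> byte chunks, using tiktoken.
    # The actual split/IDs depend on the encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding

    ids = enc.encode("strawberry")
    print(ids)  # a short list of integer token IDs

    # Each ID maps to a multi-character chunk, not to single letters:
    for i in ids:
        print(i, enc.decode_single_token_bytes(i))

    # The model never sees these strings; it sees one learned embedding
    # vector per ID, looked up from a table of this size:
    print(enc.n_vocab)  # ~100k entries
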
In contrast, if the model were trained with a character-level vocabulary, where each character maps to a unique token, it would not need to memorize character counts for entire words. Instead, it could potentially learn a generalizable method for counting characters across all sequences, even for words it has never seen before.
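To make that concrete, here's a toy character-level tokenizer (hypothetical, just to illustrate the point, not any real model's vocabulary). With one token per character, the letter count is literally present in the input sequence, so counting reduces to a pattern that could generalize to unseen words:

    # Hypothetical character-level vocabulary: one ID per character.
    vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

    def encode(word):
        return [vocab[ch] for ch in word.lower()]

    ids = encode("strawberry")
    print(ids)                    # one token per character
    print(ids.count(vocab["r"]))  # 3 -- the count is visible in the input
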
I'm not sure what you mean by them not "seeing" the tokens. They definitely receive a representation of each token as input.