Modern multimodal encoders for LLMs are fine/not lossy since they do not resize to a small size and can handle arbitrary sizes, although some sizes are obviously better represented in the training set. A 8.5" x 11" paper would be common.
I suspect the issue is prompt engineering related.
> Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.
> - Use the top-left coordinate system
> - Values should be percentages of the image width and height (0 to 1)
LLMs have enough trouble with integers (since token-wise integers and text representation of integers are the same), high-precision decimals will be even worse. It might be better to reframe the problem as "this input document is 850 px x 1100 px, return the bounding boxes as integers" then parse and calculate the decimals later.